cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

cli: improve `debug zip` rangelog collection #89909

Open erikgrinaker opened 2 years ago

erikgrinaker commented 2 years ago

In support escalations, I usually find that the rangelog contained in the debug.zip has already rotated out the interesting bits. We should reconsider the default retention policy here, since the rangelog can be very useful for debugging.

Jira issue: CRDB-20495

Epic CRDB-32134

ajwerner commented 2 years ago

@rafiss this is another opportunity to potentially leverage TTL code.

erikgrinaker commented 2 years ago

Looks like we already have 30 days of retention, which should be plenty:

https://github.com/cockroachdb/cockroach/blob/b8eee78e455ad97f85f6a74c22732d3c5f3ee349/pkg/server/server_systemlog_gc.go#L36-L47

The actual problem is that `debug zip` tends to time out while collecting system.rangelog.txt, and the rangelog.json file is limited to 1000 events:

https://github.com/cockroachdb/cockroach/blob/c7ff5ab70f9200c21df0210399fd5a2ec3719118/pkg/server/admin.go#L1551-L1553

I think the simplest solution might be to extend the timeout for system.rangelog.txt so that we can dump the entire table. rangelog.json would be more work, since we'd have to add pagination. If size becomes an issue, exporting only the last 7 days (rather than the full 30 days) would likely be sufficient.