cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.9k stars 3.78k forks source link

backupccl: ensure each ExportRequest returns on a range boundary #114229

Closed adityamaru closed 9 months ago

adityamaru commented 10 months ago

When a backup sends ExportRequests it does so with a TargetBytes set to 1 to ensure that only a single SST file is returned in the response. This was also what we subconsciously relied on to limit the request only going to a single range. In #2554 we saw a unique situation where the span covered by the ExportRequest spanned several ranges and the first several ranges were empty and so a single client call led to ~50 DistSender Send calls. Each ExportRequest is given 5mins by default to complete, and this timeout was added under the impression that the request would only hit one range. Granted processing an empty range should be very quick but when you have ~6 concurrent workers sending these requests to a store during a backup you can quickly clog up the concurrent request limiter. This is likely what caused backups to frequently fail in the issue linked above.

When we plan a backup, before executing the flow, we get a span -> node mapping where each span belongs to a single range and does not cross range boundaries. In the backup processor however we lose this per range chunking as we collapse these spans such that they could span range boundaries when sent as an ExportRequest. We should maintain the contract that a single client side ExportRequest should only touch one range, even if that range ends up being empty. I see two options here:

Jira issue: CRDB-33408

blathers-crl[bot] commented 10 months ago

cc @cockroachdb/disaster-recovery

blathers-crl[bot] commented 9 months ago

Hi @dt, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

dt commented 9 months ago

Completed and backported (default off) to 23.1.