stevendanna opened 3 weeks ago
Linking two other known catch-up scan issues here for consolidation: https://github.com/cockroachdb/cockroach/issues/129049 (memory accounting is missing for catch up scan) and https://github.com/cockroachdb/cockroach/issues/125953 (catch up scan not watching for disconnection signal properly).
Describe the problem
This is a high-level tracking issue for confirmed and suspected problems with catch-up scans. Many of the problems listed here are not necessarily bugs, and some have only been speculated about rather than confirmed in production. Items here without linked issues have not yet been confirmed as real-world problems.
Iterator semaphore starvation
Long-tail behaviour caused by MuxRangeFeed usage with many replicas
A single MuxRangeFeed call may produce rangefeed registrations across hundreds to thousands of replicas on a node. When one such registration is in catch-up, it competes with any caught-up registrations both for the outbound stream connection and for resources on the consumer.
This could cut in two different directions:
1) We want to finish the catch-up scan very quickly, since an unfinished catch-up scan is costly. Thus, it might be better for the catch-up scan to have priority over competing updates from the same mux rangefeed call.
2) If any of the caught-up registrations overflow because of a catch-up scan (which may produce datums in a much hotter loop than the caught-up registrations), they will themselves be forced into a catch-up scan. A minimal sketch of this overflow mechanism follows the list.
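To make the overflow mechanism in (2) concrete, here is a minimal sketch of a bounded per-registration event buffer in which a full buffer marks the registration as overflowed. The names (`registration`, `publish`, `errBufferOverflow`) and structure are illustrative assumptions, not the actual rangefeed implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// event stands in for a rangefeed event; the real type is more complex.
type event struct{ key string }

// errBufferOverflow is returned when a registration's buffer fills up. An
// overflowed registration must be disconnected, and the consumer restarts
// with a catch-up scan.
var errBufferOverflow = errors.New("registration buffer overflowed")

// registration is a toy model of a per-consumer event buffer.
type registration struct {
	buf        chan event
	overflowed bool
}

func newRegistration(capacity int) *registration {
	return &registration{buf: make(chan event, capacity)}
}

// publish delivers an event without blocking the producer. If the consumer
// is not draining fast enough (for example because it is busy processing a
// large catch-up scan on another registration), the buffer fills and the
// registration overflows.
func (r *registration) publish(ev event) error {
	if r.overflowed {
		return errBufferOverflow
	}
	select {
	case r.buf <- ev:
		return nil
	default:
		r.overflowed = true
		return errBufferOverflow
	}
}

func main() {
	r := newRegistration(2)
	for i := 0; i < 4; i++ {
		if err := r.publish(event{key: fmt.Sprintf("k%d", i)}); err != nil {
			fmt.Println("event", i, ":", err) // events 2 and 3 overflow
		}
	}
}
```

The key point is that the producer never blocks; once a consumer falls behind, the only recovery path is to disconnect and re-run a catch-up scan.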
Lack of cost-aware queueing for catch-up scans
A catch-up scan with a start timestamp in the recent past is often much cheaper than a catch-up scan starting hours ago. However, all catch-up scans compete for the same semaphore, so a catch-up scan that would have been nearly free to run immediately can grow more expensive while it waits behind costlier requests.
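One possible shape for cost-aware admission is sketched below, under the assumption that cost scales with the age of the scan's start timestamp. The costWeight function and its weights are hypothetical; the current code uses a plain counting semaphore with no cost model:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// costWeight is a hypothetical cost model: the older the catch-up scan's
// start timestamp, the larger the share of the shared budget it consumes.
func costWeight(startTS time.Time) int64 {
	age := time.Since(startTS)
	w := int64(age / time.Minute)
	if w < 1 {
		w = 1
	}
	if w > 100 {
		w = 100
	}
	return w
}

func runCatchUpScan(ctx context.Context, sem *semaphore.Weighted, startTS time.Time) error {
	w := costWeight(startTS)
	// A cheap (recent) scan needs only a small slice of the budget and can
	// often start immediately; an hours-old scan must wait for a big slice.
	if err := sem.Acquire(ctx, w); err != nil {
		return err
	}
	defer sem.Release(w)
	fmt.Printf("running catch-up scan with weight %d\n", w)
	return nil
}

func main() {
	sem := semaphore.NewWeighted(100)
	_ = runCatchUpScan(context.Background(), sem, time.Now().Add(-30*time.Second)) // weight 1
	_ = runCatchUpScan(context.Background(), sem, time.Now().Add(-3*time.Hour))    // weight 100
}
```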
Workload differences between steady state and catch-up scans
When a registration is caught up, the distribution of keys over time is driven by the SQL workload's distribution of updates on the given range. During a catch-up scan, this distribution may change substantially, because catch-up scans iterate linearly through the keys of a range. A consumer that shards or batches work by key may therefore see considerably lower throughput during a catch-up scan.
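As an illustration, a toy consumer that assigns contiguous key spans to worker shards (the shardFor function below is a made-up example, not any particular changefeed sink) sees balanced load in steady state but concentrates nearly all catch-up work on a single shard:

```go
package main

import "fmt"

// shardFor is a hypothetical range-based sharding function: it splits the
// keyspace 'a'..'z' into numShards contiguous spans, roughly the way a
// consumer might assign key spans to workers.
func shardFor(key string, numShards int) int {
	if len(key) == 0 || key[0] < 'a' || key[0] > 'z' {
		return 0
	}
	return int(key[0]-'a') * numShards / 26
}

func main() {
	const numShards = 4

	// Steady state: updates are driven by the SQL workload and tend to be
	// spread across the keyspace, so all shards see work.
	steady := []string{"b1", "h2", "n3", "t4", "z5", "e6"}
	// Catch-up scan: keys are emitted in linear order, so long runs of
	// adjacent keys land on the same shard while the others sit idle.
	catchUp := []string{"a1", "a2", "a3", "a4", "a5", "a6"}

	count := func(keys []string) [numShards]int {
		var c [numShards]int
		for _, k := range keys {
			c[shardFor(k, numShards)]++
		}
		return c
	}
	fmt.Println("steady-state shard load:", count(steady)) // [2 1 2 1]
	fmt.Println("catch-up shard load:    ", count(catchUp)) // [6 0 0 0]
}
```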
Persistent overflow condition during catch-up scan
Note that we have yet to see this in production. However, it is persistently a source of speculation during investigations, and it would be nice to have metrics and/or a fix so that it could be conclusively ruled in or out.
Lack of checkpointing in the case of early exit
Similar to the above, if a catch-up scan exits for any reason, it currently loses all of its progress, resulting in more wasted work.
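A resumable catch-up scan might keep a small checkpoint of the last key it emitted and skip past it on retry. The sketch below is only an illustration of that idea: checkpoint, catchUpScan, and emit are made-up names, and a real catch-up scan works over MVCC data rather than a plain key slice:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// checkpoint records how far a catch-up scan has gotten. Today no such state
// is kept, so an early exit discards all progress; this shows the shape a
// resume token could take.
type checkpoint struct {
	lastEmitted string // last key successfully emitted, "" if none
}

// catchUpScan iterates keys past cp.lastEmitted, emitting each one and
// advancing the checkpoint. keys stands in for the range's sorted key space;
// emit stands in for sending an event to the registration.
func catchUpScan(ctx context.Context, keys []string, cp *checkpoint, emit func(string) error) error {
	for _, k := range keys {
		if k <= cp.lastEmitted {
			continue // already emitted before the previous exit
		}
		if err := ctx.Err(); err != nil {
			return err // disconnected mid-scan; checkpoint preserves progress
		}
		if err := emit(k); err != nil {
			return err
		}
		cp.lastEmitted = k
	}
	return nil
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	cp := &checkpoint{}

	// First attempt fails partway through.
	err := catchUpScan(context.Background(), keys, cp, func(k string) error {
		fmt.Println("emit", k)
		if k == "b" {
			return errors.New("stream disconnected")
		}
		return nil
	})
	fmt.Println("first attempt:", err, "- resume after", cp.lastEmitted)

	// The retry resumes after the checkpoint instead of re-emitting "a";
	// the key whose emit failed is retried, giving at-least-once delivery.
	_ = catchUpScan(context.Background(), keys, cp, func(k string) error {
		fmt.Println("emit", k)
		return nil
	})
}
```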
Catch-up scans aren't memory monitored
Catch-up scans hold long-lived iterators
Other bugs of potential interest:
Jira issue: CRDB-43776