Open · irfansharif opened this issue 2 years ago
@irfansharif once #80607 is done (and also with https://github.com/cockroachdb/cockroach/pull/86267), we would not need this change, right?
I'm not sure how #86267 is related. But yes, the proximate reason for this issue is the lack of https://github.com/cockroachdb/cockroach/issues/80607, which is (I think) more involved than what this issue proposes.
My point is that the solution to the problem described here should be #80607, not the one proposed here. Maybe you think we should implement both? It's not clear to me that we should. We can discuss offline.
> affecting multi-store clusters most prominently given our per-store rate limits
Once the work for disk bandwidth and CPU control over incoming range snapshots is complete, the multi-store concerns quoted above should not affect foreground latency: we may still admit too many incoming range snapshots, but we will pace them once they have started.
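For illustration only, a minimal Go sketch of what "pacing snapshots once started" could look like: each ingestion write waits on a per-store bandwidth budget (here via golang.org/x/time/rate), so even an over-admitted batch of snapshots leaves disk headroom for foreground traffic. The budget, chunking, and function names are assumptions, not CockroachDB's actual snapshot code.

```go
// Hypothetical pacing of snapshot ingestion writes against a per-store
// bandwidth budget. Names and numbers are illustrative only.
package snapshotpacing

import (
	"context"

	"golang.org/x/time/rate"
)

const assumedBytesPerSec = 32 << 20 // assumed per-store budget, not a real cluster setting

// paceSnapshotWrites throttles each snapshot write chunk so that, even if too
// many snapshots were admitted concurrently, their combined write rate stays
// within the budget.
func paceSnapshotWrites(ctx context.Context, chunks <-chan []byte, write func([]byte) error) error {
	limiter := rate.NewLimiter(rate.Limit(assumedBytesPerSec), assumedBytesPerSec)
	for chunk := range chunks {
		// Block until the bandwidth budget allows writing this chunk.
		if err := limiter.WaitN(ctx, len(chunk)); err != nil {
			return err
		}
		if err := write(chunk); err != nil {
			return err
		}
	}
	return nil
}
```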
Is your feature request related to a problem? Please describe.
We've observed incidents (internal link) where a barrage of snapshots after adding a new node was enough to affect foreground latencies, primarily by saturating disk bandwidth with snapshot writes (affecting multi-store clusters most prominently given our per-store rate limits) or through CPU-intensive compaction work on the recipient node. We've also seen that keeping the new node in the draining state (not accepting leases) helps reduce the foreground latency impact.
Describe the solution you'd like
Avoid transferring leases to new nodes until they're sufficiently "caught up". Maybe start them off explicitly in the drained state before marking them as ready to hold leases (EDIT: but the drained state also prevents upreplication, so maybe that's not usable here unless we change its semantics as in #87969). Aside: we don't have a way to mark a node as "undrained" (only a restart achieves this); that too would be worth packaging up in the CLI for operators.
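As a rough illustration of the "don't transfer leases until caught up" idea, here is a hedged Go sketch of a lease-transfer gate. The storeInfo fields, warm-up window, and "caught up" heuristic are hypothetical and do not mirror the actual allocator code.

```go
// Hypothetical gate on lease transfers to recently added or draining stores.
package leasegate

import "time"

type storeInfo struct {
	JoinedAt       time.Time // when the node first joined the cluster
	RangeCount     int       // replicas currently on the store
	MeanRangeCount int       // cluster-wide mean replica count, for comparison
	Draining       bool      // operator-initiated drain: accepts no new leases
}

// canReceiveLease returns false for stores that are draining or still
// catching up: recently joined and holding far fewer replicas than the
// cluster mean, i.e. likely still ingesting a backlog of snapshots.
func canReceiveLease(s storeInfo, now time.Time) bool {
	if s.Draining {
		return false
	}
	const warmup = 10 * time.Minute                 // assumed warm-up window after joining
	caughtUp := s.RangeCount >= s.MeanRangeCount/2  // assumed "caught up" heuristic
	if now.Sub(s.JoinedAt) < warmup && !caughtUp {
		return false
	}
	return true
}
```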
Describe alternatives you've considered
#80607 tracks the more general mechanism of subjecting snapshot ingestion to admission control, and would make this issue redundant. It's a slightly more involved thing to work through, so the approach proposed here is worth trying in the interim.
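For context on what "subjecting snapshot ingestion to admission control" might mean in practice, a minimal hypothetical sketch: gate the start of each incoming snapshot on store overload signals. The signals and thresholds below are assumptions, not what #80607 actually implements.

```go
// Hypothetical admission gate for incoming range snapshots: refuse to start
// new ingestions while the store's LSM or disk is already overloaded.
package snapshotadmission

type storeHealth struct {
	L0Sublevels       int     // compaction-pressure signal
	DiskWriteUtil     float64 // fraction of provisioned write bandwidth in use
	InflightSnapshots int     // snapshots currently being ingested
}

// admitSnapshot reports whether a new incoming snapshot should start now or
// wait until the store regains some headroom.
func admitSnapshot(h storeHealth) bool {
	const (
		maxL0Sublevels = 20  // assumed overload threshold
		maxDiskUtil    = 0.8 // assumed bandwidth headroom target
		maxInflight    = 2   // assumed per-store concurrency cap
	)
	return h.L0Sublevels < maxL0Sublevels &&
		h.DiskWriteUtil < maxDiskUtil &&
		h.InflightSnapshots < maxInflight
}
```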
Additional context
There are other ways to saturate disk bandwidth through snapshots, like decommissioning nodes and having that trigger an unsustainable rate of snapshot ingestion on some recipient node. #80607 will help with that; it's out of scope here.
Jira issue: CRDB-19565
Epic CRDB-41111