irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

We've observed incidents (internal link) where a barrage of snapshots after adding a new node be enough to affect foreground latencies (primarily through saturating disk bandwidth through snapshot writes; affecting multi-store clusters most prominently given our per-store rate limits, or due to CPU intensive compaction work on recipient node). We've also seen that keeping that new node in the draining state (not accepting leases) helps reduce foreground latency impact.

Describe the solution you'd like

Avoid transferring leases to new nodes until they're sufficiently "caught up". Maybe starting them off explicitly in the drained state before marking them as ready to hold leases (EDIT: but drained state also prevents upreplication, so maybe that's not usable here, unless we change its semantics like in #87969). Aside: we don't have a way to mark a node as "undrained" (only restarts achieve this); that too would be worth packaging up in the CLI for operators.

Describe alternatives you've considered

80607 tracks the more general mechanism of subjecting snapshot ingestion to admission control, and would make this issue redundant. It's a slightly more involved thing to work through, so this approach is worth trying in the interim.

Additional context

There are other ways to saturate disk bandwidth through snapshots, like decommissioning nodes and having that trigger an unsustainable rate of snapshot ingestions on some recipient node. #80607 will help with that, and is out of the scope here.

Jira issue: CRDB-19565

Epic CRDB-41111

lidorcarmel commented 2 years ago

@irfansharif once #80607 is done (and also with https://github.com/cockroachdb/cockroach/pull/86267), we would not need this change, right?

irfansharif commented 2 years ago

I'm not sure how #86267 is related. But yes, the proximate reason for this issue is the lack of https://github.com/cockroachdb/cockroach/issues/80607 which is more involved (I think) than what this issue proposes.

lidorcarmel commented 2 years ago

86267 is related because with that PR, during decommission, replicas will not prefer the new node (you mentioned decommission above).

My point is that I think the solution to the problem described here should be #80607 and not the one proposed here. Maybe you think we should implement both? it's not clear to me that we should. We can discuss offline.

sumeerbhola commented 1 day ago

affecting multi-store clusters most prominently given our per-store rate limits

Once the work for disk bandwidth and CPU control over incoming range snapshots is complete, the multi-store concerns mentioned above should not affect foreground latency, since we may start allowing too many incoming range snapshots, but will pace them once started.

cockroachdb / cockroach

kvserver: don't transfer leases to new nodes until upreplicated #87847

80607 tracks the more general mechanism of subjecting snapshot ingestion to admission control, and would make this issue redundant. It's a slightly more involved thing to work through, so this approach is worth trying in the interim.

86267 is related because with that PR, during decommission, replicas will not prefer the new node (you mentioned decommission above).