cockroachdb / cockroach


kv: model decommissioning/upreplication as a global queue of work #82475

Open · irfansharif opened this issue 2 years ago

irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

Decommissioning is slow.

Background

We're introducing system-wide benchmarks (https://github.com/cockroachdb/cockroach/pull/81565) and improving per-store queueing behaviour (https://github.com/cockroachdb/cockroach/pull/80993 + https://github.com/cockroachdb/cockroach/pull/81005), which will help identify bottlenecks and address one of them. One likely bottleneck is our conservative snapshot rates (https://github.com/cockroachdb/cockroach/issues/14768 + https://github.com/cockroachdb/cockroach/issues/63728), introduced before admission control and chosen conservatively so as not to overwhelm storage nodes; here too we have ideas on how to make these rates more dynamic while still preserving store health (https://github.com/cockroachdb/cockroach/issues/80607 + https://github.com/cockroachdb/cockroach/issues/75066). Another recent body of work is generating snapshots from followers (https://github.com/cockroachdb/cockroach/issues/42491), which gives us more potential sources to upreplicate from during decommissions.

Current structure

High-level view of how decommissioning works (a toy sketch of these steps follows the list):

  1. We flip a bit on the liveness record, marking the node as decommission-ing;
  2. Individual stores learn about this bit flip (the record is gossiped) and, for ranges where they hold the leaseholder, attempt to move a replica away from the decommissioning node to another node;
  3. Once the node being decommissioned has no more replicas, we mark it as fully decommission-ed, excluding it from further participation in the cluster.
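
For illustration only, here's a toy Go sketch of those three steps. All names (`cluster`, `markDecommissioning`, `moveReplicaOff`) are made up for this sketch; the real implementation works through the liveness record, gossip, and the replicate queue rather than anything like this.

```go
package main

import "fmt"

type nodeID int

// cluster is a toy model: which nodes are decommissioning (the liveness bit,
// learned via gossip) and how many replicas each node still holds.
type cluster struct {
	decommissioning map[nodeID]bool
	replicaCount    map[nodeID]int
}

// Step 1: flip the bit on the liveness record.
func (c *cluster) markDecommissioning(n nodeID) { c.decommissioning[n] = true }

// Step 2: a leaseholder store, on seeing the bit, moves a replica off the
// decommissioning node (in reality: by sending a snapshot to a new replica).
func (c *cluster) moveReplicaOff(from, to nodeID) {
	if c.decommissioning[from] && c.replicaCount[from] > 0 {
		c.replicaCount[from]--
		c.replicaCount[to]++
	}
}

// Step 3: once the node holds no replicas, it can be marked decommission-ed.
func (c *cluster) fullyDecommissioned(n nodeID) bool {
	return c.decommissioning[n] && c.replicaCount[n] == 0
}

func main() {
	c := &cluster{
		decommissioning: map[nodeID]bool{},
		replicaCount:    map[nodeID]int{1: 2, 2: 5, 3: 5},
	}
	c.markDecommissioning(1)
	c.moveReplicaOff(1, 2)
	c.moveReplicaOff(1, 3)
	fmt.Println(c.fullyDecommissioned(1)) // true
}
```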

Step (2) is the slowest part. To formalize how long it's going to take: on the send side, the decommission can't finish before the node that has to send the most snapshot bytes is done sending.

This tells us that to go as fast as possible, we want to minimize the snapshot bytes generated by the node sending the maximum number of bytes. For completeness, receiver-side behaviour is symmetric: the decommission can't finish before the node that has to receive the most snapshot bytes is done receiving.

Which tells us we also want to minimize the number of bytes received by the node receiving the maximum number of bytes. The overall decommissioning time is then max(time to receive all snapshots, time to send all snapshots).
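
A rough way to write this model down (a sketch with assumed notation, not the exact formulas from the original issue; it assumes fixed per-node snapshot send/receive rate limits):

$$
T_{\text{send}} = \max_i \frac{B^{\text{sent}}_i}{r_{\text{send}}}, \qquad
T_{\text{recv}} = \max_j \frac{B^{\text{recv}}_j}{r_{\text{recv}}}, \qquad
T_{\text{decommission}} \approx \max\!\left(T_{\text{send}},\; T_{\text{recv}}\right)
$$

where $B^{\text{sent}}_i$ is the snapshot bytes node $i$ has to send, $B^{\text{recv}}_j$ the bytes node $j$ has to receive, and $r_{\text{send}}$, $r_{\text{recv}}$ the per-node send/receive rate limits.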

Proposed structure / the solution you'd like

Looking at the above, today we rely on uncoordinated, per-store snapshot generation, with each store targeting whatever destination it picks and little visibility into receiver-side snapshot queuing. This can have bad tail properties (something https://github.com/cockroachdb/cockroach/pull/81565 could help confirm). I wonder if basic load balancer ideas apply here: we'd maintain a global queue of work to be done (send some snapshot from the set R0 to the least utilized receiver) that every sender can pull from, instead of each store trying to coordinate independently; a rough sketch of such a queue is below. I assume this becomes more pressing once we have more sources for snapshots (i.e. followers).
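
To make the idea concrete, here's a minimal sketch of such a global queue. All names (`snapshotWork`, `globalQueue`, `next`) are hypothetical and the "least outstanding bytes" notion of receiver utilization is just one possible signal; this isn't tied to any existing CockroachDB API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// snapshotWork is one unit of work: upreplicate a range by sending a snapshot
// of roughly `bytes` to some receiver.
type snapshotWork struct {
	rangeID int
	bytes   int64
}

// globalQueue hands out work to senders, pairing each item with the currently
// least-loaded receiver (measured here as outstanding snapshot bytes).
type globalQueue struct {
	mu        sync.Mutex
	pending   []snapshotWork
	recvBytes map[int]int64 // receiver nodeID -> outstanding bytes queued to it
}

// next pops one unit of work and assigns it to the least-utilized receiver.
// Senders (leaseholders, or followers once follower snapshots exist) would
// call this instead of picking destinations independently.
func (q *globalQueue) next() (snapshotWork, int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return snapshotWork{}, 0, false
	}
	w := q.pending[0]
	q.pending = q.pending[1:]

	// Pick the receiver with the fewest outstanding bytes.
	receivers := make([]int, 0, len(q.recvBytes))
	for id := range q.recvBytes {
		receivers = append(receivers, id)
	}
	sort.Slice(receivers, func(i, j int) bool {
		return q.recvBytes[receivers[i]] < q.recvBytes[receivers[j]]
	})
	target := receivers[0]
	q.recvBytes[target] += w.bytes
	return w, target, true
}

func main() {
	q := &globalQueue{
		pending:   []snapshotWork{{1, 512 << 20}, {2, 256 << 20}, {3, 512 << 20}},
		recvBytes: map[int]int64{2: 0, 3: 0, 4: 0},
	}
	for {
		w, target, ok := q.next()
		if !ok {
			break
		}
		fmt.Printf("send snapshot for r%d (%d MiB) to n%d\n", w.rangeID, w.bytes>>20, target)
	}
}
```

One nice property of this shape: pacing and admission stay local to each sender (it only pulls work when it has capacity), while placement decisions become global, so no receiver ends up with a long private backlog that the other senders can't see.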

Additional context

See the linked issues in the Background section. We're also interested in improving observability (https://github.com/cockroachdb/cockroach/issues/74158). One idea is to structure decommissioning as a job: https://github.com/cockroachdb/cockroach/issues/74158#issuecomment-1147685254. In addition to other benefits, a job gives us a natural place to maintain this global queue and orchestrate from.

Jira issue: CRDB-16412

andrewbaptist commented 2 years ago

One thing to note is that the amount of data each node has to send is not uniform. The decommissioning node must send R0/C snapshots, while each other node sends roughly R0((C - 1)/C)/N (R0 is the set of ranges with a replica on the decommissioning node, C is replicas per range, and N is the number of nodes in the system). The intuition is that only ranges overlapping with the decommissioning node need to be moved. Since the node being decommissioned by definition overlaps with every range on it, while every other node overlaps with far fewer of them, the other nodes send a lot less.
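
To make the asymmetry concrete, here's a worked example plugging made-up numbers (not from the issue) into the formulas above: suppose the decommissioning node holds $R_0 = 9{,}000$ ranges, with $C = 3$ and $N = 30$. Then

$$
\underbrace{\frac{R_0}{C}}_{\text{decommissioning node}} = 3{,}000
\qquad\text{vs.}\qquad
\underbrace{\frac{R_0\,(C-1)/C}{N}}_{\text{each other node}} = 200,
$$

i.e. the decommissioning node has to send roughly $N/(C-1) = 15\times$ as many snapshots as any other node, making it the sender-side bottleneck.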

This can be addressed by first running a drain command. After a drain, each node other than the decommissioning node will have more replicas to move; however, they will all have a similar number, roughly R0/(N-1), which is generally much less than the R0/C the decommissioning node would otherwise have to send itself.
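
Continuing the same made-up numbers: after a drain, each of the $N - 1 = 29$ remaining nodes sends roughly

$$
\frac{R_0}{N-1} = \frac{9{,}000}{29} \approx 310
$$

snapshots while the decommissioning node sends none, so the busiest sender drops from roughly $3{,}000$ snapshots to roughly $310$.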

github-actions[bot] commented 6 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!