One thing to note is that the number of snapshots each node has to send is not uniform. The decommissioning node must send `R0/C`, while each of the other nodes sends roughly `R0*((C - 1)/C)/N` (where `C` is the number of replicas per range and `N` is the number of nodes in the system). The intuition is that only ranges overlapping with the decommissioning node need to be moved. The decommissioning node by definition overlaps with every range in `R0`, while the other nodes each overlap with far fewer, so they send far less.

This can be addressed by first running a drain command. After a drain, each node other than the decommissioning node has more replicas to move, but they all have a similar number, `R0/(N-1)`, which is generally much less than the `R0/C` that would otherwise have to be sent from the single decommissioning node.
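For a concrete sense of the asymmetry, here's a back-of-the-envelope sketch; the numbers (`R0 = 3000` ranges, `C = 3`, `N = 10`) are made up purely for illustration:

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers, purely to illustrate the asymmetry described above.
	r0 := 3000.0 // ranges with a replica on the decommissioning node
	c := 3.0     // replicas per range (replication factor)
	n := 10.0    // nodes in the system

	// Without a prior drain: the decommissioning node is the sender for ~1/C of
	// its ranges, while each other node only holds a slice of R0's other replicas.
	fromDecommissioningNode := r0 / c
	fromEachOtherNode := r0 * ((c - 1) / c) / n

	// After a drain: senders are spread roughly evenly over the remaining N-1 nodes.
	perNodeAfterDrain := r0 / (n - 1)

	fmt.Printf("no drain:    decommissioning node sends ~%.0f snapshots, others ~%.0f each\n",
		fromDecommissioningNode, fromEachOtherNode)
	fmt.Printf("after drain: each remaining node sends ~%.0f snapshots\n", perNodeAfterDrain)
}
```

With these made-up numbers the decommissioning node sends ~1000 snapshots while the others send ~200 each; after a drain, each remaining node sends ~333.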
Is your feature request related to a problem? Please describe.
Decommissioning is slow.
Background
We're introducing system-wide benchmarks (https://github.com/cockroachdb/cockroach/pull/81565) and improving per-store queueing behaviour (https://github.com/cockroachdb/cockroach/pull/80993 + https://github.com/cockroachdb/cockroach/pull/81005), which will help identify bottlenecks and address one of them. One likely bottleneck is the conservative snapshot rates (https://github.com/cockroachdb/cockroach/issues/14768 + https://github.com/cockroachdb/cockroach/issues/63728), introduced pre-admission-control and chosen conservatively so as not to overwhelm storage nodes; here too we have ideas on how to make these rates more dynamic while still preserving store health (https://github.com/cockroachdb/cockroach/issues/80607 + https://github.com/cockroachdb/cockroach/issues/75066). Another recent body of work is generating snapshots from followers (https://github.com/cockroachdb/cockroach/issues/42491), which for our purposes means more potential sources/choices to upreplicate from during decommissions.
Current structure
High-level view of how decommissioning works:

1. The node is marked as decommissioning.
2. All range replicas on that node are moved elsewhere in the cluster (via snapshots).
3. Once the node holds no replicas, it's marked as fully decommissioned and can be torn down.

Step (2) is the slowest part. To try and formalize how long it's going to take, on the sender side:
- Let `R0` be the set of ranges with replicas on the decommissioning node.
- `R0 = R0_S1 + R0_S2 + …`, where `R0_SN` is the set of ranges with a replica on the decommissioning node and a snapshot sender (not necessarily a leaseholder) on node `N`.
- `time to send all snapshots = max(bytes(R0_S1), …, bytes(R0_SN)) / snapshot send rate` (we could also have per-`R0_SN` send rates).

This tells us that to go as fast as possible, we want to minimize the snapshot bytes generated by the node sending the most bytes. For completeness, the receiver-side behaviour:

- Let `R0` be the set of ranges with replicas on the decommissioning node, snapshots for which need to be received somewhere.
- `R0 = R0_R1 + R0_R2 + …`, where `R0_RN` is the set of ranges with a replica on the decommissioning node that will be moved to node `N` because of the decommission.
- `time to receive all snapshots = max(bytes(R0_R1), …, bytes(R0_RN)) / snapshot receive rate` (we could also have per-`R0_RN` receive rates).

This tells us we want to minimize the bytes received by the node receiving the most bytes. The overall decommissioning time is then `max(time to receive all snapshots, time to send all snapshots)`.
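As a toy worked example of that formula (every byte count and rate below is invented; this is just the arithmetic, not anything the allocator actually computes):

```go
package main

import "fmt"

// maxOf returns the maximum of a non-empty slice.
func maxOf(xs []float64) float64 {
	m := xs[0]
	for _, x := range xs[1:] {
		if x > m {
			m = x
		}
	}
	return m
}

func main() {
	// bytes(R0_S1..R0_S4): snapshot bytes each sender is responsible for.
	// bytes(R0_R1..R0_R3): snapshot bytes each receiver has to ingest.
	// All numbers are hypothetical.
	sendBytes := []float64{40e9, 12e9, 9e9, 11e9} // the 40 GB sender dominates
	recvBytes := []float64{25e9, 24e9, 23e9}

	const sendRate = 32e6 // bytes/sec per sending store (hypothetical rate cap)
	const recvRate = 32e6 // bytes/sec per receiving store

	sendTime := maxOf(sendBytes) / sendRate
	recvTime := maxOf(recvBytes) / recvRate
	total := maxOf([]float64{sendTime, recvTime})

	fmt.Printf("send-side bound: %.0fs, receive-side bound: %.0fs, overall ≈ %.0fs\n",
		sendTime, recvTime, total)
}
```

With these numbers the single 40 GB sender is the bottleneck (~1250s) even though the receive side finishes in ~780s, which is exactly the tail behaviour we'd want to smooth out.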
Proposed structure / the solution you'd like
Looking at the above, we're relying on uncoordinated, per-store snapshot generation targeting whatever destination, with little visibility into receiver-side snapshot queuing. This can have bad tail properties (something https://github.com/cockroachdb/cockroach/pull/81565 perhaps helps confirm). I wonder if basic load-balancer ideas apply here: maintain a global queue of work to be done (send some snapshot from the set `R0` to the least utilized receiver) that every sender can pull from, instead of each sender trying to coordinate independently. I assume this becomes more pressing once we have more sources for snapshots (i.e. followers).
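A minimal sketch of what that global queue could look like, using hypothetical types (`globalQueue`, `snapshotWork`; nothing here corresponds to existing code): a coordinator tracks bytes in flight per receiver and hands each sender the next snapshot aimed at the least utilized receiver.

```go
package main

import (
	"fmt"
	"sync"
)

// snapshotWork is one pending snapshot for a range in R0.
type snapshotWork struct {
	rangeID int
	bytes   int64
}

// globalQueue hands out pending snapshots and tracks how many bytes are in
// flight to each candidate receiver, so senders always target the least
// utilized one. Hypothetical sketch, not an existing API.
type globalQueue struct {
	mu       sync.Mutex
	pending  []snapshotWork
	inFlight map[string]int64 // receiver node -> bytes currently queued/streaming
}

func newGlobalQueue(work []snapshotWork, receivers []string) *globalQueue {
	q := &globalQueue{pending: work, inFlight: make(map[string]int64)}
	for _, r := range receivers {
		q.inFlight[r] = 0
	}
	return q
}

// next pops a pending snapshot and assigns it to the receiver with the fewest
// bytes in flight; ok is false once the queue is drained.
func (q *globalQueue) next() (w snapshotWork, receiver string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return snapshotWork{}, "", false
	}
	w, q.pending = q.pending[0], q.pending[1:]
	for r, b := range q.inFlight {
		if receiver == "" || b < q.inFlight[receiver] {
			receiver = r
		}
	}
	q.inFlight[receiver] += w.bytes
	return w, receiver, true
}

// done records that a snapshot finished streaming to its receiver.
func (q *globalQueue) done(w snapshotWork, receiver string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inFlight[receiver] -= w.bytes
}

func main() {
	work := []snapshotWork{{1, 64 << 20}, {2, 512 << 20}, {3, 128 << 20}, {4, 64 << 20}}
	q := newGlobalQueue(work, []string{"n2", "n3", "n4"})
	for {
		w, recv, ok := q.next()
		if !ok {
			break
		}
		// A real sender would stream the snapshot here and call q.done(w, recv)
		// once it completes; this demo only prints the assignment.
		fmt.Printf("send snapshot for r%d (%d MiB) to %s\n", w.rangeID, w.bytes>>20, recv)
	}
}
```

In a real implementation the `done` signal would fire when the stream finishes, and the queue itself could live wherever the decommission is orchestrated (e.g. the job idea mentioned below).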
Additional context

See linked issues in the Background section. We're also interested in improving observability (https://github.com/cockroachdb/cockroach/issues/74158). One idea here is to do it by structuring decommissioning as a job: https://github.com/cockroachdb/cockroach/issues/74158#issuecomment-1147685254. In addition to other benefits, it gives us a place to maintain this global queue + orchestrate.
Jira issue: CRDB-16412