Open erikgrinaker opened 2 years ago
Thanks for filing. Here's the internal thread that prompted this issue.
We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.
Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
This should be the case. Is there evidence that it is not?
> We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.

Well yes, that's what we currently do here:
The problem is that this command has to succeed for every range on every replica in one go (with a few tight retries), otherwise the entire migration fails and has to restart. This can be a problem in large clusters with many ranges (in this case, 400,000 ranges).
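To make the scale problem concrete: if each per-range migrate attempt independently succeeds with probability p, a single pass over all N ranges succeeds with probability p^N. A small sketch (the numbers are illustrative, not measured):

```go
package main

import (
	"fmt"
	"math"
)

// successProb returns the probability that all n independent per-range
// attempts succeed, given per-range success probability p.
func successProb(p float64, n int) float64 {
	return math.Pow(p, float64(n))
}

func main() {
	// Illustrative only: even a 99.99% per-range success rate makes a
	// single-pass migration across 400,000 ranges essentially hopeless,
	// which is why restarting from scratch on any failure is so painful.
	fmt.Printf("%.2e\n", successProb(0.9999, 400000))
}
```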
> Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
>
> This should be the case. Is there evidence that it is not?
No, just something we should verify.
One thing we're sorely lacking is an integration/qualification/acceptance test to run these migrations at very large scales. It could've shaken out some of these low timeouts, for example, and further validated the need for generic checkpointing.
Introduce migration helpers to do automatic batching, checkpointing and retries of such migrations across ranges, to avoid having to implement this in each separate migration. It should also optimistically continue applying it to additional ranges even when one fails.
This is worth doing, perhaps as part of https://github.com/cockroachdb/cockroach/issues/84073.
Introduce knob to control fan out for how many ranges are migrated at a time, to allow operators to speed up migrations with acceptable hit on foreground traffic.
A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here:
> A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here
That linked library should not be particularly difficult to generalize.
Would it make sense to also lower the txn priority inside IterateRangeDescriptors (and possibly the implicit txn used for the MigrateRequest), to reduce the visible contention on meta2 from SQL?
I wouldn't use a transaction at all, and instead scan smaller batches of up-to-date descriptors with individual scan requests. No objection on doing these with low priority, but we specifically don't want a snapshot of meta2, we want fresh data.
Using smaller txns (and lowering their priority if still needed) to fetch batches of range descriptors makes sense.
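The suggested scan pattern, sketched generically (the pagination scheme and batch size are illustrative; in the real code these would be small, low-priority scans over meta2 so each batch sees fresh descriptors rather than one long-lived txn snapshot):

```go
package main

import "fmt"

// scanBatch is a stand-in for one small, low-priority scan request that
// returns up to limit descriptors starting after startKey, plus a key to
// resume from (empty when exhausted). Hypothetical signature.
type scanBatch func(startKey string, limit int) (descs []string, resumeKey string)

// forEachDescriptor fetches descriptors in small batches, so each batch
// reflects current split/merge activity instead of a stale snapshot.
func forEachDescriptor(scan scanBatch, batchSize int, fn func(string)) {
	key := ""
	for {
		descs, resume := scan(key, batchSize)
		for _, d := range descs {
			fn(d)
		}
		if resume == "" {
			return
		}
		key = resume
	}
}

func main() {
	all := []string{"r1", "r2", "r3", "r4", "r5"}
	// A fake in-memory scanner standing in for meta2 lookups.
	scan := func(start string, limit int) ([]string, string) {
		i := 0
		for i < len(all) && all[i] <= start {
			i++
		}
		end := i + limit
		if end >= len(all) {
			return all[i:], ""
		}
		return all[i:end], all[end-1]
	}
	n := 0
	forEachDescriptor(scan, 2, func(string) { n++ })
	fmt.Println(n)
}
```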
I'll apologize for not having done all this originally; this simple thing has caused me much grief. Had I kicked the tires more on realistic cluster sizes (100k+ ranges with ambient split/merge activity), all this would've been much more apparent.
Irfan, using smaller txns with lower priority is not mentioned on the issue erik linked. Mind filing a new followup issue for that (separate) change?
Do you mean this issue? I've added a bullet point here.
thanks!
Since we'll be adding a fair bit of concurrency here to improve throughput for trivial migrations, we should probably integrate this with admission control (AC) somehow as well, to avoid expensive migrations overloading the cluster.
Adding O-support, since we've had several escalations about below-Raft migrations stalling upgrades.
Long-running migrations can send a `MigrateRequest` for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas.

https://github.com/cockroachdb/cockroach/blob/455cdddc6d75c03645f486b22970e5c6198a8d56/pkg/kv/kvserver/replica_write.go#L254-L257
This requirement exists to guarantee that no state machine replicas rely on legacy, unmigrated state. However, it requires all replicas for all ranges in a cluster to be available and up-to-date, with a 5-second timeout before giving up. Any retries are currently left to the migration code itself. For example, `postSeparatedIntentsMigration` uses 5 retries for a given range and then fails the entire migration, having to restart:

https://github.com/cockroachdb/cockroach/blob/4df8ac262e72c58cb4e09dcf1beb0b8ff6fdde27/pkg/migration/migrations/separated_intents.go#L557-L563
This could be improved in several ways:
Jira issue: CRDB-11351
Epic CRDB-39898