Open erikgrinaker opened 2 years ago
Thanks for filing. Here's the internal thread that prompted this issue.
We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.
Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
This should be the case. Is there evidence that it is not?
> We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.

Well yes, that's what we currently do here:
The problem is that this command has to succeed for every range on every replica in one go (with a few tight retries), otherwise the entire migration fails and has to restart. This can be a problem in large clusters with many ranges (in this case, 400,000 ranges).
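To make the scale problem concrete: if each per-range migrate attempt independently succeeds with probability p, a single pass over all N ranges succeeds with probability p^N. A small sketch (the numbers are illustrative, not measured):

```go
package main

import (
	"fmt"
	"math"
)

// successProb returns the probability that all n independent per-range
// attempts succeed, given per-range success probability p.
func successProb(p float64, n int) float64 {
	return math.Pow(p, float64(n))
}

func main() {
	// Illustrative only: even a 99.99% per-range success rate makes a
	// single-pass migration across 400,000 ranges essentially hopeless,
	// which is why restarting from scratch on any failure is so painful.
	fmt.Printf("%.2e\n", successProb(0.9999, 400000))
}
```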
> Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
>
> This should be the case. Is there evidence that it is not?
No, just something we should verify.
One thing we're sorely lacking is an integration/qualification/acceptance test to run these migrations at very large scales. It could've shaken out some of these low timeouts, for example, and further validated the need for generic checkpointing.
Introduce migration helpers to do automatic batching, checkpointing and retries of such migrations across ranges, to avoid having to implement this in each separate migration. It should also optimistically continue applying it to additional ranges even when one fails.
This is worth doing, perhaps as part of https://github.com/cockroachdb/cockroach/issues/84073.
Introduce knob to control fan out for how many ranges are migrated at a time, to allow operators to speed up migrations with acceptable hit on foreground traffic.
A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here:
> A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here
That linked library should not be particularly difficult to generalize.
Would it make sense to also lower the txn priority inside IterateRangeDescriptors (and possibly the implicit txn used for the MigrateRequest), to reduce the visible contention on meta2 from SQL?
I wouldn't use a transaction at all, and instead scan smaller batches of up-to-date descriptors with individual scan requests. No objection on doing these with low priority, but we specifically don't want a snapshot of meta2, we want fresh data.
Using smaller txns (and lowering their priority if still needed) to fetch batches of range descriptors makes sense.
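The suggested scan pattern, sketched generically (the pagination scheme and batch size are illustrative; in the real code these would be small, low-priority scans over meta2 so each batch sees fresh descriptors rather than one long-lived txn snapshot):

```go
package main

import "fmt"

// scanBatch is a stand-in for one small, low-priority scan request that
// returns up to limit descriptors starting after startKey, plus a key to
// resume from (empty when exhausted). Hypothetical signature.
type scanBatch func(startKey string, limit int) (descs []string, resumeKey string)

// forEachDescriptor fetches descriptors in small batches, so each batch
// reflects current split/merge activity instead of a stale snapshot.
func forEachDescriptor(scan scanBatch, batchSize int, fn func(string)) {
	key := ""
	for {
		descs, resume := scan(key, batchSize)
		for _, d := range descs {
			fn(d)
		}
		if resume == "" {
			return
		}
		key = resume
	}
}

func main() {
	all := []string{"r1", "r2", "r3", "r4", "r5"}
	// A fake in-memory scanner standing in for meta2 lookups.
	scan := func(start string, limit int) ([]string, string) {
		i := 0
		for i < len(all) && all[i] <= start {
			i++
		}
		end := i + limit
		if end >= len(all) {
			return all[i:], ""
		}
		return all[i:end], all[end-1]
	}
	n := 0
	forEachDescriptor(scan, 2, func(string) { n++ })
	fmt.Println(n)
}
```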
I'll apologize for not having done all this originally; this simple thing has caused me much grief. Had I kicked the tires more on realistic cluster sizes (100k+ ranges with ambient split/merge activity), all this would've been much more apparent.
Irfan, using smaller txns with lower priority is not mentioned on the issue erik linked. Mind filing a new followup issue for that (separate) change?
Do you mean this issue? I've added a bullet point here.
thanks!
Since we'll be adding a fair bit of concurrency here to improve throughput for trivial migrations, we should probably integrate this with admission control (AC) somehow as well, to avoid expensive migrations overloading the cluster.
Adding O-support, since we've had several escalations about below-Raft migrations stalling upgrades.
Long-running migrations can send a `MigrateRequest` for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas.

https://github.com/cockroachdb/cockroach/blob/455cdddc6d75c03645f486b22970e5c6198a8d56/pkg/kv/kvserver/replica_write.go#L254-L257
This requirement exists to guarantee that no state machine replicas rely on legacy, unmigrated state. However, it requires all replicas for all ranges in a cluster to be available and up-to-date, with a 5-second timeout before giving up. Any retries are currently left to the migration code itself. For example, `postSeparatedIntentsMigration` uses 5 retries for a given range and then fails the entire migration, having to restart:

https://github.com/cockroachdb/cockroach/blob/4df8ac262e72c58cb4e09dcf1beb0b8ff6fdde27/pkg/migration/migrations/separated_intents.go#L557-L563
This could be improved in several ways:
Jira issue: CRDB-11351
Epic CRDB-39898