Open tbg opened 1 year ago
cc @cockroachdb/replication
I'd be inclined to not try to be fancy here, and just slap a fixed (but configurable) 10 minute timeout on snapshot sending. Either that, or as you say
> a keepalive type of approach where a snapshot is allowed to proceed just as long as it makes progress would avoid this class of problems altogether
We already have a 1 minute min budget for the other queues, are you sure we should add a second cluster setting? Maybe we could say that the "allowed slowdown" of 10 also applies to the "min duration", i.e. the snap queue gets 10 minutes by default? That would be fairly surgical.
I realized that the affected customer had this cluster setting set:
"kv.bulk_io_write.max_rate": {
"value": "1.0 MiB",
"type": "z",
"description": "the rate limit (bytes/sec) to use for writes to disk on behalf of bulk io ops",
"public": true,
"last_updated": "2023-05-19T20:54:42.751589Z"
},
We use the corresponding limiter when writing the snapshot files, so this setting effectively caps the write rate (and thus sets a floor on the transfer time). We could factor it in on the sender, i.e. lower the assumed transfer rate accordingly.
Assigning to @AlexTalks after discussion last week.
This is also related to #108519, in that we may want senders to wait while queueing for a snapshot reservation instead of erroring out and retrying later.
Describe the problem
See internal issue https://github.com/cockroachlabs/support/issues/2327#issuecomment-1562510835, observed on 22.2.9 but code hasn't changed materially on master.
The gist of it is that we saw an instance of a CC cluster where raft snaps for certain ranges never got through within the timeout. We compute the timeout based on the replica's logical stats; however, we do this even when the stats contain estimates, in which case we have no real error bounds on how far off we are from the true size. We assume that we are able to send at ~10% of the configured max snapshot rate, which is mostly true, but appears not to have been true here (for reasons still unknown).
We ended up allotting timeouts in the 1-2m range to snapshots that clearly needed more time.
We already have a maximum snapshot duration of 1h:
https://github.com/cockroachdb/cockroach/blob/80535e10ccde8effdb3787d474e5e0c962f428f4/pkg/kv/kvserver/replica_command.go#L62-L70
but it might be useful to also introduce a minimum that is by default higher than the guaranteed processing time of 1m:
https://github.com/cockroachdb/cockroach/blob/5c1909bdefc0b4b1238f44334ec66b6d6fc602c8/pkg/kv/kvserver/queue.go#L57-L63
Last but not least, context cancellation is perhaps too blunt a mechanism; a keepalive-type approach, where a snapshot is allowed to proceed as long as it makes progress, would avoid this class of problems altogether.
Jira issue: CRDB-28236
Epic CRDB-39898