cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

kvserver: timeouts can starve snapshots #103879

Open tbg opened 1 year ago

tbg commented 1 year ago

Describe the problem

See internal issue https://github.com/cockroachlabs/support/issues/2327#issuecomment-1562510835, observed on 22.2.9; the code hasn't changed materially on master.

The gist of it is that we saw an instance of a CC cluster where Raft snapshots for certain ranges never got through within the timeout. We compute the timeout from the replica's logical stats, but we do so even when those stats contain estimates, in which case we have no real error bounds on how far off the true size is. We also assume that we are able to send at ~10% of the configured max snapshot rate, which is usually true, but appears not to have been true here (for reasons still unknown).

We ended up allotting timeouts in the 1-2m range to snapshots that clearly needed more time.
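To make the failure mode concrete, here is a minimal sketch of that computation, using hypothetical names and constants rather than the actual identifiers in the tree:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical constants standing in for the relevant cluster settings; the
// names and values are illustrative, not the actual CockroachDB identifiers.
const (
	maxSnapshotRate = 32 << 20 // bytes/sec, e.g. a 32 MiB/s snapshot rate setting
	slowdownFactor  = 10       // assume we only achieve ~10% of the configured rate
)

// sendSnapshotTimeout derives a timeout from the replica's logical size. If
// the stats are estimates, this size (and hence the timeout) can be far too
// small for the data actually being sent.
func sendSnapshotTimeout(replicaSizeBytes int64) time.Duration {
	effectiveRate := float64(maxSnapshotRate) / slowdownFactor
	seconds := float64(replicaSizeBytes) / effectiveRate
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// A replica whose stats claim 512 MiB gets ~2m40s; if the true size is
	// much larger, the snapshot is doomed to time out.
	fmt.Println(sendSnapshotTimeout(512 << 20))
}
```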

We already have a maximum snapshot duration of 1h:

https://github.com/cockroachdb/cockroach/blob/80535e10ccde8effdb3787d474e5e0c962f428f4/pkg/kv/kvserver/replica_command.go#L62-L70

but it might be useful to also introduce a minimum that is by default higher than the guaranteed processing time of 1m:

https://github.com/cockroachdb/cockroach/blob/5c1909bdefc0b4b1238f44334ec66b6d6fc602c8/pkg/kv/kvserver/queue.go#L57-L63
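A minimal sketch of such a clamp, assuming a hypothetical 10 minute floor alongside the existing 1h ceiling (names are illustrative, not actual code in the tree):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical knobs: the 1h maximum exists today, the 10m minimum is the
// proposed addition.
const (
	minSnapshotTimeout = 10 * time.Minute
	maxSnapshotTimeout = time.Hour
)

// clampSnapshotTimeout bounds the size-derived timeout so that replicas with
// badly underestimated stats still get a usable budget.
func clampSnapshotTimeout(sizeBased time.Duration) time.Duration {
	if sizeBased < minSnapshotTimeout {
		return minSnapshotTimeout
	}
	if sizeBased > maxSnapshotTimeout {
		return maxSnapshotTimeout
	}
	return sizeBased
}

func main() {
	fmt.Println(clampSnapshotTimeout(90 * time.Second)) // prints 10m0s
}
```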

Last but not least, context cancellation is perhaps too blunt a mechanism; a keepalive-type approach, where a snapshot is allowed to proceed just as long as it makes progress, would avoid this class of problems altogether.
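Purely as an illustration of the keepalive idea (none of these names exist in the codebase), a watchdog could cancel the context only when the stream stops making progress, rather than at a fixed deadline:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// watchProgress cancels the snapshot only when no progress has been observed
// over a full stallTimeout interval, instead of after a fixed wall-clock
// deadline. The sender would bump *sent after each batch it streams.
func watchProgress(ctx context.Context, cancel context.CancelFunc, sent *atomic.Int64, stallTimeout time.Duration) {
	ticker := time.NewTicker(stallTimeout)
	defer ticker.Stop()
	last := sent.Load()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cur := sent.Load()
			if cur == last {
				cancel() // no progress within the interval: give up on the snapshot
				return
			}
			last = cur
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	var sent atomic.Int64
	go watchProgress(ctx, cancel, &sent, time.Second)

	// Simulate a sender that makes progress for a while and then stalls.
	for i := 0; i < 3; i++ {
		sent.Add(1 << 20)
		time.Sleep(500 * time.Millisecond)
	}
	<-ctx.Done()
	fmt.Println("canceled after stall:", ctx.Err())
}
```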

Jira issue: CRDB-28236

Epic CRDB-39898

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

erikgrinaker commented 1 year ago

I'd be inclined not to try to be fancy here, and just slap a fixed (but configurable) 10 minute timeout on snapshot sending. Either that, or as you say:

> a keepalive-type approach, where a snapshot is allowed to proceed just as long as it makes progress, would avoid this class of problems altogether

tbg commented 1 year ago

We already have a 1 minute minimum budget for the other queues; are you sure we should add a second cluster setting? Maybe we could say that the "allowed slowdown" of 10 also applies to the "min duration", i.e. the snapshot queue gets 10 minutes by default? That would be fairly surgical.
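In other words, something like the following (illustrative values only, not actual code in the tree):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The existing per-queue guaranteed processing time and the assumed
	// snapshot slowdown factor; applying the latter to the former gives the
	// snapshot queue a 10 minute minimum budget by default.
	const guaranteedProcessingTime = time.Minute
	const slowdownFactor = 10
	fmt.Println(slowdownFactor * guaranteedProcessingTime) // 10m0s
}
```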

tbg commented 1 year ago

I realized that the affected customer had this cluster setting set:

    "kv.bulk_io_write.max_rate": {
      "value": "1.0 MiB",
      "type": "z",
      "description": "the rate limit (bytes/sec) to use for writes to disk on behalf of bulk io ops",
      "public": true,
      "last_updated": "2023-05-19T20:54:42.751589Z"
    },

We use the corresponding limiter when writing the snapshot files, so this setting effectively puts a floor on how long a snapshot takes (i.e. a ceiling on the achievable rate). We could factor it in (on the sender), i.e. lower the assumed transfer rate accordingly.
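Extending the earlier timeout sketch (again with hypothetical names and values), the sender could cap its assumed rate at the bulk IO write limit:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch: snapshot files are written through the
// kv.bulk_io_write.max_rate limiter, so the sender's assumed transfer rate
// should not exceed that limit. Names and values are illustrative.
const (
	maxSnapshotRate = 32 << 20 // bytes/sec, the snapshot rate setting
	bulkIOWriteRate = 1 << 20  // bytes/sec, e.g. the "1.0 MiB" override above
	slowdownFactor  = 10
)

// effectiveSnapshotRate caps the assumed send rate at the bulk IO write limit.
func effectiveSnapshotRate() float64 {
	rate := float64(maxSnapshotRate) / slowdownFactor
	if bulkIOWriteRate < rate {
		rate = bulkIOWriteRate
	}
	return rate
}

func sendSnapshotTimeout(replicaSizeBytes int64) time.Duration {
	seconds := float64(replicaSizeBytes) / effectiveSnapshotRate()
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// At 1 MiB/s, a 512 MiB replica needs ~8.5 minutes, well above the 1-2
	// minute budgets observed in the incident.
	fmt.Println(sendSnapshotTimeout(512 << 20))
}
```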

erikgrinaker commented 1 year ago

Assigning to @AlexTalks after discussion last week.

erikgrinaker commented 12 months ago

This is also related to #108519, in that we may want senders to wait while queueing for a snapshot reservation instead of erroring out and retrying later.