cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: pause replication activity in a cluster #81953

Open lunevalex opened 2 years ago

lunevalex commented 2 years ago

In #81935 we discuss the prioritization of replication activity at the store level. In the same vein, we should consider manual knobs to disable classes of replication/snapshot activity in a cluster. These should be controlled at the cluster level via one or more settings. For example, an operator might start decommissioning a node, which may cause a latency impact or instability in the cluster. We have seen this happen before, for a variety of reasons and across numerous customers. In this case it would be extremely helpful to have a single universal knob to pause all decommissioning across all nodes.
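To make the idea concrete, here is a rough sketch of a per-class pause gate. It is not existing kvserver code: `ActivityClass`, the three class names, and `PauseGate` are all hypothetical, and the in-memory field stands in for whatever cluster setting would actually drive it.

```go
package main

import "sync"

// ActivityClass is a hypothetical bitmask naming one class of replication
// activity. The classes below are illustrative only; the issue does not
// settle on a list.
type ActivityClass uint32

const (
	Decommission ActivityClass = 1 << iota
	Rebalance
	UpReplicate
)

// PauseGate records which classes are currently paused. In a real cluster
// this state would come from a cluster setting; here it is a plain field
// guarded by a mutex (an assumption made to keep the sketch self-contained).
type PauseGate struct {
	mu     sync.RWMutex
	paused ActivityClass
}

// Pause marks every class in c as paused.
func (g *PauseGate) Pause(c ActivityClass) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.paused |= c
}

// Resume clears the paused mark for every class in c.
func (g *PauseGate) Resume(c ActivityClass) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.paused &^= c
}

// Allowed reports whether none of the classes in c is paused. A replication
// queue would check this before starting new work of that class.
func (g *PauseGate) Allowed(c ActivityClass) bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.paused&c == 0
}
```

With something like this, calling `Pause(Decommission)` would stop new decommission-driven work while `Allowed(Rebalance)` stays true, so pausing one class leaves the others running.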

We should consider having control over the following buckets of replication activity:

Jira issue: CRDB-16313

irfansharif commented 2 years ago

+cc @AlexTalks. For decommission, I wonder if we should make strides towards cancellation ("recommission") being as non-disruptive as possible WRT foreground tail impact + throughput. Arguably decommission should be too, but it's equally unfortunate that canceling inflight decommission attempts (due to observed app impact) will get you into the same regime of impact because we'd still be shuffling snapshots around (except this time, back to the node we were trying to previously decommission).

lunevalex commented 2 years ago

There is already a setting to disable/enable the store rebalancer: https://github.com/cockroachdb/cockroach/blob/0e5927ab972c077ff8e6ad113fbcbb5c2d837a20/pkg/kv/kvserver/store_rebalancer.go#L66