cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: add knob that forces node to shed leases and be unable to re-acquire #116061

Open kvoli opened 11 months ago

kvoli commented 11 months ago

Is your feature request related to a problem? Please describe.

There are scenarios where a node should not hold leases due to impact on the workload, primarily when the node is overloaded or operating on faulty hardware.

In these scenarios, automatic mechanisms may not reliably kick in, so the ability to force leases off the node without draining it is desirable.

Describe the solution you'd like

Introduce a knob (crdb_internal or CLI), which marks the node as unable to hold a lease and also kicks off shedding leases from the node.

Note that relying on the replicate queue to shed leases won't work reliably, because other higher-priority actions (such as up-replication) will be acted upon first. The solution should either piggyback off the range lease drain phase or introduce another mechanism.

Describe alternatives you've considered

Setting an anti-affinity for the node using lease preferences could be a possible solution; however, it runs into the same problem as above: it gets stuck behind other replicate queue actions.

Additional context

(internal) discussion.

Jira issue: CRDB-34424

andrewbaptist commented 11 months ago

This is related to #88007 and #57093

Based on the severity of recent incidents, we should figure out how to do this and backport to 23.1 and 23.2.