cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

allocator: shed leases on extended CPU overload relative cluster mean #127975

Open kvoli opened 1 month ago

kvoli commented 1 month ago

Is your feature request related to a problem? Please describe.

It's rare but not impossible for a subsystem within CRDB to misbehave or behave unexpectedly, pegging process CPU utilization near 100% while the rest of the cluster has spare capacity.

CPU rebalancing attributes kvserver CPU usage to replicas and is sufficient in most situations, as the CPU usage is acted upon by transferring leases anyway. In cases where the CPU usage is not attributed to a replica, CPU rebalancing is insufficient.

Describe the solution you'd like

Add functionality to the allocator which will shed all leases from a node when its

  1. CPU utilization is pegged >90% for an extended duration AND
  2. CPU utilization is more than 1.5x the mean cluster CPU utilization.

Describe alternatives you've considered

Balancing on process CPU utilization, instead of aggregate replica CPU usage, would also solve this problem, although there is still a risk that a node has no leases or replicas to transfer away.

Jira issue: CRDB-40714

Epic CRDB-39952

kvoli commented 1 month ago

I'm going to re-assign this as P3. This isn't something that commonly comes up, and it isn't high enough priority to be completed within 3 months.