Is your feature request related to a problem? Please describe.
It's rare but not impossible for a subsystem within CRDB to misbehave or behave unexpectedly, pegging process CPU utilization near 100% on one node while the rest of the cluster has spare capacity.
CPU rebalancing attributes kvserver CPU usage to replicas. In most situations this is sufficient, because the attributed CPU usage can be acted upon by transferring leases. When the CPU usage is not attributed to any replica, CPU rebalancing is insufficient.
Describe the solution you'd like
Add functionality to the allocator which will shed all leases from a node when both of the following hold:
Its CPU utilization is pegged >90% for an extended duration, AND
Its CPU utilization is more than 1.5x the mean cluster CPU utilization.
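To make the two conditions above concrete, here is a minimal Go sketch of the check. It is illustrative only and is not existing allocator code: `nodeCPU`, `shouldShedLeases`, and the constants are hypothetical names, and `minPeggedDuration` is a placeholder value, since "extended duration" is not specified in this issue.

```go
package main

import "time"

// Hypothetical thresholds. The first two come from the proposal above;
// minPeggedDuration is an assumed placeholder for "extended duration".
const (
	cpuPeggedThreshold  = 0.90            // utilization above this counts as "pegged"
	clusterMeanMultiple = 1.5             // node must exceed this multiple of the cluster mean
	minPeggedDuration   = 5 * time.Minute // assumption, not specified in the issue
)

// nodeCPU is a hypothetical snapshot of a node's process CPU state.
type nodeCPU struct {
	utilization float64       // process CPU utilization in [0, 1]
	peggedFor   time.Duration // how long utilization has stayed above cpuPeggedThreshold
}

// shouldShedLeases reports whether a node meets both proposed conditions:
// pegged above 90% for an extended duration AND more than 1.5x the mean
// cluster CPU utilization.
func shouldShedLeases(n nodeCPU, clusterMean float64) bool {
	peggedLongEnough := n.utilization > cpuPeggedThreshold && n.peggedFor >= minPeggedDuration
	aboveClusterMean := n.utilization > clusterMeanMultiple*clusterMean
	return peggedLongEnough && aboveClusterMean
}
```

The comparison against the cluster mean keeps a uniformly busy cluster from shedding leases: if every node is near 95%, no node exceeds 1.5x the mean, so nothing is shed.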
Describe alternatives you've considered
Balancing on process CPU utilization, instead of aggregate replica CPU usage, would also solve this problem (there's still a risk that the node has no leases or replicas to transfer away).
I'm going to re-assign this as P3. It isn't something that commonly comes up, and it isn't high enough priority to be completed within 3 months.
Jira issue: CRDB-40714
Epic: CRDB-39952