cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Enhance observability into Admission Control based resource usage. #103684

Open steven-hubbard opened 1 year ago

steven-hubbard commented 1 year ago

Is your feature request related to a problem? Please describe. When alerting on CRDB metrics, it would be useful to see the CPU utilization of the workload alone, excluding what is in use under Admission Control. This graph alarms the customer because it shows 95% CPU utilization (or more), and it removes the ability to alert on CPU usage over 75%, the point at which we generally advise them to increase their cluster size.

Describe the solution you'd like Two separate graphs: one for total CPU utilization and one for foreground CPU utilization.

Describe alternatives you've considered Cap Admission Control utilization at a value between 65% and 75%.

Additional context

Jira issue: CRDB-28133

dikshant commented 1 year ago

@irfansharif do we have metrics to distinguish CPU utilization for different things running under admission control with different priorities? For example, if my workload is high priority and my TTL job is low priority, is there a metric I can look at to determine how much CPU is in use by each of those things?

irfansharif commented 1 year ago

There's something similar: admission_elastic_utilization. That tells you the aggregate amount of elastic CPU being used by backups, changefeed catch-up scans, rangefeed catch-ups, and now, row-level TTL selects. It doesn't distinguish within them, but maybe that's sufficient. I often eyeball this and the vanilla CPU% graphs to figure out the delta (or if using grafana, just plot the delta directly).
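
For anyone who wants to alert on that delta rather than eyeball it, here is a minimal sketch of the arithmetic in Python. It assumes both inputs are fractions in [0, 1] of the same CPU capacity; whether admission_elastic_utilization is normalized the same way as the CPU% graph is an assumption to verify on your cluster. The function names and the 0.75 default (taken from the 75% guidance in the issue description) are illustrative, not part of CockroachDB.

```python
def foreground_cpu(total_cpu: float, elastic_cpu: float) -> float:
    """Estimate foreground CPU as total CPU minus elastic CPU.

    Assumption: both values are fractions in [0, 1] of the same capacity,
    e.g. a sample of the CPU% graph and of admission_elastic_utilization.
    """
    if not (0.0 <= total_cpu <= 1.0 and 0.0 <= elastic_cpu <= 1.0):
        raise ValueError("expected utilization fractions in [0, 1]")
    # Clamp at zero: samples taken at slightly different times can make
    # the elastic reading briefly exceed the total reading.
    return max(0.0, total_cpu - elastic_cpu)


def should_scale_up(total_cpu: float, elastic_cpu: float,
                    threshold: float = 0.75) -> bool:
    """Return True when the estimated foreground CPU exceeds the threshold."""
    return foreground_cpu(total_cpu, elastic_cpu) > threshold


if __name__ == "__main__":
    # Hypothetical sample: 95% total CPU, 30 points of it elastic work.
    print(should_scale_up(0.95, 0.30))  # foreground is about 0.65, below 0.75
```

With this split, a node at 95% total CPU where elastic work accounts for 30 points would not trigger the scale-up alert, while the same total with only 10 points of elastic work would.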