cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Historical CPU profiles #91299

Closed kevinkokomani closed 1 month ago

kevinkokomani commented 2 years ago

Is your feature request related to a problem? Please describe.

There is some notion of historical heap profiles and goroutine dumps in the debug zip, gathered at certain times (I believe when goroutine count or memory consumption rises past a certain proportional threshold). CPU profiles, by contrast, are only collected manually: either via the DB Console endpoint, or at the moment a debug zip is taken.

Since CPU spikes are often short, we would like profiles to be gathered automatically when a spike occurs. The only other option today is to actively monitor the CPU percentage until a spike begins, and then quickly navigate to the correct place in the DB Console or run the curl. Because spike windows can be so short, this is not guaranteed to succeed either.

Currently, we lack observability into CPU consumption for short spikes, whereas sustained CPU increases are more easily observable.

Describe the solution you'd like

Something that automatically collects CPU profiles and saves them when we see a spike in CPU usage, perhaps with the threshold configurable via a cluster setting, similar to how it works for goroutine dumps and heap profiles.
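To make the request concrete, here is a minimal sketch of the capture policy it implies: a configurable CPU threshold, a cooldown so a sustained spike does not produce a flood of profiles, and a cap on how many captures are retained. The class name, the three settings, and their defaults are hypothetical and chosen for illustration only; they are not existing CockroachDB cluster settings, and the actual profile capture is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProfileTrigger:
    """Decides when a CPU profile should be captured and which captures to keep.

    All three settings are hypothetical stand-ins for cluster settings.
    """
    threshold_pct: float = 80.0  # capture when CPU usage is at or above this
    cooldown_s: float = 60.0     # minimum gap between two captures
    max_kept: int = 3            # number of most recent captures to retain
    last_capture: Optional[float] = None
    kept: deque = field(default_factory=deque)

    def observe(self, now_s: float, cpu_pct: float) -> bool:
        """Record one CPU sample; return True if a profile should be taken now."""
        if cpu_pct < self.threshold_pct:
            return False
        in_cooldown = (
            self.last_capture is not None
            and now_s - self.last_capture < self.cooldown_s
        )
        if in_cooldown:
            return False
        self.last_capture = now_s
        self.kept.append(now_s)
        while len(self.kept) > self.max_kept:
            self.kept.popleft()  # discard the oldest capture
        return True
```

A caller would feed `observe` one sample per polling tick and start a profile whenever it returns `True`; `kept` stands in for the list of profile files that would appear in a debug zip.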

Describe alternatives you've considered

A custom-made script that watches CPU consumption per node and "wakes up" to get a token and curl the correct endpoint on the correct node(s) when CPU rises above a certain proportional threshold (maybe compared to a rolling CPU average for each individual node). The concern is that this wouldn't act fast enough to get decent CPU profiles for short spikes, either.
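The detection half of such a script could look roughly like the sketch below, which compares each new sample against a per-node rolling average. `SpikeWatcher`, `window`, and `ratio` are hypothetical names, and `fetch_profile` is an injected placeholder for whatever token-plus-curl call is used to pull the profile; none of this is an existing tool.

```python
from collections import deque
from typing import Callable


class SpikeWatcher:
    """Flags a node when its CPU exceeds `ratio` times its own rolling average."""

    def __init__(self, fetch_profile: Callable[[str], None],
                 window: int = 30, ratio: float = 1.5):
        self.fetch_profile = fetch_profile  # placeholder for the real profile request
        self.window = window                # number of samples in the rolling average
        self.ratio = ratio                  # proportional threshold over the baseline
        self.history = {}                   # node name -> deque of recent samples

    def observe(self, node: str, cpu_pct: float) -> bool:
        """Record one sample for `node`; return True if it counted as a spike."""
        samples = self.history.setdefault(node, deque(maxlen=self.window))
        spiked = False
        if len(samples) == self.window:  # wait until a full baseline exists
            baseline = sum(samples) / len(samples)
            spiked = baseline > 0 and cpu_pct > baseline * self.ratio
        if spiked:
            self.fetch_profile(node)
        else:
            samples.append(cpu_pct)  # keep spike samples out of the baseline
        return spiked
```

The concern raised above still applies to this approach: the watcher can react no faster than its polling interval, so a spike shorter than one interval plus the profile duration may be missed.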

@thtruo

Jira issue: CRDB-21194

Epic CRDB-20791

kvoli commented 2 years ago

FWIW, I have trialed https://github.com/parca-dev/parca against a cluster that was running for benchmarking. It does what you are suggesting and could readily be deployed alongside CockroachDB, similar to Grafana and Prometheus.

A solution that collects continuous profiles within the binary would be ideal. I think this could fall within the domain of the new obs service; in the meantime, however, a simple UI similar to those of comparable DBs would be great.