Closed kevinkokomani closed 1 month ago
FWIW, I have trialed https://github.com/parca-dev/parca against a cluster used for benchmarking. It does what you are suggesting and could readily be deployed alongside CockroachDB, similar to Grafana and Prometheus.
A solution that collects continuous profiles within the binary would be ideal. I think this could fall within the domain of the new obs service; in the meantime, a simple UI similar to those of comparable DBs would be great.
Is your feature request related to a problem? Please describe.
There is some notion of historical heap profiles and goroutine dumps in the debug zip, gathered at certain times (I believe when there is an uptick in goroutines or memory consumption over a certain proportional threshold). With CPU profiles, by contrast, the only ways we get them are manually via the DB Console endpoint, or at the moment in time that a debug zip is taken.
Since CPU spikes are often short, we would like a way to have profiles gathered automatically when a spike occurs. The other option is to actively monitor the CPU percentage until we see a spike begin, and then quickly navigate to the correct place in the DB Console or run the `curl`. Since CPU spike windows can be so short, this is not a guaranteed success either.
Currently, we lack observability into CPU consumption for short spikes, whereas sustained CPU increases are more easily observable.
Describe the solution you'd like
Something that collects CPU profiles automatically and saves them when we see a spike in CPU usage, perhaps with the threshold configurable via a cluster setting, similar to how it works for goroutine dumps and heap profiles.
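To make the request concrete, here is a minimal sketch of the trigger logic: capture a profile when CPU utilization crosses a configurable threshold, but no more often than a minimum interval so that a long spike does not produce a flood of profiles. This is an illustration only, not how CockroachDB's existing heap or goroutine dumpers are implemented; `SpikeTrigger`, `threshold`, and `min_interval_s` are hypothetical names standing in for whatever cluster settings would be chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpikeTrigger:
    """Decides when a CPU profile should be captured (hypothetical helper)."""

    threshold: float = 0.80        # assumed setting: CPU utilization in [0, 1]
    min_interval_s: float = 60.0   # assumed setting: minimum gap between captures
    _last_capture: Optional[float] = None

    def should_capture(self, cpu_utilization: float, now: float) -> bool:
        # Below the threshold: nothing to do.
        if cpu_utilization < self.threshold:
            return False
        # Above the threshold, but a profile was captured too recently.
        if (
            self._last_capture is not None
            and now - self._last_capture < self.min_interval_s
        ):
            return False
        # Record the capture time so later calls are rate limited.
        self._last_capture = now
        return True
```

The caller would sample CPU utilization on a short period and start a profile whenever `should_capture` returns `True`. How long to profile and how many profiles to retain are left open here.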
Describe alternatives you've considered
Some sort of custom-made script that watches the CPU consumption per node and "wakes up" to get a token and `curl` the correct endpoint for the correct node(s) if we see an increase in CPU above a certain proportional threshold (maybe compared to a rolling CPU average for each individual node). The concern is that this wouldn't act fast enough to get decent CPU profiles for short enough spikes, either.
@thtruo
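For reference, the per-node rolling-average check described in this alternative could look roughly like the sketch below. `RollingBaseline`, `profile_url`, and all parameter values are hypothetical. The URL assumes the node serves the standard Go pprof handler at `/debug/pprof/profile` on its HTTP port. Authentication (the token mentioned above) and the HTTP request itself are deliberately left out.

```python
from collections import deque
from statistics import fmean


class RollingBaseline:
    """Tracks recent CPU samples for one node (hypothetical helper)."""

    def __init__(self, window: int = 30, ratio: float = 1.5, min_samples: int = 5):
        self.samples = deque(maxlen=window)  # most recent `window` samples
        self.ratio = ratio                   # spike = sample > ratio * rolling mean
        self.min_samples = min_samples       # wait for a baseline before alerting

    def observe(self, cpu: float) -> bool:
        """Record a sample; return True if it exceeds ratio x the prior average."""
        spike = (
            len(self.samples) >= self.min_samples
            and cpu > self.ratio * fmean(self.samples)
        )
        # Always record the sample, so a sustained increase eventually
        # becomes the new baseline and stops triggering.
        self.samples.append(cpu)
        return spike


def profile_url(host: str, seconds: int) -> str:
    # Assumed endpoint: Go's net/http/pprof CPU profile handler.
    return f"https://{host}/debug/pprof/profile?seconds={seconds}"
```

A script would keep one `RollingBaseline` per node and request `profile_url(...)` for any node whose `observe` call returns `True`. The delay between detecting the spike and the profile starting is exactly the weakness noted above.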
Jira issue: CRDB-21194
Epic CRDB-20791