Open tbg opened 1 year ago
I've been thinking some more about your proposal and I'm bullish that we have almost all the parts to put something together. If we emitted the runtime traces as StructuredEvents for each of the spans subscribed to the profiler, we could have a client-side aggregator (example: https://github.com/cockroachdb/cockroach/pull/84043) bucket these by node, and periodically flush them to disk. For queries, this could be somewhere on local storage, while jobs can insert a row into the new system.job_info table. These rows can then be retrieved+processed by whatever tool we decide to build for jobs observability. This is in line with how we plan to aggregate trace-driven statistics for jobs (example: https://github.com/cockroachdb/cockroach/pull/96968). I'll try to get a prototype for this going so that we can discover the pitfalls.
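Roughly the shape I have in mind for the aggregator, as a non-authoritative sketch: none of the names below (StructuredEvent, Aggregator, the flush callback) are existing CRDB APIs, they just illustrate the bucket-by-node-and-flush-periodically idea. The flush callback is where we'd either write to local storage (queries) or insert a row into system.job_info (jobs).

```go
// Sketch only: bucket incoming structured events by node and flush on a timer.
package aggregator

import (
	"sync"
	"time"
)

// StructuredEvent stands in for the runtime-trace payloads emitted on spans.
type StructuredEvent struct {
	NodeID  int
	Payload []byte
}

// Aggregator buckets events by node and flushes them periodically.
type Aggregator struct {
	mu      sync.Mutex
	buckets map[int][][]byte
	flush   func(nodeID int, payloads [][]byte) error // e.g. write to disk, or insert into system.job_info
}

func NewAggregator(flush func(int, [][]byte) error, interval time.Duration) *Aggregator {
	a := &Aggregator{buckets: make(map[int][][]byte), flush: flush}
	go func() {
		for range time.Tick(interval) {
			a.flushAll()
		}
	}()
	return a
}

// Record is called for every structured event received from a subscribed span.
func (a *Aggregator) Record(ev StructuredEvent) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.buckets[ev.NodeID] = append(a.buckets[ev.NodeID], ev.Payload)
}

func (a *Aggregator) flushAll() {
	a.mu.Lock()
	buckets := a.buckets
	a.buckets = make(map[int][][]byte)
	a.mu.Unlock()
	for nodeID, payloads := range buckets {
		_ = a.flush(nodeID, payloads) // error handling elided in this sketch
	}
}
```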
If we emitted the runtime traces as StructuredEvents
If we can bound the size of profiles acquired by a local root span (i.e. transitively, together with its children), then that seems like a great first step. It would be preferable not to buffer these in memory or send them around via gRPC trace responses at all, but this is a limitation we should be able to live with at first as long as we avoid unbounded memory consumption. I think a budget of a few MB would be fine: drop profiles after that and record the fact that some were dropped in the span (not needed for the POC, obvs).
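For concreteness, a minimal sketch of that bounded buffering, with made-up names and a made-up budget; annotating the span about dropped profiles would happen in the caller:

```go
// Sketch only: per-root-span buffer that stops accepting profiles once a budget is hit.
package tracebudget

import "sync"

type profileBuffer struct {
	mu          sync.Mutex
	budgetBytes int64 // e.g. a few MB per local root span
	usedBytes   int64
	profiles    [][]byte
	dropped     int // number of profiles dropped once the budget was exceeded
}

// add buffers a profile unless doing so would blow the budget; in that case it
// only counts the drop, so the span can later report that profiles were lost
// instead of consuming unbounded memory.
func (b *profileBuffer) add(p []byte) (accepted bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.usedBytes+int64(len(p)) > b.budgetBytes {
		b.dropped++
		return false
	}
	b.usedBytes += int64(len(p))
	b.profiles = append(b.profiles, append([]byte(nil), p...))
	return true
}
```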
I'll try to get a prototype for this going so that we can discover the pitfalls.
❤️
It would be preferable to not buffer these in memory/send them around via gRPC trace responses at all, but this is a limitation we should be able to live with at first as long as we avoid unbounded memory consumption
A one-second runtime trace of a single-node testserver is ~200KiB. That already blows past the size we allow a structured event to grow to, so emitting these traces as StructuredEvent payloads on the spans that subscribe to such runtime traces might be a non-starter.
This leaves us with the model of flushing the traces to disk in a directory of our choosing and then recording that directory in the span so it makes its way back to the node coordinating the query/job. I'm not sure whether the job profile or statement bundle machinery should then reach out to all these directories and try to stitch the traces together as a whole, or just leave it up to the user to open a trace visualizer for each output file.
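A sketch of what the flush-to-disk side could look like, assuming we pick a node-local directory and record only the returned path on the span; the function and file-naming scheme below are hypothetical:

```go
// Sketch only: write one runtime-trace chunk to a node-local directory and
// return the path to be recorded on the span (instead of the payload itself).
package tracedump

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// flushRuntimeTrace persists a trace chunk and returns its path. Retention
// and cleanup are out of scope for this sketch.
func flushRuntimeTrace(dir string, traceID uint64, chunk []byte) (string, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	name := fmt.Sprintf("runtime-trace-%d-%d.bin", traceID, time.Now().UnixNano())
	path := filepath.Join(dir, name)
	if err := os.WriteFile(path, chunk, 0o644); err != nil {
		return "", err
	}
	return path, nil
}
```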
I'm going to tentatively assign this to myself and put it on the DR board for planning purposes. I may or may not get this all the way to completion but I'm keen to have something we can use to collect query and job runtime profiles + traces.
cc @cockroachdb/disaster-recovery
Some more thoughts from a recent escalation:
we could also more tightly integrate statement bundle requests with execution traces:
In another recent (and currently ongoing) escalation^1, we are seeing a demonstration of the power of execution traces. Starting from a trace that showed a slow span in DistSender, we were able to use execution traces to narrow this down to a slow (700ms) syscall reading off the TCP connection (with a similar analysis in progress on the server side, outcome as yet unknown). This level of visibility is quite powerful and allows us to discriminate between issues in "our code" and the OS/networking layer. This is more evidence that making it easier for customers to collect matching trace/exec-trace data is worth our while.
Is your feature request related to a problem? Please describe.
Traces can answer some questions today, but not others. In particular, unexpected delays between trace events (Go scheduler latency, lock contention, high CPU usage in a subsystem such as pebble, etc.) lead to time-consuming and often fruitless investigations. A "big hammer" that would allow us to get to the bottom of most of them would be invaluable.
Describe the solution you'd like
A big hammer whose centerpiece is automatic CPU profiles and/or Go execution traces (we suggest implementing both, since they're complementary and behave the same from an API point of view).
Only a single CPU profile/execution trace (henceforth both called "profile", for simplicity) can be active per node at any given point in time. We add a service on each node that multiplexes profile requests onto slots of at most 1s: profiles are taken in 1s-or-shorter chunks, and multiple (local, in-process) listeners can subscribe to profiles for however long they wish, which means they get notified of any 1s chunk that overlaps their tenure. No profiles are taken in the absence of listeners. Profiles get stored on node-local storage and have an identifier with which they can be retrieved at a later time. (Retention policies, tenant access rules, and everything else needed to make this work for UA are details TBD.)
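To make the slotting concrete, here is a rough, non-authoritative sketch of such a per-node service: a single loop takes <=1s CPU profile chunks only while at least one subscriber is registered, and fans each chunk out to every subscriber whose tenure overlaps it. All names are made up, and persisting chunks to node-local storage is elided.

```go
// Sketch only: multiplex CPU profiling into <=1s chunks shared by all subscribers.
package profiler

import (
	"bytes"
	"runtime/pprof"
	"sync"
	"time"
)

type Service struct {
	mu   sync.Mutex
	subs map[chan []byte]struct{}
}

func New() *Service {
	s := &Service{subs: make(map[chan []byte]struct{})}
	go s.loop()
	return s
}

// Subscribe registers a listener; it receives every <=1s profile chunk taken
// while it remains subscribed. The returned func cancels the subscription.
func (s *Service) Subscribe() (<-chan []byte, func()) {
	ch := make(chan []byte, 16)
	s.mu.Lock()
	s.subs[ch] = struct{}{}
	s.mu.Unlock()
	cancel := func() {
		s.mu.Lock()
		delete(s.subs, ch)
		s.mu.Unlock()
	}
	return ch, cancel
}

func (s *Service) loop() {
	for {
		s.mu.Lock()
		n := len(s.subs)
		s.mu.Unlock()
		if n == 0 {
			// No listeners: take no profiles at all.
			time.Sleep(100 * time.Millisecond)
			continue
		}
		var buf bytes.Buffer
		if err := pprof.StartCPUProfile(&buf); err != nil {
			// Someone else holds the profiler; back off and retry.
			time.Sleep(time.Second)
			continue
		}
		time.Sleep(time.Second) // one 1s slot
		pprof.StopCPUProfile()
		chunk := buf.Bytes()
		s.mu.Lock()
		for ch := range s.subs {
			select {
			case ch <- chunk:
			default: // slow subscriber; drop this chunk rather than block
			}
		}
		s.mu.Unlock()
	}
}
```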
We can use this to enable the following use cases:
For a practical example where we'd have loved to have CPU profiles alongside traces, see https://github.com/cockroachlabs/support/issues/2033. Go execution traces would be invaluable for the NUMA issues we've seen, and for the many latency investigations that come with unexplainable gaps in the trace in non-CPU-intensive parts; there are likely scheduler latencies at play, and an execution trace would prove it. But we only have pprof-loop, which is hard to use and isn't included in the initial outreach, so we spend lots of time asking for (and often not getting) this additional info. Customers spend lots of time putting it together, too.
[^1]: note that if the diagnostics request has a "min latency" bound, this might in effect turn on profiling continuously on some or all nodes in the cluster, so we need to validate that this is OK, perhaps by using suitable RBAC to gate it to admins only, or by only allowing CPU profiles at a lower profiling rate, etc.
Describe alternatives you've considered
Can't really think of much
Additional context
Jira issue: CRDB-24848
Epic: CRDB-32402