tim-o opened this issue 4 years ago (status: Open)
Hi @tim-o, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.
While you're here, please consider adding an A- label to help keep our repository tidy.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
We should also dump open connections/queries and statement performance, periodically and when a node OOMs, so we're not stuck trying to yank them while the problem is happening live.
I think this should happen whenever a node crashes. This helps us find "queries of death" that cause panics, in addition to OOMs. Isolating and blacklisting "queries of death" that cause panics or OOMs becomes even more critical when we start running multi-tenant CRDB.
I expect it's tricky to do this. How do you make sure to write out all queries before crash? Doesn't have to be perfect to be useful.
I expect it's tricky to do this. How do you make sure to write out all queries before crash?
A way to avoid this issue is to maintain an mmapped file, and treat it as a circular buffer to which queries are added before starting to execute them, and erased when complete. A process crash will not prevent the dirty pages from being eventually flushed.
VERY COOL!
Related in that it complements this issue: https://github.com/cockroachdb/cockroach/issues/51643
I'm gonna gently bump this issue whenever there is a CC production issue that we don't quite get to a smoking gun for but could have with this tool built.
This is the first such bump. CC @RaduBerinde.
(I know we are all very busy and in fact the SRE team could build this tool (but are also very busy). Not trying to apply pressure exactly. Trying to make sure we see how many root causes we miss by not building observability tools early.)
I swear I didn't plan this. A 27 node CC cluster experienced a high impact series of OOMs (two ~10m periods of 1/15 ranges unavailable according to unavailable ranges metric). This tool again would help.
We should consider expanding this to store data about memory usage, e.g. how much total mem usage there is, how much SQL accounted for usage there is, etc. Any data that will help us both understand causes of crashes well enough in real time to actually mitigate & root cause crashes post incident should be included.
OOMs come to mind mainly. OOMs are VERY difficult to mitigate & often not root caused even when DB eng looks in addition to SRE. The crash dump should make clear which aspects of our memory monitoring worked as expected & which failed.
Here is another issue where we are slower root causing due to lack of this tool: https://github.com/cockroachdb/cockroach/issues/64661
(I know we have a lot to do but you must forgive me when I beat my drum; it is the only way I can stay sane under the thumb of the pager!)
cc @thtruo - jordan highlighted to me the many benefits of this work, but it's not on the roadmap yet. I predict that a good outcome here would be reached with a small partnership between server and sql queries team.
Can you create a jira initiative/epic for this and bring it up during our next team/planning meeting?
@thtruo - for the case: Jordan spells out that this comes up often in support cases.
There is a CC production customer who recently experienced an out of resources outage where we (many different engineers at CRL) are having a hard time getting a root cause. This might help! Strong business case can be made based on that IMO; I can give details to whoever (but won't include them here).
💯 to some action on this one! Tho I also know we are all under considerable bandwidth constraints and prioritization is quite hard right now.
Created an epic and CC'd y'all on it. I'll mention it in the Server team's meeting agenda to start.
ok so in my opinion, the idea of a mmapped circular buffer is totally impractical and a non-starter, for 3 main reasons:

- it would become a bottleneck for all the concurrent goroutines; at large query loads (e.g. on a 16+ core machine) it would start to constrain QPS
- it would incur serialization overhead (to print out all the bytes in that buffer) even in the common case where the data is not needed
- it would produce data that's hard to use; we'd need to invent a delimiter syntax inside the representation to decide where entries start and end, where the "current position" is; we'd need to invent a custom parser / view tool; and we'd run the risk of inconsistent data if the process crashes mid-write

Additionally it's somewhat non-portable: we don't have mmap on all platforms we want to support (windows).

Here is the simpler, more portable, more practical and more extensible solution:

1. implement a new crashdumper alongside the existing goroutinedumper and cpuprofiler, using the same framework (profiler_common)
2. inside that dumper's logic, put code that calls hooks registered by the various components (sql, rpc, server, etc)
3. hook the new dumper created at step 1 to the linux-specific "memory.pressure_level notification" OS mechanism, in addition to the natural periodic behavior of the current framework (i.e. the dumper would run every X seconds by default, and then also one time extra when the memory pressure trigger fires)
4. register hooks to this crash dumper throughout the remainder of cockroachdb:
   - sql: dump queries and txns
   - server: dump node-to-node connections
   - log: dump the current tracing registry

Ignoring the fact that it might not be practical for a moment, the nice thing about the more continuous approach suggested by @sumeerbhola is that we will always have the queries being executed exactly at crash time on disk, regardless of the cause of the crash (SIGKILL due to OOM, panic, etc.), and relatedly, we need not set up a bunch of triggers which may or may not work consistently in different production environments. In particular, I feel the memory.pressure_level notification idea is both very interesting AND risky; there are multiple kinds of out-of-memory actions on k8s (discussed more here: https://github.com/cockroachdb/cockroach/issues/65127); how consistently will we be able to write a dump to disk ahead of SIGKILL given all that? How do we test it (esp. given the dep on production environment details)? Will it work when the cluster is CPU overloaded? etc. etc. All these Qs are side-stepped by the more continuous approach.
Now, it is unreasonable to ignore whether it is practical obviously. I don't have input to give on that Q right now but I am very curious to see what @sumeerbhola has to say about the various issues Raphael brings up!
All these Qs are side-stepped by the more continuous approach.
It's not because one set of Qs is side-stepped that the approach is simpler/better.
Once you recognize that the mmap approach raises a different set of harder questions instead, we need to talk about trade-offs.
Once you recognize that the mmap approach raises a different set of harder questions instead, we need to talk about trade-offs.
I agree with this. I do see what you mean re: tradeoffs. And the only way we get a complete view of the tradeoffs is argument, even playing devil's advocate. Good thing we all love argument!
yes sure - and possibly all this needs to make its way to a RFC eventually
In the meantime, a periodic crash dumper with a similar heuristic as our current heapprofiler could be an incremental tactical step that already makes a difference, and wouldn't require much additional research on trade-offs. Might be worth trying it out since it wouldn't have much performance downside.
it would become a bottleneck for all the concurrent goroutines; at large query loads (e.g. on a 16+ core machine) it would start to constrain QPS.
If it reduces performance in a real way, that's a no go, agreed. I'm waiting to see what Sumeer thinks about this. I wonder about the concrete implementation details they have in their head / whether they would agree with the above characterization.
it would incur serialization overhead (to print out all the bytes in that buffer) even in the common case where the data is not needed
Not sure I follow this one. Can you say more? When would the serialization overhead be incurred? How much serialization overhead? What about the serialization overhead leads to a worse customer experience? Are you worried about effects on performance?
it would produce data that's hard to use; we'll need to invent a delimiter syntax inside the representation to decide where entries start and end, where the "current position" is; we'd need to invent a custom parser / view tool
Noted.
; and we'd run the risk of inconsistent data if the process crashes mid-write.
Noted.
To me, the perf issue is the one that could make the approach a "non-starter" but I also question the extent of the perf impact. The latter two issues seem more in the realm of drawbacks that should be weighed against benefits of the approach. Also, I agree with this:
In the meantime, a periodic crash dumper with a similar heuristic as our current heapprofiler could be an incremental tactical step that already makes a difference
Can you say more? When would the serialization overhead be incurred? How much serialization overhead? What about the serialization overhead leads to a worse customer experience? Are you worried about effects on performance?
What data do you think would be written to the file?
Today we have a tree-like data structure to describe SQL queries in memory. If we want the queries to appear in a mmapped file, we'll need an algorithm that recurses into that tree to print out the SQL into the file.
For every query.
That's: 1) expensive time-wise (performance); 2) expensive memory-wise (it will make the OOM situation worse).
I believe we have the SQL string readily available (from the parser). The file can be sharded into a few buffers to address contention. IMO there is something to the mmap idea (maybe not to replace the heap dumps but as additional data). We would know without ifs or buts what was running when we crashed.
Eg if a single query triggers a rapid OOM, I think there is very little hope for a "pre-crash" signal monitor to capture something useful.
@knz, the way you're talking it sounds like you'd prefer to reject trying out a promising new idea based on a hunch! I'm sure that's not what you intend to do 😄
We should try @sumeerbhola's idea. To me, it is the most compelling proposal, because it is an always-up-to-date representation of what happened at crash time, like @RaduBerinde says. And it seems likely to me that we can get around the performance issues one way or another.
Does anybody want to prototype this idea? Maybe @sumeerbhola, if you're interested in showing us how it's done, or @abarganier?
I am not fond of solutions using mmap in general, from experience, for the reasons outlined above.
And then on top of that it's going to extend the scope of things the server team has to care about.
These are all opportunity+future ongoing maintenance costs and we don't have a compelling argument that the marginal benefit over a periodic report will offset that.
Just noting that the following PR is related to this effort: #66901
While it doesn't achieve exactly what we're looking for here, it does "chip away" at the overarching goal and buys us some intermediate gains in debuggability.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Is your feature request related to a problem? Please describe. When a node OOMs, it's often due to failures to account for memory used to execute a particular statement. We often have to rely on client application logs to figure out which specific statement caused the OOM. In one recent example, it took multiple hours on a remote to determine that a statement that was unexpectedly orders of magnitude larger than the end user expected was a culprit. This could've been figured out at a glance if we were able to see what was running at the time of the OOM.
Describe the solution you'd like We're already discussing collecting memory/cpu profiles and goroutine dumps periodically. We should also dump open connections/queries and statement performance, periodically and when a node OOMs, so we're not stuck trying to yank them while the problem is happening live.
Jira issue: CRDB-3911