tim-o opened this issue 4 years ago (status: Open)
Hi @tim-o, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.
While you're here, please consider adding an A- label to help keep our repository tidy.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
We should also dump open connections/queries and statement performance, periodically and when a node OOMs, so we're not stuck trying to yank them while the problem is happening live.
I think this should happen whenever a node crashes. This helps us find "queries of death" that cause panics, in addition to OOMs. Isolating and blacklisting "queries of death" that cause panics or OOMs becomes even more critical when we start running multi-tenant CRDB.
I expect it's tricky to do this. How do you make sure to write out all queries before crash? Doesn't have to be perfect to be useful.
I expect it's tricky to do this. How do you make sure to write out all queries before crash?
A way to avoid this issue is to maintain an mmapped file, and treat it as a circular buffer to which queries are added before starting to execute them, and erased when complete. A process crash will not prevent the dirty pages from being eventually flushed.
VERY COOL!
Related in that it complements this issue: https://github.com/cockroachdb/cockroach/issues/51643
I'm gonna gently bump this issue whenever there is a CC production issue that we don't quite get to a smoking gun for but could have with this tool built.
This is the first such bump. CC @RaduBerinde.
(I know we are all very busy and in fact the SRE team could build this tool (but are also very busy). Not trying to apply pressure exactly. Trying to make sure we see how many root causes we miss by not building observability tools early.)
I swear I didn't plan this. A 27 node CC cluster experienced a high impact series of OOMs (two ~10m periods of 1/15 ranges unavailable according to unavailable ranges metric). This tool again would help.
We should consider expanding this to store data about memory usage, e.g. how much total mem usage there is, how much SQL accounted for usage there is, etc. Any data that will help us both understand causes of crashes well enough in real time to actually mitigate & root cause crashes post incident should be included.
OOMs come to mind mainly. OOMs are VERY difficult to mitigate & often not root caused even when DB eng looks in addition to SRE. The crash dump should make clear which aspects of our memory monitoring worked as expected & which failed.
Here is another issue where we are slower root causing due to lack of this tool: https://github.com/cockroachdb/cockroach/issues/64661
(I know we have a lot to do but you must forgive me when I beat my drum; it is the only way I can stay sane under the thumb of the pager!)
cc @thtruo - jordan highlighted to me the many benefits of this work, but it's not on the roadmap yet. I predict that a good outcome here would be reached with a small partnership between server and sql queries team.
Can you create a jira initiative/epic for this and bring it up during our next team/planning meeting?
@thtruo - for the case: Jordan spells out that this comes up often in support cases.
There is a CC production customer who recently experienced an out of resources outage where we (many different engineers at CRL) are having a hard time getting a root cause. This might help! Strong business case can be made based on that IMO; I can give details to whoever (but won't include them here).
💯 to some action on this one! Tho I also know we are all under considerable bandwidth constraints and prioritization is quite hard right now.
Created an epic and CC'd y'all on it. I'll mention it in the Server team's meeting agenda to start.
ok so in my opinion, the idea of a mmapped circular buffer is totally impractical and a non-starter, for 3 main reasons:

- it would become a bottleneck for all the concurrent goroutines; at large query loads (e.g. on a 16+ core machine) it would start to constrain QPS
- it would incur serialization overhead (to print out all the bytes in that buffer) even in the common case where the data is not needed
- it would produce data that's hard to use; we'd need to invent a delimiter syntax inside the representation to decide where entries start and end, where the "current position" is; we'd need to invent a custom parser / view tool; and we'd run the risk of inconsistent data if the process crashes mid-write

Additionally it's somewhat non-portable: we don't have mmap on all platforms we want to support (windows).

Here is the simpler, more portable, more practical and more extensible solution:

1. implement a new crashdumper alongside the existing goroutinedumper and cpuprofiler, using the same framework (profiler_common)
2. inside that dumper's logic, put code that calls hooks registered by the various components (sql, rpc, server, etc)
3. hook the new dumper created at step 1 to the linux-specific "memory.pressure_level notification" OS mechanism, in addition to the natural periodic behavior of the current framework (i.e. the dumper would run every X seconds by default, and then also one time extra when the memory pressure trigger fires)
4. register hooks to this crash dumper throughout the remainder of cockroachdb:
   - sql: dump queries and txns
   - server: dump node-to-node connections
   - log: dump the current tracing registry

Ignoring the fact that it might not be practical for a moment, the nice thing about the more continuous approach suggested by @sumeerbhola is that we will always have the queries being executed exactly at crash time on disk, regardless of the cause of the crash (SIGKILL due to OOM, panic, etc.), and relatedly, we need not set up a bunch of triggers which may or may not work consistently in different production environments. In particular, I feel the memory.pressure_level notification idea is both very interesting AND risky; there are multiple kinds of out-of-memory actions on k8s (discussed more here: https://github.com/cockroachdb/cockroach/issues/65127); how consistently will we be able to write a dump to disk ahead of SIGKILL given all that? How do we test it (esp. given the dep on production environment details)? Will it work when the cluster is CPU overloaded? etc. etc. All these Qs are side-stepped by the more continuous approach.
Now, it is unreasonable to ignore whether it is practical obviously. I don't have input to give on that Q right now but I am very curious to see what @sumeerbhola has to say about the various issues Raphael brings up!
All these Qs are side-stepped by the more continuous approach.
It's not because one set of Qs is side-stepped that the approach is simpler/better.
Once you recognize that the mmap approach raises a different set of harder questions instead, we need to talk about trade-offs.
Once you recognize that the mmap approach raises a different set of harder questions instead, we need to talk about trade-offs.
I agree with this. I do see what you mean re: tradeoffs. And the only way we get a complete view of the tradeoffs is argument, even playing devil's advocate. Good thing we all love argument!
yes sure - and possibly all this needs to make its way to a RFC eventually
In the meantime, a periodic crash dumper with a similar heuristic as our current heapprofiler could be an incremental tactical step that already makes a difference, and wouldn't require much additional research on trade-offs. Might be worth trying it out since it wouldn't have much performance downside.
it would become a bottleneck for all the concurrent goroutines; at large query loads (e.g. on a 16+ core machine) it would start to constrain QPS.
If it reduces performance in a real way, that's a no go, agreed. I'm waiting to see what Sumeer thinks about this. I wonder about the concrete implementation details they have in their head / whether they would agree with the above characterization.
it would incur serialization overhead (to print out all the bytes in that buffer) even in the common case where the data is not needed
Not sure I follow this one. Can you say more? When would the serialization overhead be incurred? How much serialization overhead? What about the serialization overhead leads to a worse customer experience? Are you worried about effects on performance?
it would produce data that's hard to use; we'll need to invent a delimiter syntax inside the representation to decide where entries start and end, where the "current position" is; we'd need to invent a custom parser / view tool
Noted.
; and we'd run the risk of inconsistent data if the process crashes mid-write.
Noted.
To me, the perf issue is the one that could make the approach a "non-starter" but I also question the extent of the perf impact. The latter two issues seem more in the realm of drawbacks that should be weighed against benefits of the approach. Also, I agree with this:
In the meantime, a periodic crash dumper with a similar heuristic as our current heapprofiler could be an incremental tactical step that already makes a difference
Can you say more? When would the serialization overhead be incurred? How much serialization overhead? What about the serialization overhead leads to a worse customer experience? Are you worried about effects on performance?
What data do you think would be written to the file?
Today we have a tree-like data structure to describe SQL queries in memory. If we want the queries to appear in a mmapped file, we'll need an algorithm that recurses into that tree to print out the SQL into the file.
For every query.
That's: 1) expensive time-wise (performance); 2) expensive memory-wise (it will make the OOM situation worse).
I believe we have the SQL string readily available (from the parser). The file can be sharded into a few buffers to address contention. IMO there is something to the mmap idea (maybe not to replace the heap dumps but as additional data). We would know without ifs or buts what was running when we crashed.
Eg if a single query triggers a rapid OOM, I think there is very little hope for a "pre-crash" signal monitor to capture something useful.
@knz, the way you're talking it sounds like you'd prefer to reject trying out a promising new idea based on a hunch! I'm sure that's not what you intend to do 😄
We should try @sumeerbhola's idea. To me, it is the most compelling proposal, because it is an always-up-to-date representation of what happened at crash time, like @RaduBerinde says. And it seems likely to me that we can get around the performance issues one way or another.
Does anybody want to prototype this idea? Maybe @sumeerbhola, if you're interested in showing us how it's done, or @abarganier?
I am not fond of solutions using mmap in general, from experience, for the reasons outlined above.
And then on top of that it's going to extend the scope of things the server team has to care about.
These are all opportunity+future ongoing maintenance costs and we don't have a compelling argument that the marginal benefit over a periodic report will offset that.
Just noting that the following PR is related to this effort: #66901
While it doesn't achieve exactly what we're looking for here, it does "chip away" at the overarching goal and buys us some intermediate gains in debuggability.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Is your feature request related to a problem? Please describe. When a node OOMs, it's often due to failures to account for memory used to execute a particular statement. We often have to rely on client application logs to figure out which specific statement caused the OOM. In one recent example, it took multiple hours on a remote to determine that a statement that was unexpectedly orders of magnitude larger than the end user expected was a culprit. This could've been figured out at a glance if we were able to see what was running at the time of the OOM.
Describe the solution you'd like We're already discussing collecting memory/cpu profiles and goroutine dumps periodically. We should also dump open connections/queries and statement performance, periodically and when a node OOMs, so we're not stuck trying to yank them while the problem is happening live.
Jira issue: CRDB-3911