Consistently root cause OOMs, at least in a cloud context #65127

Open joshimhoff opened 3 years ago

joshimhoff commented 3 years ago

Describe the problem

We regularly fail to root cause OOMs. We'd like to consistently root cause them, at least in a cloud context.

CC current state

I'm not confident I have all of this right, but by writing it up we can figure out what is right together!

CC runs CRDB on k8s. There are two memory-related kill mechanisms in k8s.

  1. K8s will kill pods that exceed the memory limits set on their containers. This is called an eviction.
  2. The OS OOM killer will kill processes if there is not enough memory system-wide. This is called an OOM.

The end result for the user is the same: an unexpected restart. But these are different concepts, and both happen in CC.

For more: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

Note that we do NOT set memory limits on all containers, including CRDB containers. This is bad tech debt. Serverless may fix it, as resource isolation is even more important there (though it's clearly very important in dedicated too). The end result of this is:

  1. We hit OOMs more than evictions (note they aren't very different at the end-user level), as the memory limits that would trigger eviction are set high.
  2. Flags like --max-sql-memory are set a bit high, as percentages are used (25%) and the available memory in the container is an over-estimate, e.g. it ignores the ~300 MB that Prometheus (which runs in the same k8s cluster) uses.
  3. This terrible, terrible k8s bug: https://github.com/kubernetes/kubernetes/issues/43916. TLDR: Disk cache usage "counts" as memory usage in certain cases, leading to OOM when the kernel could instead shrink the disk cache. Gasp. The workaround is to set memory limits, which we should be doing anyway. Note that though this bug is egregious, it triggers eviction, not OOM, and we mostly see OOM in CC; memory eviction tends to hit pods other than CRDB.

Note that CRDB has functionality to take a heap profile during high memory growth events & save it to disk.
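
For concreteness, here is a minimal sketch (not CRDB's actual implementation) of what such a growth-triggered heap profiler can look like using only the Go standard library; the threshold, interval, and paths are invented:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchHeap polls Go heap usage and writes a pprof heap profile to dir
// whenever usage grows past the previous high-water mark by more than
// growThreshold bytes. A real implementation would also rate-limit and
// garbage-collect old profiles.
func watchHeap(dir string, growThreshold uint64, interval time.Duration) {
	var highWater uint64
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		if ms.HeapAlloc > highWater+growThreshold {
			highWater = ms.HeapAlloc
			path := fmt.Sprintf("%s/heap.%d.pprof", dir, time.Now().Unix())
			f, err := os.Create(path)
			if err != nil {
				continue
			}
			// "heap" is the standard pprof profile of live allocations.
			_ = pprof.Lookup("heap").WriteTo(f, 0)
			f.Close()
		}
	}
}

func main() {
	go watchHeap("/tmp/heap_profiles", 256<<20 /* 256 MB */, 10*time.Second)
	select {} // stand-in for the real server
}
```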

Note we have no continuous profiling today in CC land.

Why we don't consistently root cause OOMs

  1. Growth in usage may happen quickly enough that the OOM or eviction happens before a heap profile is taken. Both OOM & eviction lead to SIGKILL, so CRDB has no ability to trap the signal.

  2. Not a lot of visibility into the relationship between OOM & GC? I don't have a good mental model here. Can GC issues lead to OOMs, and if so, are heap profiles helpful in diagnosing them? (See the sketch after this list for one way to get more visibility.)

  3. Heap profiles just measure heap usage; there are other sources of memory usage. (But are they important?)

  4. The ergonomics of pulling heap profiles off disk & looking at them one by one likely lead to operators & devs (who have to go through operators today, an anti-pattern) not always looking at them, esp. in the case of low-impact OOMs.
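
On item 2's GC question: one cheap way to get visibility is to log the Go runtime's own accounting next to the process RSS. If RSS is far above what the runtime reports as in use, the growth is likely outside the Go heap (or is memory not yet returned to the OS) and a heap profile won't explain it. A rough sketch, assuming Linux and reading RSS from /proc:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// rssBytes reads the resident set size of the current process from
// /proc/self/statm (Linux only). Field 1 is resident pages.
func rssBytes() uint64 {
	b, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(b))
	if len(fields) < 2 {
		return 0
	}
	pages, _ := strconv.ParseUint(fields[1], 10, 64)
	return pages * uint64(os.Getpagesize())
}

func main() {
	for range time.Tick(30 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		fmt.Printf("rss=%dMB heapInuse=%dMB heapIdle=%dMB released=%dMB sys=%dMB nextGC=%dMB\n",
			rssBytes()>>20, ms.HeapInuse>>20, ms.HeapIdle>>20,
			ms.HeapReleased>>20, ms.Sys>>20, ms.NextGC>>20)
	}
}
```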

Can we think of more problems?

Possible ways forward

  1. Continuous profiling. Would this help us with the kinds of OOMs we see in practice? I don't think it's likely, but it'd still be good to have continuous profiling. @jordanlewis mentions Polar Signals as a possible option.

  2. If we knew the queries, jobs, etc. executing at crash time, we'd be more likely to root cause OOMs, even without perfect profiling, IMHO: https://github.com/cockroachdb/cockroach/issues/52815. Maybe we could store some memory stats in the crash dump as well?

  3. Dump memory instead of taking a heap profile at crash time. For example: https://golang.org/pkg/runtime/debug/#WriteHeapDump (see the first sketch after this list). Note that Go appears to have no tools to analyze such a dump. Also, it suspends execution of all goroutines, but we're crashing anyway, right? Another idea is a core dump, but my possibly incorrect understanding is that such a dump is raw memory & thus not transparent enough to use to root cause an OOM, at least with available Go tooling.

  4. Soft limits can be set at the k8s level (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds), leading to k8s sending SIGTERM ahead of SIGKILL in case of eviction. CRDB can trap this signal and take profiles, dump memory, etc. (see the first sketch after this list). I'm pretty sure about the SIGTERM part; the docs are a little vague. (How would we tell the difference between a soft memory eviction and a routine CRDB patch, since both deliver SIGTERM?)

  5. Similarly, we can hook into cgroup memory pressure signals and take profiles, dump memory, etc. ahead of the OOM, as per @jordanlewis (see the second sketch after this list): https://github.com/cockroachdb/cockroach/issues/64965

  6. Similarly, we can set up a user-space OOM killer like oomd, as per @andreimatei, and if it fires take profiles, dump memory, etc. ahead of the system OOM: https://github.com/facebookincubator/oomd

  7. Similarly, as @andreimatei mentions, we can enable swap to buy more time before OOM, then use that time to take profiles, dump mem, etc., except UGH k8s doesn't support swap: https://github.com/kubernetes/kubernetes/issues/53533

Note that the latter items (4-7) are similar; the idea is to get some signal that an OOM is coming and do useful stuff with the remaining time.

  8. Create a memory ballast in the form of some pod that uses, IDK, 300 MB of memory; it is set up to be killed by k8s / the OOM killer first, buying us more time to take profiles, dump memory, etc. (This idea might make more sense in a serverless context where customer-facing costs are not so high.)

  9. Do something even crazier like turning off Linux memory over-commit & the OOM killer (https://serverfault.com/questions/141988/avoid-linux-out-of-memory-application-teardown), forking the Go runtime to dump memory if it can't allocate more, and only crashing after dumping memory. This idea is not gonna work as described (my understanding is that system processes could die, leading to k8s node stability issues worse than a CRDB OOM), but I mention it to point out that 3-6 make it more likely to get a profile or heap dump; if growth is fast enough, none of them is sufficient. Bad ideas can be turned into good ideas sometimes. That's what my parents tell me at least...
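
To make items 3 and 4 concrete, here is a sketch (not CRDB's actual signal handling) of trapping SIGTERM and writing both a pprof heap profile and a runtime heap dump before exiting. Paths are invented, and in practice we'd still need to distinguish an eviction SIGTERM from a routine restart:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/debug"
	"runtime/pprof"
	"syscall"
)

// dumpOnSIGTERM installs a handler that, on SIGTERM, writes a pprof heap
// profile plus a full runtime heap dump before the process exits. Note that
// debug.WriteHeapDump stops the world while it writes, which is acceptable
// if we believe we are about to be killed anyway.
func dumpOnSIGTERM(dir string) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGTERM)
	go func() {
		<-ch
		if f, err := os.Create(dir + "/last_gasp_heap.pprof"); err == nil {
			_ = pprof.Lookup("heap").WriteTo(f, 0)
			f.Close()
		}
		if f, err := os.Create(dir + "/last_gasp.heapdump"); err == nil {
			debug.WriteHeapDump(f.Fd()) // raw dump; no stock Go tooling reads it
			f.Close()
		}
		os.Exit(1)
	}()
}

func main() {
	dumpOnSIGTERM("/tmp")
	select {} // stand-in for the real server
}
```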
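And for item 5, a sketch of polling cgroup v2 memory pressure (PSI) from inside the container and reacting before the kernel OOM killer does. The file path and threshold are assumptions (cgroup v2 with PSI available depends on the node's kernel and k8s setup); cgroup v1 exposes a different, event-based interface:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// memPressureAvg10 parses the "some avg10=..." value from the cgroup v2
// PSI file, i.e. the share of the last 10s in which at least one task
// was stalled waiting on memory.
func memPressureAvg10(path string) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 1 && fields[0] == "some" {
			val := strings.TrimPrefix(fields[1], "avg10=")
			return strconv.ParseFloat(val, 64)
		}
	}
	return 0, fmt.Errorf("no 'some' line in %s", path)
}

func main() {
	const psiFile = "/sys/fs/cgroup/memory.pressure" // assumes cgroup v2
	for range time.Tick(5 * time.Second) {
		avg10, err := memPressureAvg10(psiFile)
		if err != nil {
			continue
		}
		if avg10 > 10.0 { // arbitrary threshold, in percent
			fmt.Println("memory pressure high; take profiles / dump memory here")
		}
	}
}
```

Whatever the trigger ends up being (SIGTERM, PSI, oomd), the reaction could share one "last gasp" code path that writes the same set of artifacts.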

Other ideas? Crazy long term ideas? Scrappy ideas? In-between ideas??

At some level, I am frustrated by how hard it is to dump memory in a legible way at OOM time. Seems like a useful thing for the whole damn world. Right??? Perhaps this statement is out of touch with the realities of the problem (memory IS over-committed) but... I am a user in this case and users are out of touch by design.

My gut, ignoring eng resource constraints, is:

  • Do 1 (continuous profiling), though it might not help with OOMs specifically.
  • Do 2: crash dump with running queries, etc.
  • Do 3: dump memory. But my confidence is not high on this one / on how we could make it work. Thoughts?
  • Do one of 4-7... I don't know which; they achieve roughly the same goal. The devil is in the details of how they interact with k8s & we should make sure we think about that.
  • Do 8 if needed? Hopefully not needed.

Or think of a much much better approach somehow.

Epic CRDB-7327

Jira issue: CRDB-7442

knz commented 3 years ago

All these suggestions are very good. This looks like a research initiative though, where we need to try multiple things side-by-side and see what sticks the best.

I'm not sure how to insert such a research initiative on our roadmap though.

It somehow feels like the idea of discrete roadmap items on just 1 team is not the right organizational approach for this. Yet the urgency remains high; it is a business-critical problem after all.

@mwang1026 @lunevalex any idea how we could structure this work?

jordanlewis commented 3 years ago

I love all of these ideas!

In my opinion, the true root cause of our inability to root cause OOMs is the lack of a reachability analyzer for Go heaps, either live or as dumps.

A reachability analyzer is a tool that lets you explore a heap to understand what is causing an object to be retained, by traversing the graph of objects in the program from its "GC roots", which are memory locations like the active stacks and global variables. YourKit for Java is the best implementation of this that I've personally seen. Here is the documentation page on the analysis: https://www.yourkit.com/docs/java/help/merged_paths.jsp

The problem with heap profiles is, of course, that they only record where an object was allocated, and zero information about why it is retained. Without that, searching for memory management issues feels like staring at a maze in the dark with a spotlight only on the end of the maze :)
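
A toy example of that distinction (nothing to do with CRDB's actual code): the heap profile below attributes all of the memory to the allocation inside makeResult, while the interesting fact, that a package-level cache is what keeps it alive, only falls out of a reachability analysis from the GC roots:

```go
package main

// cache is a GC root (a package-level variable) that silently retains
// every result ever produced. A heap profile blames makeResult's
// allocation site; it says nothing about cache being the retainer.
var cache = map[int][]byte{}

func makeResult(id int) []byte {
	return make([]byte, 1<<20) // 1 MB per request
}

func handleRequest(id int) {
	res := makeResult(id)
	cache[id] = res // retention happens here, invisibly to the heap profile
}

func main() {
	for i := 0; i < 1000; i++ {
		handleRequest(i) // ~1 GB retained, all "allocated in makeResult"
	}
}
```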

Another idea is a core dump but my possibly incorrect understanding is such a dump is raw mem & thus not transparent enough to use to root cause OOM.

A core dump can be opened with delve, but delve has no facility to traverse or analyze the heap besides inspecting individual goroutines. So it's hard, without additional tooling that I don't think exists, to use these effectively.

That being said! If we used the userspace OOM killer and enabled core dumps, we could at least open the core dumps in delve to see what the goroutines were up to. That would give us information about which queries were running, for example, and anything else the goroutines were doing. It would just be labor intensive, since delve doesn't give you tools to analyze goroutines in bulk - it's a one-at-a-time thing, and there are of course thousands or tens of thousands of goroutines in a CockroachDB instance.

Perhaps we could write a program to do this in an automated fashion to collect some information about what was happening in the instance. Or even perhaps traverse the entire object graph and create some kind of reachability analysis graph ourselves. But I think this would be significant effort.

petermattis commented 3 years ago

Delve does have an API, and you can run delve in headless mode in order to programmatically navigate a core dump. I'm not sure what limitations there are on what can be done via the API, but it certainly seems worth exploring. Another possibility is for us to engage with the Delve team and encourage them to build a reachability analyzer.
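
For illustration, a rough sketch of what programmatic use could look like: run `dlv core` headless and bucket goroutines by location over delve's JSON-RPC client. The rpc2 package and its method names/signatures are assumptions based on recent delve versions and would need to be verified before relying on this:

```go
package main

// Sketch only: start `dlv core` headless against a core file, then group
// goroutines by their current location via delve's JSON-RPC client.

import (
	"fmt"
	"os/exec"
	"time"

	"github.com/go-delve/delve/service/rpc2"
)

func main() {
	// dlv core <executable> <corefile>, serving its API instead of a REPL.
	srv := exec.Command("dlv", "core", "./cockroach", "./core.dump",
		"--headless", "--listen=127.0.0.1:4040", "--api-version=2", "--accept-multiclient")
	if err := srv.Start(); err != nil {
		panic(err)
	}
	defer srv.Process.Kill()
	time.Sleep(2 * time.Second) // crude wait for the server; poll in real code

	client := rpc2.NewClient("127.0.0.1:4040")
	// ListGoroutines pages through goroutines; (0, 0) is intended to request
	// all of them in recent delve versions (older versions take no arguments).
	gs, _, err := client.ListGoroutines(0, 0)
	if err != nil {
		panic(err)
	}
	byLoc := map[string]int{}
	for _, g := range gs {
		loc := fmt.Sprintf("%s:%d", g.UserCurrentLoc.File, g.UserCurrentLoc.Line)
		byLoc[loc]++
	}
	for loc, n := range byLoc {
		fmt.Printf("%6d goroutines at %s\n", n, loc)
	}
}
```

If this pans out, the same loop could be extended to pull full per-goroutine stack traces and search them for SQL execution frames.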

mwang1026 commented 3 years ago

@mwang1026 @lunevalex any idea how we could structure this work?

I think what'd be helpful for me is knowing which teams would need to be involved. We'd separate it out into discrete "discovery" and "delivery" projects and manage time commitment / expectations accordingly. But even before scheduling work, my suggestion for a first step would be identifying who we'd want involved and what the time commitment for discovery might look like.

joshimhoff commented 3 years ago

This would allow us to get the information about which queries were running, for example, and any other information about what goroutines were up to.

💯 though it is true we could get the running queries specifically, with better usability, via https://github.com/cockroachdb/cockroach/issues/52815? Still, a goroutine dump is nice.

Perhaps we could write a program to do this in an automated fashion to collect some information about what was happening in the instance. Or even perhaps traverse the entire object graph and create some kind of reachability analysis graph ourselves.

💯

Another possibility is for us to engage with the Delve team and encourage them to build a reachability analyzer.

💯

I think what'd be helpful for me is which teams would need to be involved.

Here's my best shot at that, @mwang1026. Managers like @jordanlewis plz chime in.

  1. Continuous profiling.

    • this is a CC thing more than a CRDB thing, though a small CRDB code change may be needed
    • SRE, intrusion dev, and/or observability infra
    • feasibility = high.
  2. Crash dumps: https://github.com/cockroachdb/cockroach/issues/52815

    • this is a CRDB thing
    • observability infra and/or some SQL team? or server?
    • feasibility = high.
  3. Dump mem (https://golang.org/pkg/runtime/debug/#WriteHeapDump, core dump, etc.), figure out what is possible with delve, convince someone to build a reachability analyzer, or build it ourselves, etc.

    • this is a CRDB thing
    • observability infra?
    • very much a research project; prob high ROI; item 3 should be broken down more

NOTE: 4-7 are roughly after the same goal: have time to take a profile / dump mem before the OS kills ya.

  4. Soft limits set at k8s level

    • this is a CC k8s thing
    • observability infra? SRE? intrusion dev?
    • risky: which of 4-7 is best is unclear; this is where risk is
  5. Hook into signals of cgroup mem pressure (https://github.com/cockroachdb/cockroach/issues/64965)

    • this is a CRDB thing but need to think deeply about k8s environment
    • observability infra? server?
    • risky: which of 4-7 is best is unclear; this is where risk is
  6. User-space OOM killer like oomd as per @andreimatei and if it fires take profiles, dump mem, etc. ahead of system OOM: https://github.com/facebookincubator/oomd

    • this is prob a k8s deploy task
    • observability infra? SRE? intrusion dev?
    • risky: which of 4-7 is best is unclear; this is where risk is
  7. Swap at OOM

  8. Mem ballast

    • this is tiny service run on k8s
    • SRE? intrusion dev?
    • easy, but not useful without a bunch of the above work, & it's not clear how useful it would be at this point
  9. Do something even crazier like turning off Linux memory over-commit & the OOM killer (https://serverfault.com/questions/141988/avoid-linux-out-of-memory-application-teardown), forking the Go runtime to dump memory if it can't allocate more, and only crashing after dumping memory. This is a placeholder for a "think of a better idea" idea!!

abarganier commented 3 years ago

Excellent write up! This problem fits well into the overarching initiative to improve CPU & RAM observability, so we will be sure to factor it into discussions around prioritization of research efforts.

It somehow feels like the idea of discrete roadmap items on just 1 team is not the right organizational approach for this. Yet the urgency remains high; it is a business-critical problem after all.

I agree, although I think it's okay for a single team to at least be the continuous driver of coordination, initial research, etc. Additional teams that are close to the problem space can then be included when researching the various potential options, to parallelize effort. Prioritization is tricky, though; as you say, all of these issues are indeed important. I think once we prioritize and choose a problem or two to focus on, it will be easier to identify which teams can contribute and schedule effort. Unfortunately, as you can tell from @joshimhoff's list above, there's a lot of overlap in the teams that would be involved from one problem to another (e.g. obs. teams, intrusion, SRE), so that could limit parallelization.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!