cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

kv,sql: expensive select queries cause OOM #123000

Open andrewbaptist opened 2 months ago

andrewbaptist commented 2 months ago

Describe the problem

Running multiple queries in parallel can cause nodes to OOM.

To Reproduce

1) Create a 3 node cluster initialized with a KV DB.
2) Begin a background workload to write many KV keys: roachprod ssh $CLUSTER:3 "./cockroach workload run kv $(roachprod pgurl $CLUSTER:3) --timeout 5s --tolerate-errors"
3) Wait a little while to allow enough keys to be written to the system.
4) Start a number of select queries in the background: for n in {1..3}; do for x in {1..100}; do echo "select * from kv.kv where v='abc';" | roachprod sql $CLUSTER:$n & done; done
5) Notice that all the cockroach nodes crash.

[17561.221476] cockroach invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[17629.744740] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-431.scope,task=cockroach,pid=32634,uid=1000
[17629.744773] Out of memory: Killed process 32634 (cockroach) total-vm:1735716kB, anon-rss:206688kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:964kB oom_score_adj:0

Expected behavior The nodes should not crash. On a multi-tenant system, a single tenant could bring down the entire cluster.

Additional data / screenshots Depending on the number of concurrent select(*), we get different behaviors:
10 queries/node - p50 latency on writes jumps from 2ms -> 900ms, QPS goes from 4000 -> 10
20 queries/node - p50 latency goes to 2s, QPS goes to 1-2
40 queries/node - p50 latency goes to 10s, QPS goes to 0 (timeouts). Causes liveness failures.
100 queries/node - all nodes crash with OOM

CPU Profile: profile.pb.gz
Heap profile (at 30 concurrency): profile.pb.gz
Note that the heap profile doesn't account for all the memory. Here is a line from the cockroach-health log at about the same time as the heap profile:
I240424 19:20:08.898205 324 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n2] 734 runtime stats: 6.2 GiB RSS, 982 goroutines (stacks: 17 MiB), 2.7 GiB/4.1 GiB Go alloc/total (heap fragmentation: 27 MiB, heap reserved: 1.3 GiB, heap released: 1.5 GiB), 2.1 GiB/2.4 GiB CGO alloc/total (9.0 CGO/sec), 394.5/3.1 %(u/s)time, 0.0 %gc (188x), 569 KiB/622 KiB (r/w)net

Environment: This likely occurs on all releases. It was tested on 24.1/master.

Additional context We have seen customer cases where heavy queries can cause either liveness failures or OOMs.

Jira issue: CRDB-38160

sumeerbhola commented 2 months ago

Thanks for the reproduction! I have tagged this with KV and SQL for initial investigation, and removed admission from the title, since AC does no memory tracking or memory-based admission.

Glancing at the heap profile, which only accounts for 658MB, all of the memory usage is in MVCCScanToBytes, in the pebbleResults, which should be covered by the KV memory accounting. It would be good to have a profile that captures the high memory usage.

yuzefovich commented 2 months ago

This query performs a full scan of the kv table, and each KV BatchRequest is issued with a TargetBytes limit of 10MiB. When the first BatchRequest is sent, we only pre-reserve 1KiB in the memory accounting system; when the response comes back, we reconcile the accounting with the actual memory footprint, and for the 2nd and all subsequent batches we keep a reasonable reservation. If many queries start around the same time on the same gateway, we can issue a huge number of such BatchRequests while only pre-reserving a fraction of the needed memory, thus overloading the KV server. Ideally, the KV server would protect itself.
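A minimal Go sketch of the gap described above, using hypothetical types (memAccount, grow, resize) rather than CockroachDB's actual txnKVFetcher or memory-accounting APIs: each query pre-reserves only a token 1KiB before its first BatchRequest and reconciles only after the ~10MiB response has arrived, so many concurrent scans can pass the pre-reservation check while materializing far more memory than the budget allows.

```go
package main

import (
	"errors"
	"fmt"
)

// memAccount is a hypothetical stand-in for a per-query memory account
// charged against a shared node-wide budget; it is not CockroachDB's API.
type memAccount struct {
	budget   *int64 // shared remaining budget, in bytes
	reserved int64  // bytes currently reserved by this account
}

func (a *memAccount) grow(n int64) error {
	if *a.budget < n {
		return errors.New("memory budget exhausted")
	}
	*a.budget -= n
	a.reserved += n
	return nil
}

// resize adjusts the reservation to the actual footprint after the fact.
func (a *memAccount) resize(actual int64) error {
	return a.grow(actual - a.reserved)
}

const (
	targetBytes    = 10 << 20 // 10MiB TargetBytes limit per BatchRequest
	initialReserve = 1 << 10  // 1KiB pre-reservation before the first batch
)

func main() {
	budget := int64(256 << 20) // pretend the node has a 256MiB budget
	accounts := make([]*memAccount, 0, 100)

	// Phase 1: 100 concurrent full scans each send their first BatchRequest,
	// pre-reserving only 1KiB apiece -- all of them fit comfortably.
	for i := 0; i < 100; i++ {
		acc := &memAccount{budget: &budget}
		if err := acc.grow(initialReserve); err != nil {
			panic(err) // never hit: 100KiB total against a 256MiB budget
		}
		accounts = append(accounts, acc)
	}

	// Phase 2: the ~10MiB responses arrive; reconciliation happens only now,
	// after the response bytes have already been materialized on the node.
	var unaccounted int64
	for _, acc := range accounts {
		if err := acc.resize(targetBytes); err != nil {
			unaccounted += targetBytes
		}
	}
	fmt.Printf("budget left: %d MiB, response bytes in memory but unaccounted: %d MiB\n",
		budget>>20, unaccounted>>20)
}
```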

In the kvstreamer client we address this limitation by estimating the expected response size and reserving that up front, then reconciling the accounting after having received the response. We could do something similar in the txnKVFetcher, but I'd rather address #82164, which tracks using the Streamer to power scans (in addition to lookup and index joins). We do have a prototype for that in #103659 - it might be an interesting experiment to see how the system behaves on a binary with that prototype. A sketch of the estimate-up-front idea follows.
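For contrast, a sketch of the kvstreamer-style approach under the same hypothetical 256MiB budget as the previous snippet (again not the real Streamer code): reserve an estimate of the response size before sending and reconcile to the actual size once the response arrives, so excess requests are rejected (or could be queued) before their responses are materialized.

```go
package main

import (
	"errors"
	"fmt"
)

// budget is a hypothetical shared node-wide memory budget, in bytes.
type budget struct{ remaining int64 }

func (b *budget) reserve(n int64) error {
	if b.remaining < n {
		return errors.New("memory budget exhausted")
	}
	b.remaining -= n
	return nil
}

const targetBytes = 10 << 20 // 10MiB TargetBytes limit per BatchRequest

// sendWithEstimate reserves an up-front estimate of the response size
// before issuing the request, then reconciles to the actual size.
func sendWithEstimate(b *budget, estimate, actual int64) error {
	if err := b.reserve(estimate); err != nil {
		return err // request is never sent; nothing is materialized
	}
	// Reconcile once the real response size is known (releases any excess).
	return b.reserve(actual - estimate)
}

func main() {
	b := &budget{remaining: 256 << 20} // same pretend 256MiB budget
	sent, rejected := 0, 0
	for i := 0; i < 100; i++ {
		// Estimate pessimistically at the TargetBytes limit; a refined
		// estimate could come from the sizes of earlier responses.
		if err := sendWithEstimate(b, targetBytes, targetBytes); err != nil {
			rejected++
			continue
		}
		sent++
	}
	fmt.Printf("sent: %d, rejected before sending: %d, budget left: %d MiB\n",
		sent, rejected, b.remaining>>20)
}
```

The trade-off in this sketch is that a pessimistic estimate throttles concurrency even when responses turn out to be small, which is why reconciling the accounting after each response, as described above, matters.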