Open andrewbaptist opened 2 months ago
Thanks for the reproduction! I have tagged this with KV and SQL for initial investigation, and removed admission from the title, since AC does no memory tracking or admission based on memory.
Glancing at the heap profile, which only accounts for 658MB, all the memory usage is in `MVCCScanToBytes` in the `pebbleResults`, which should be accounted for in the KV memory accounting. It would be good to have a profile that shows the high memory usage.
This query performs a full scan of the `kv` table, and each KV `BatchRequest` is issued with a `TargetBytes` limit of 10 MiB. When the first `BatchRequest` is sent, we only pre-reserve 1 KiB in the memory accounting system, but when the response comes back, we reconcile the accounting with the actual memory footprint, and for the second and all subsequent batches we keep a reasonable reservation. If many queries start around the same time on the same gateway, we can issue a huge number of such `BatchRequest`s while pre-reserving only a fraction of the needed memory, thus overloading the KV server. Ideally, the KV server would protect itself.
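To make the failure mode concrete, here is a minimal sketch of the under-reservation pattern described above. The `memAccount` type and its methods are illustrative stand-ins, not CockroachDB's actual `util/mon` API; the constants mirror the 1 KiB pre-reservation and 10 MiB `TargetBytes` mentioned in this issue:

```go
package main

import "fmt"

// memAccount is a toy memory-accounting budget. The real accounting in
// CockroachDB lives in pkg/util/mon; this type is purely illustrative.
type memAccount struct {
	used int64
}

// reserve pre-reserves n bytes before a request is sent.
func (a *memAccount) reserve(n int64) { a.used += n }

// reconcile adjusts the reservation once the actual response size is known.
func (a *memAccount) reconcile(reserved, actual int64) {
	a.used += actual - reserved
}

func main() {
	const (
		initialReservation = 1 << 10  // 1 KiB pre-reserved per first BatchRequest
		targetBytes        = 10 << 20 // 10 MiB actual response size
		concurrentBatches  = 100      // many queries starting at once
	)

	var acct memAccount
	// Phase 1: many BatchRequests are issued concurrently; each only
	// pre-reserves 1 KiB, so the account vastly understates real usage.
	for i := 0; i < concurrentBatches; i++ {
		acct.reserve(initialReservation)
	}
	fmt.Printf("accounted before responses: %d bytes\n", acct.used)
	// Phase 2: responses arrive and the accounting is reconciled, but by
	// then the KV server has already materialized ~1 GiB of responses.
	for i := 0; i < concurrentBatches; i++ {
		acct.reconcile(initialReservation, targetBytes)
	}
	fmt.Printf("accounted after responses: %d bytes\n", acct.used)
}
```

The gap between phase 1 (~100 KiB accounted) and phase 2 (~1 GiB accounted) is the window in which the gateway can overload the KV server.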
In the `kvstreamer` client we address this limitation by estimating the expected response size and reserving that up front, then reconciling the accounting after having received the response. We could do something similar in the `txnKVFetcher`, but I'd rather address #82164, which tracks using the Streamer to power scans (in addition to lookup and index joins). We do have a prototype for that in #103659 - it might be an interesting experiment to see how the system behaves on a binary with that prototype.
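A rough sketch of the estimate-then-reconcile idea used by the streamer client follows. The `responseEstimator` type and its running-average policy are hypothetical simplifications for illustration; the real logic in `pkg/kv/kvclient/kvstreamer` is more sophisticated:

```go
package main

import "fmt"

// responseEstimator sketches the kvstreamer approach: reserve an estimate
// of the response size up front, then reconcile with the actual size and
// fold the observation back into the estimate. Illustrative only.
type responseEstimator struct {
	estimate  int64 // current per-response estimate, in bytes
	observed  int64 // sum of actual response sizes seen so far
	responses int64 // number of responses seen so far
}

// reserveFor returns the bytes to pre-reserve for the next request.
func (e *responseEstimator) reserveFor() int64 { return e.estimate }

// reconcile records an actual response size and updates the estimate to
// the running average of all observed responses.
func (e *responseEstimator) reconcile(actual int64) {
	e.observed += actual
	e.responses++
	e.estimate = e.observed / e.responses
}

func main() {
	// Start from the small initial guess; after the first response the
	// estimate jumps to the observed size, so later requests pre-reserve
	// close to their real footprint instead of 1 KiB.
	e := &responseEstimator{estimate: 1 << 10}
	for i := 0; i < 3; i++ {
		fmt.Printf("pre-reserving %d bytes\n", e.reserveFor())
		e.reconcile(10 << 20) // every response turns out to be 10 MiB
	}
	fmt.Printf("final estimate: %d bytes\n", e.estimate)
}
```

With this policy only the very first request under-reserves; subsequent requests reserve near the actual response size before they are sent.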
Describe the problem
Running multiple queries in parallel can cause nodes to OOM.
To Reproduce
1) Create a 3 node cluster initialized with a KV DB.
2) Begin a background workload to write many KV keys:
```
roachprod ssh $CLUSTER:3 "./cockroach workload run kv $(roachprod pgurl $CLUSTER:3) --timeout 5s --tolerate-errors"
```
3) Wait a little while to allow enough keys to be written to the system.
4) Start a number of select queries in the background:
```
for n in {1..3}; do for x in {1..100}; do echo "select * from kv.kv where v='abc';" | roachprod sql $CLUSTER:$n & done; done
```
5) Notice that all the cockroach nodes crash.
Expected behavior
The nodes should not crash. On a multi-tenant system, one tenant could bring down the entire cluster.
Additional data / screenshots
Depending on the number of concurrent select(*) queries, we get different behaviors:
- 10 queries/node: p50 latency on writes jumps from 2ms to 900ms; QPS goes from 4000 to 10
- 20 queries/node: p50 latency goes to 2s; QPS goes to 1-2
- 40 queries/node: p50 latency goes to 10s; QPS goes to 0 (timeouts); causes liveness failures
- 100 queries/node: all nodes crash with OOM
CPU profile: profile.pb.gz
Heap profile (at 30 concurrency): profile.pb.gz
Note that the heap profile doesn't account for all the memory. Here is a line from the cockroach-health log at about the same time as the heap profile:
I240424 19:20:08.898205 324 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n2] 734 runtime stats: 6.2 GiB RSS, 982 goroutines (stacks: 17 MiB), 2.7 GiB/4.1 GiB Go alloc/total (heap fragmentation: 27 MiB, heap reserved: 1.3 GiB, heap released: 1.5 GiB), 2.1 GiB/2.4 GiB CGO alloc/total (9.0 CGO/sec), 394.5/3.1 %(u/s)time, 0.0 %gc (188x), 569 KiB/622 KiB (r/w)net
Environment: This likely occurs on all releases. It was tested on 24.1/master.
Additional context
We have seen customer cases where heavy queries can cause either liveness failures or OOMs.
Jira issue: CRDB-38160