
kvserver: raft OOM when catching up node with many ranges and large rows #105338

Open erikgrinaker opened 1 year ago

erikgrinaker commented 1 year ago

While working on #103288, I spun up a 3-node n2-highcpu-8 cluster (8 vCPUs, 8 GB memory) with a bank workload writing 10 KB rows across 35k ranges. After some time, I took down one of the nodes for about 30 minutes. When I reintroduced it to the cluster, it continually OOMed on startup, with heap profiles showing that all memory usage came from Raft request decoding (likely MsgApps kept around in the unstable log). Increasing memory from 8 GB to 32 GB was not sufficient to resolve the OOMs.
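For a sense of scale, here is a rough back-of-envelope sketch of how much decoded MsgApp data a catching-up node can end up pinning at once. The per-range in-flight cap and the fraction of ranges appending concurrently are illustrative assumptions, not CockroachDB's or etcd/raft's actual defaults:

```go
// A rough, illustrative back-of-envelope for how much memory decoded MsgApp
// entries can pin on a node catching up many ranges at once. The constants
// are assumptions for illustration only, not actual defaults.
package main

import "fmt"

func main() {
	const (
		ranges           = 35_000  // ranges needing catch-up, per the repro
		inflightPerRange = 4 << 20 // assumed per-range cap on in-flight MsgApp bytes (4 MiB)
		concurrentShare  = 0.25    // assumed fraction of ranges appending at the same time
	)
	worstCase := float64(ranges) * float64(inflightPerRange) * concurrentShare
	fmt.Printf("rough pinned-entry estimate: %.1f GiB\n", worstCase/(1<<30))
	// With these assumptions ~34 GiB of decoded entries can be outstanding at
	// once, which is consistent with 32 GB of RAM still not being enough.
}
```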

Rough repro:

roachprod create -n 4 --gce-machine-type n2-highcpu-8 --local-ssd=false --gce-pd-volume-size 2000 grinaker-lease

roachprod start grinaker-lease:1-3

SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 4;

./cockroach workload init bank --rows 100000000 --ranges 35000 --payload-size 10000 --data-loader IMPORT $PGURLS
./cockroach workload run bank --rows 100000000 --batch-size 100 --payload-size 10000 --concurrency 64 $PGURLS

SET CLUSTER SETTING kv.expiration_leases_only.enabled = true;

Let the workload run for 20 minutes. Stop one of the nodes, keep it down for 20 minutes, restart it.

35k ranges is probably excessive here; try e.g. 20k ranges. kv0 with large rows/batches probably does the trick too. The initial import here will take about 5 hours; a smaller initial dataset probably works too.
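To catch the growth as it happens during the repro, it helps to snapshot heap profiles from the restarting node every few seconds. A minimal sketch, assuming an insecure cluster whose DB Console HTTP port serves the standard Go pprof endpoint at /debug/pprof/heap (adjust host, port, and auth for your setup):

```go
// Periodically snapshot heap profiles from a node while it catches up.
// Assumes an insecure cluster exposing the standard Go pprof endpoint on its
// HTTP port; the address below is a placeholder. Stop with Ctrl-C.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://localhost:8080/debug/pprof/heap" // hypothetical node HTTP address
	for i := 0; ; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, "snapshot failed:", err)
		} else {
			name := fmt.Sprintf("heap.%03d.pb.gz", i)
			if f, err := os.Create(name); err == nil {
				_, _ = io.Copy(f, resp.Body)
				f.Close()
				fmt.Println("wrote", name)
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second)
	}
}
```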

Jira issue: CRDB-28990

Epic: CRDB-39898

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

williamkulju commented 11 months ago

Rediscovered this issue on the 23.2 scale test cluster. Thread with details is here.

lyang24 commented 7 months ago

While trying to initialize the bank workload on our internal infrastructure, I hit what I believe is a bug in the crdb workload tool. Let me open a separate issue. cc @pav-kv

(base) eyang@HQ-C02FC4D6MD6T scripts % cockroach workload init bank --rows 1000000000 --ranges 350000 --payload-bytes 10000 --data-loader IMPORT {URL"
I240318 23:01:38.484242 1 ccl/workloadccl/fixture.go:342  [-] 1  starting import of 1 tables
Error: importing fixture: importing table bank: pq: at or near "(": syntax error
lyang24 commented 7 months ago

Noting some other observations on pausing a node during testing: the workload was interrupted with this message when I stopped a node.

Error: pq: result is ambiguous: error=ba: Put [/Table/308/1/87236004/0], EndTxn(parallel commit) [/Table/308/1/87236004/0], [txn: 1234c4f3] RPC error: grpc: error reading from server: read tcp 10.138.46.12:49619->10.142.36.22:26856: use of closed network connection [code 14/Unavailable] [propagate] (last error: transaction 1234c4f3-f121-48ac-949d-d4aeb22b0a13 with sequence 1 prevented from changing write timestamp from 1710956484.440837718,0 to 1710956489.408392875,2 due to ambiguous replay protection
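That ambiguous-result error is expected when a node is stopped mid-commit: the client cannot tell whether the transaction applied. A minimal sketch of how a load generator might handle it, assuming lib/pq, that ambiguous results surface as SQLSTATE 40003, and an illustrative table shape (not the bank workload's actual schema access):

```go
// Sketch of handling CockroachDB "result is ambiguous" errors on a write path.
// Assumes lib/pq and that ambiguous results surface as SQLSTATE 40003; the
// table, columns, and connection URL are illustrative placeholders.
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"log"

	"github.com/lib/pq"
)

func writeRow(db *sql.DB, id int, payload []byte) error {
	_, err := db.Exec(`UPSERT INTO bank (id, payload) VALUES ($1, $2)`, id, payload)
	var pqErr *pq.Error
	if errors.As(err, &pqErr) && pqErr.Code == "40003" {
		// Ambiguous: the commit may or may not have applied. Verify by reading
		// back before deciding whether to retry, rather than blindly re-issuing.
		var got []byte
		if qErr := db.QueryRow(`SELECT payload FROM bank WHERE id = $1`, id).Scan(&got); qErr == nil && string(got) == string(payload) {
			return nil // the write did land
		}
		return fmt.Errorf("ambiguous write for id %d, needs retry: %w", id, err)
	}
	return err
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/bank?sslmode=disable") // hypothetical URL
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := writeRow(db, 1, []byte("x")); err != nil {
		log.Fatal(err)
	}
}
```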

lyang24 commented 7 months ago

The smallest testing cluster I could find is a 30-node multi-region cluster (3 DCs, 10 nodes per DC). During testing I took out two nodes and rejoined them after 20 minutes; their memory usage was significantly higher than on the rest of the nodes (18 GB Go allocated vs. 2 GB) once they rejoined, and it came back down afterward. Captured some memory profiles; will share on a separate channel. (Screenshots attached: normal node, unpaused node.)
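For comparing nodes like this, a small sketch that scrapes Go allocated bytes from each node's Prometheus endpoint; it assumes an insecure cluster, that the endpoint is /_status/vars, and that the metric is exported as sys_go_allocbytes (verify both against your version). The host names are placeholders:

```go
// Compare Go allocated bytes across nodes by scraping each node's Prometheus
// endpoint. The endpoint path, metric name, and hosts are assumptions to
// verify against your CockroachDB version and cluster.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

func goAllocBytes(host string) (float64, error) {
	resp, err := http.Get("http://" + host + "/_status/vars")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "sys_go_allocbytes") {
			fields := strings.Fields(line)
			return strconv.ParseFloat(fields[len(fields)-1], 64)
		}
	}
	return 0, fmt.Errorf("metric not found on %s", host)
}

func main() {
	// Hypothetical HTTP addresses of a "normal" node and the node that was paused.
	for _, host := range []string{"node5:8080", "node30:8080"} {
		b, err := goAllocBytes(host)
		if err != nil {
			fmt.Println(host, "error:", err)
			continue
		}
		fmt.Printf("%s: %.1f GiB Go allocated\n", host, b/(1<<30))
	}
}
```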

lyang24 commented 7 months ago

node30heap.pb.gz node5heap.pb.gz node30heapRecentAlloc.pb.gz

lyang24 commented 7 months ago

After digging through the /heap_profiler folder, we found memory quickly climbs from 1.5 GB in the first profile to 10 GB in the last profile within 50 seconds; the 10.62 kB allocations under raftpb unmarshaling are what keep growing. I have also observed during the test that this memory spike flattens back to the normal ~1 GB level fairly quickly. (A sketch for pulling the raftpb numbers out of these profiles programmatically follows after the attachments below.)

The 10 GB allocation profile: profile002

first profile memprof.2024-03-20T19_16_03.306.9703810216.pprof.gz

last profile memprof.2024-03-20T19_16_53.251.19656000968.pprof.gz
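For anyone else digging through these, a sketch of summing the raftpb-attributed in-use bytes in two heap profiles and printing the growth. It uses github.com/google/pprof/profile, and the "raftpb" substring match is just a guess at the relevant frames; adjust as needed:

```go
// Sum in-use heap bytes attributed to raft protobuf unmarshaling in two heap
// profiles and print the growth. The "raftpb" function-name match is an
// assumption about which frames matter.
// Usage: go run heapdiff.go <first-profile.pprof.gz> <last-profile.pprof.gz>
package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/google/pprof/profile"
)

func raftpbInuseBytes(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	p, err := profile.Parse(f)
	if err != nil {
		return 0, err
	}
	// Heap profiles carry several sample types; find the inuse_space index.
	idx := -1
	for i, st := range p.SampleType {
		if st.Type == "inuse_space" {
			idx = i
		}
	}
	if idx < 0 {
		return 0, fmt.Errorf("no inuse_space sample type in %s", path)
	}
	var total int64
	for _, s := range p.Sample {
		for _, loc := range s.Location {
			matched := false
			for _, line := range loc.Line {
				if line.Function != nil && strings.Contains(line.Function.Name, "raftpb") {
					matched = true
				}
			}
			if matched {
				total += s.Value[idx]
				break
			}
		}
	}
	return total, nil
}

func main() {
	first, err := raftpbInuseBytes(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	last, err := raftpbInuseBytes(os.Args[2])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("raftpb in-use: %.2f GiB -> %.2f GiB\n", float64(first)/(1<<30), float64(last)/(1<<30))
}
```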

pav-kv commented 7 months ago

Thanks @lyang24. The last profile does look like a repro of this issue. Did you observe other effects this has on the cluster? For example, higher tail latencies, Go scheduling latency, etc.