cockroachdb / cockroach


Nodes consistently OOM during index backfill on large cluster #97801

Closed rhu713 closed 11 months ago

rhu713 commented 1 year ago

Describe the problem
While initializing the TPCE workload on a 96-node cluster with 10 million customers, nodes would consistently OOM while backfilling an index on one of the largest tables, tpce.public.trade.

This is the heap profile of an example node that was using a lot of memory while creating the index:
[screenshot: heap profile, 2023-02-28]

To Reproduce
I created a 96-node cluster (+ 1 workload node):

    roachprod create rui-backup-test-large -n 97 --gce-machine-type n2-standard-16 --local-ssd=false --gce-pd-volume-size=4096 --gce-zones us-central1-a

On node 97:

    roachprod start rui-backup-test-large --racks=96 --env COCKROACH_ROCKSDB_CONCURRENCY=16

Started the TPCE workload for 10M customers with --init:

    sudo docker run cockroachdb/tpc-e:latest --init --customers=10000000 --racks=96 $(cat hosts.txt) -d 10d

Eventually the table imports succeed and the indexes are created. When the index backfill for tpce.public.trade starts, nodes consistently die due to OOMs, and I have to restart the cockroach process for the index job to proceed.

Expected behavior
The index backfill on tpce.public.trade completes without nodes running out of memory.

Additional data / screenshots
debug zip: gs://rui-backup-test/debug/large-cluster/debug.zip
tsdump: gs://rui-backup-test/debug/large-cluster/tsdump.gob

Environment:

Jira issue: CRDB-24895

sumeerbhola commented 1 year ago

Copy-paste from a Slack discussion:

[sumeer] Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk. One problem we postponed solving is rejecting requests early if the memory consumption of queued requests was too high (and letting the client retry after backoff). The assumption behind postponing this was that the internal work that generates these memory-hungry AddSSTables had limited concurrency and therefore implicit flow control. Is that not the case?
[ajwerner] That is the case, but on a large enough cluster, the concurrency is high.
[sumeer] Because of the fan-in?
[ajwerner] Yes, every single node ends up blocked on the slow node eventually.
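
To make the idea concrete, here is a minimal Go sketch of the kind of pre-evaluation check described above: reject an AddSSTable early when the memory held by queued requests exceeds a budget, so the client backs off and retries instead of piling memory onto a slow node. All names here (addSSTQueue, admit, errBackoff) are hypothetical and not CockroachDB's actual API.

    // Hypothetical sketch, not CockroachDB's actual code: bound the memory
    // held by AddSSTable requests that are queued ahead of evaluation, and
    // reject new requests early once the budget is exhausted.
    package ingestsketch

    import (
        "errors"
        "sync"
    )

    var errBackoff = errors.New("AddSSTable queue over memory budget; retry after backoff")

    type addSSTQueue struct {
        mu          sync.Mutex
        queuedBytes int64 // memory held by requests waiting to evaluate
        budgetBytes int64 // cap on queued request memory (hypothetical setting)
    }

    // admit reserves reqBytes of budget or fails fast so the client can back
    // off; the returned release must be called once the request finishes.
    func (q *addSSTQueue) admit(reqBytes int64) (release func(), err error) {
        q.mu.Lock()
        defer q.mu.Unlock()
        if q.queuedBytes+reqBytes > q.budgetBytes {
            return nil, errBackoff
        }
        q.queuedBytes += reqBytes
        return func() {
            q.mu.Lock()
            defer q.mu.Unlock()
            q.queuedBytes -= reqBytes
        }, nil
    }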

msbutler commented 1 year ago

Two things to note:

  1. Rui did not see these issues in 23.1.
  2. I saw a similar sawtooth pattern of AddSSTable requests on the same index backfill with a smaller 22.2 TPCE workload, which I documented in https://github.com/cockroachdb/cockroach/issues/95163

ajwerner commented 1 year ago

There was much internal discussion that prompted this issue; see here.

irfansharif commented 1 year ago

Copying some internal notes.

I’m reading this again. @dt and @rhu713, are we planning to run these 96-node TPC-E runs again? I’m four months too late to this thread, and (a) the Grafana metrics are gone, and (b) lots of things have changed in 23.2: (i) replication admission control, (ii) removal of the AddSSTable concurrency limits in https://github.com/cockroachdb/cockroach/pull/104861.

Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk. Looking at the heap profile posted in https://github.com/cockroachdb/cockroach/issues/97801, with the concurrency limiter gone, we’ve somewhat reduced the likelihood of OOMs, right? There’s still the fan-in problem, but I wonder if re-running the same experiment will now hit the per-replica proposal quota pool limits of 8MiB, and reduce (but not eliminate) OOM likelihood. We’re no longer queueing on a limiter that’s permitting 1 AddSST at a time (with no view over the memory held by other waiting requests), we’re ingesting them as quickly as AC will let us.
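
To illustrate the distinction being drawn here, a rough Go sketch (hypothetical, not the kvserver code): a single-slot concurrency limiter bounds how many AddSSTables evaluate at once but says nothing about the memory held by the requests waiting behind it, whereas a byte-denominated pool in the spirit of the 8 MiB per-replica proposal quota pool caps the bytes admitted at any one time.

    // Hypothetical sketch contrasting the two admission schemes discussed
    // above; neither struct is CockroachDB's actual implementation.
    package quotasketch

    import "sync"

    // slotLimiter: one AddSSTable evaluates at a time, but waiters already
    // hold their SSTs in memory and nothing bounds that aggregate size.
    type slotLimiter struct{ slot chan struct{} }

    func newSlotLimiter() *slotLimiter { return &slotLimiter{slot: make(chan struct{}, 1)} }
    func (l *slotLimiter) acquire()    { l.slot <- struct{}{} }
    func (l *slotLimiter) release()    { <-l.slot }

    // byteQuotaPool: admission is denominated in bytes, so the memory
    // admitted at any instant is capped (e.g. 8 << 20 for an 8 MiB pool).
    type byteQuotaPool struct {
        mu    sync.Mutex
        cond  *sync.Cond
        avail int64
    }

    func newByteQuotaPool(capacity int64) *byteQuotaPool {
        p := &byteQuotaPool{avail: capacity}
        p.cond = sync.NewCond(&p.mu)
        return p
    }

    // acquire blocks until n bytes of quota are free, then takes them.
    func (p *byteQuotaPool) acquire(n int64) {
        p.mu.Lock()
        defer p.mu.Unlock()
        for p.avail < n {
            p.cond.Wait()
        }
        p.avail -= n
    }

    func (p *byteQuotaPool) release(n int64) {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.avail += n
        p.cond.Broadcast()
    }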

The client<->server protocol changes described in the messages above would still need to happen to completely eliminate server-side OOMs with a large degree of fan-in, right? I can’t really tell how real a problem this is anymore (re-running the same experiment would be the next clarifying step). If clients only issue more AddSSTs after previous ones have been processed by the server we’re all fanning into, then after the initial burst of AddSSTs (and depletion of client-side memory budgets), subsequent AddSSTs are only issued, in aggregate, at the rate the single server is ingesting them, right? So we have flow control? The OOM concerns are then only around the initial burst (which I think should be smaller as far as the server is concerned, with the concurrency limiter gone)?
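
A minimal Go sketch of the implicit flow control being described (hypothetical helper names, not the actual backfill code): if each backfill worker sends its next AddSSTable only after the previous one is acknowledged, then once the initial burst has been absorbed, the aggregate send rate cannot exceed the slow server's ingest rate, even without a protocol change.

    // Hypothetical sketch of per-worker serialized AddSSTable issuance.
    package flowsketch

    import "context"

    // sendAddSSTable stands in for the RPC that ships one SST to KV and
    // returns only once the server has processed it.
    type sendAddSSTable func(ctx context.Context, sst []byte) error

    // backfillWorker holds at most one SST's memory and does not issue the
    // next request until the previous response has come back, which is what
    // gives the fan-in target implicit backpressure after the initial burst.
    func backfillWorker(ctx context.Context, ssts <-chan []byte, send sendAddSSTable) error {
        for sst := range ssts {
            if err := send(ctx, sst); err != nil {
                return err // a real implementation would retry with backoff
            }
        }
        return nil
    }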

rhu713 commented 1 year ago

I don't think the DR team currently has any plans to rebuild the 96-node cluster.

shralex commented 11 months ago

@aadityasondhi, could you please see if this is still reproducible as part of the large-scale testing of replication AC? If it is, we can decide where this belongs. Thank you!

sumeerbhola commented 11 months ago

Regarding https://github.com/cockroachdb/cockroach/issues/97801#issuecomment-1771401835, we didn't have any plans for "large-scale testing of replication AC". Any problems with replication AC should be as reproducible in a small cluster as in a large one, and we have done the small-cluster experiments via roachtests. If someone experiments with a large cluster and there are issues, we can investigate. Meanwhile, I am closing this. Feel free to reopen if I have misunderstood something.