Closed: rhu713 closed this issue 11 months ago.
Copy-paste from a Slack discussion:

> [sumeer] Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk. One problem we postponed solving was early-rejecting requests if the memory consumption of queued requests was too high (and letting the client retry after backoff). The assumption behind postponing this was that the internal work that generates these memory-hungry AddSSTables had limited concurrency and therefore implicit flow control. Is that not the case?
>
> [ajwerner] That is the case, but on a large enough cluster, the concurrency is high.
>
> [sumeer] Because of the fan-in?
>
> [ajwerner] Yes, every single node eventually ends up blocked on the slow node.
Two things to note:
Copying some internal notes.
I’m reading this again. @dt and @rhu713, are we planning to run these 96-node TPC-E runs again? I’m four months late to this thread, and (a) the grafana metrics are gone, and (b) lots of things have changed in 23.2: (i) replication admission control, and (ii) removal of the AddSST concurrency limits in https://github.com/cockroachdb/cockroach/pull/104861.
Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk. Looking at the heap profile posted in https://github.com/cockroachdb/cockroach/issues/97801, with the concurrency limiter gone, we’ve somewhat reduced the likelihood of OOMs, right? There’s still the fan-in problem, but I wonder if re-running the same experiment will now hit the per-replica proposal quota pool limits of 8MiB, and reduce (but not eliminate) OOM likelihood. We’re no longer queueing on a limiter that’s permitting 1 AddSST at a time (with no view over the memory held by other waiting requests), we’re ingesting them as quickly as AC will let us.
The client<->server protocol changes described in the messages above would still need to happen to completely eliminate server-side OOMs under a large degree of fan-in, right? I can’t really tell how real a problem this still is (re-running the same experiment would be the next clarifying step). If clients only issue more AddSSTs after previous ones have been processed by the server we’re all fanning into, then after the initial burst of AddSSTs (and depletion of client-side memory budgets), subsequent AddSSTs will be issued, in aggregate, only at the rate the single server is ingesting them, right? So we have flow control? The OOM concerns are then only around the initial burst (which I think should be smaller as far as the server is concerned, with the concurrency limiter gone)?
I don't think the DR team currently has any plans to rebuild the 96-node cluster.
@aadityasondhi, could you please see if this is still reproducible as part of the large-scale testing of replication AC? If it is, we can decide where this belongs. Thank you!
Regarding https://github.com/cockroachdb/cockroach/issues/97801#issuecomment-1771401835, we didn't have any plans for "large-scale testing of replication AC". Any problems with replication AC should be as reproducible in a small cluster as in a large one, and we have done the small-cluster experiments via roachtests. If someone experiments with a large cluster and there are issues, we can investigate. Meanwhile, I am closing this. Feel free to reopen if I have misunderstood something.
**Describe the problem**

While initializing the TPC-E workload on a 96-node cluster with 10 million customers, nodes would consistently OOM while backfilling an index on `tpce.public.trade`, one of the largest tables. This is the heap profile of an example node that was using a lot of memory while creating the index:

**To Reproduce**

1. Created a 96-node cluster (+1 workload node).
2. Started the TPC-E workload for 10M customers with `--init`.
3. Eventually, the table imports succeed and the indexes are created. When the index for `tpce.public.trade` starts, nodes consistently die due to OOMs, and I had to restart the cockroach process in order for the index job to proceed.

**Expected behavior**

The index backfill completes without nodes OOMing.
**Additional data / screenshots**

- debug zip: `gs://rui-backup-test/debug/large-cluster/debug.zip`
- tsdump: `gs://rui-backup-test/debug/large-cluster/tsdump.gob`
Jira issue: CRDB-24895