shailendra-patel opened 1 month ago
Hi @shailendra-patel, please add branch-* labels to identify which branch(es) this C-bug affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Hi @shailendra-patel, please add branch-* labels to identify which branch(es) this GA-blocker affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
cc @cockroachdb/cdc
just had a look from the cdc side. some observations
looking at (go) mem profiles, we have a lot more inuse mem in pebble compactions during the bad time, up to 25%. speaks to compaction issues
changefeeds were having trouble catching up -- 2 of them never finished before we canceled them -- which is odd because they were created with initial_scan=no :joy: here's what i think was happening:
i'm not sure if changefeeds are entirely to blame for the oomkill given the first point about pebble compactions. even if that's a non-issue, i think there is probably room for improvement here, but i don't see evidence of a ga-blocking bug.
@wenyihu6 / @rharding6373 do you concur?
We're sending this to storage -- assigned @itsbilal since you're oncall.
The changefeed activity seems suboptimal (we have some takeaways here for future recommendations), but given the memory usage from compactions and that the changefeeds appear to be stuck in catch-up scans (which read directly from the storage layer), storage seems like the right place for the next steps of the investigation.
Sounds good - thanks for the reassign @rharding6373. There's a lot more memory usage in memtables on this node than there should ever be:
| | | | ingested | moved | written | | amp | multilevel
level | tables size val-bl vtables | score | in | tables size | tables size | tables size | read | r w | top in read
------+-----------------------------+-------+-------+--------------+--------------+--------------+-------+----------+------------------
0 | 0 0B 0B 0 | 0.00 | 188GB | 1.1K 970KB | 0 0B | 16K 19GB | 0B | 0 0.1 | 0B 0B 0B
1 | 0 0B 0B 0 | 0.00 | 0B | 0 0B | 0 0B | 0 0B | 0B | 0 0.0 | 0B 0B 0B
2 | 18 52MB 3.3MB 0 | 0.96 | 8.9GB | 125 111KB | 3.0K 9.2GB | 53K 162GB | 162GB | 1 18.2 | 0B 0B 0B
3 | 53 261MB 34MB 0 | 0.92 | 11GB | 327 290KB | 1.9K 4.8GB | 7.8K 28GB | 29GB | 1 2.5 | 0B 0B 0B
4 | 161 1.6GB 356MB 1 | 0.84 | 17GB | 550 498KB | 382 1023MB | 17K 160GB | 161GB | 1 9.6 | 2.0GB 6.0GB 21GB
5 | 634 2.3GB 554MB 2 | 1.00 | 15GB | 3.8K 4.9MB | 158 368MB | 133K 497GB | 499GB | 1 34.0 | 356MB 2.7GB 8.5GB
6 | 6.2K 167GB 2.5GB 57 | - | 11GB | 20K 224GB | 294 44MB | 24K 1009GB | 1.0TB | 1 93.7 | 767KB 26MB 58MB
total | 7.0K 171GB 3.4GB 60 | - | 412GB | 26K 224GB | 5.8K 15GB | 252K 2.2TB | 1.9TB | 5 5.6 | 2.3GB 8.7GB 30GB
---------------------------------------------------------------------------------------------------------------------------------------
WAL: 1 files (48MB) in: 186GB written: 188GB (1% overhead) failover: (switches: 6, primary: ‹22h19m21.694006978s›, secondary: ‹59.301058386s›)
Flushes: 5296
Compactions: 48680 estimated debt: 0B in progress: 0 (0B)
default: 17201 delete: 2590 elision: 6356 move: 5835 read: 45 tombstone-density: 16653 rewrite: 0 copy: 0 multi-level: 1044
MemTables: 1 (64MB) zombie: 50 (3.1GB)
Zombie tables: 2410 (29GB, local: 29GB)
Backing tables: 38 (1.8GB)
Virtual tables: 60 (1.4GB)
Local tables size: 172GB
Compression types: snappy: 6975 unknown: 60
Block cache: 0 entries (0B) hit rate: 77.1%
Table cache: 78K entries (60MB) hit rate: 99.9%
Secondary cache: 0 entries (0B) hit rate: 0.0%
Snapshots: 3 earliest seq num: 1019853557
Table iters: 621
Filter utility: 72.8%
Ingestions: 3605 as flushable: 928 (53GB in 6493 tables)
Cgo memory usage: 24GB block cache: 541MB (data: 392MB, maps: 148MB, entries: 5.4KB) memtables: 23GB
Closer inspection: we're at ~3-4 GB in zombie memtables per store, which adds up to ~23-24 GB of excess memtables across the 8 stores combined. That crowds out any other block cache usage and significantly slows down all disk reads, which appeared to us as slow compactions and in turn drove more concurrent compactions. The compactions themselves aren't the issue, though, other than the fact that we probably should limit compaction concurrency on this cluster to 2 (or 1) per store, down from the default of 3. I remember that being in the drt-large runbook, but the env var was not set on the cluster during this incident.
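For concreteness, here's a minimal sketch of what that cap corresponds to at the Pebble level, assuming Pebble's `Options.MaxConcurrentCompactions` field; in the actual deployment this is driven by the runbook's env var rather than hand-built options, so treat the snippet as illustrative only:

```go
// Illustrative sketch: capping Pebble compaction concurrency at 2 per store,
// per the runbook recommendation above. Not how CockroachDB configures this
// in production (there it flows from the env var / settings); this just shows
// the Pebble option the cap ultimately maps to.
package storesketch

import "github.com/cockroachdb/pebble"

func openStore(dir string) (*pebble.DB, error) {
	opts := &pebble.Options{
		// Down from the effective default of 3 concurrent compactions per store.
		MaxConcurrentCompactions: func() int { return 2 },
	}
	return pebble.Open(dir, opts)
}
```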
It's likely that whatever was happening on the cluster at a higher level at that time was not adhering to good iterator reuse semantics. If we keep an iterator open for a long time, we accrue higher memory usage in zombie memtables. If we want to observe a consistent Pebble state for a longer period of time, the right primitive for that is snapshots, not iterators.
In these experiments we limited catch-up iterators per store to 64, since that is what we recommended here https://github.com/cockroachdb/replicator/wiki/Changefeed-Performance#low-latency-configuration and in the changefeed best practices for large clusters guide https://docs.google.com/document/d/1AxLmwEvs68G6ClEq2Klso6ZsTSA-tmN6IMwOFPzvJuM/edit?tab=t.0#heading=h.wp6vsgdab22a. It is possible that we are opening too many at once. In addition, some of the catch-up scans were observed taking a long time to complete, which could be a contributing factor.
Thank you @itsbilal, @wenyihu6, and @asg0451 for your investigation! To summarize the conclusion: 64 iterators x 8 stores is too many when there are 300k ranges and lots of chaos. This behavior already exists in older versions, and the default iterator limit is 16, so we don't believe it's a GA blocker.
There is still a scalability issue to be fixed here: as the number of iterators, changefeeds, and ranges goes up, the node could OOM. @itsbilal suggested using snapshots instead of an iterator for catch-up scans in this thread: https://cockroachlabs.slack.com/archives/C05FHJJ0MD0/p1730322221146769?thread_ts=1730271321.570269&cid=C05FHJJ0MD0
Sending to KV to evaluate a fix, since KV owns rangefeeds, which is where the iterator is created and managed as part of catch-up scans.
Thanks for cleaning up the issue @rharding6373! In particular, it'd be good if the rangefeed could create a snapshot where it currently creates an iterator, and then spawn shorter-lived iterators off of that snapshot, closing and reopening the iterator every time it scans N keys. That would avoid the excessive pinning of memtables that we saw in this issue, as most of the iterators created off of the snapshot would not pin any memtables. The difference would be most evident in longer scans, such as when initial_scan=true.
See uses of Pebble.NewEventuallyFileOnlySnapshot(), such as in the consistency checker, for how this could work.
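As a rough sketch of that shape, assuming Pebble's EventuallyFileOnlySnapshot API (NewEventuallyFileOnlySnapshot, NewIter); the function, emit callback, and batchSize below are illustrative stand-ins, not the actual rangefeed code:

```go
// Hypothetical sketch of a catch-up scan pinned to an EventuallyFileOnlySnapshot
// instead of a single long-lived iterator. catchUpScan, emit, and batchSize are
// illustrative names, not actual CockroachDB code.
package catchupsketch

import "github.com/cockroachdb/pebble"

const batchSize = 10000 // illustrative: keys scanned per short-lived iterator

func catchUpScan(db *pebble.DB, start, end []byte, emit func(key, val []byte) error) error {
	// Pin a consistent view of [start, end) for the duration of the scan. Unlike a
	// long-lived iterator, the EFOS lets memtables be released once they flush,
	// while still guaranteeing a stable view within the key range.
	efos := db.NewEventuallyFileOnlySnapshot([]pebble.KeyRange{{Start: start, End: end}})
	defer efos.Close()

	next := start
	for {
		iter, err := efos.NewIter(&pebble.IterOptions{LowerBound: next, UpperBound: end})
		if err != nil {
			return err
		}
		// Scan at most batchSize keys with this iterator, then drop it so it
		// stops pinning memtables.
		n := 0
		valid := iter.SeekGE(next)
		for ; valid && n < batchSize; valid = iter.Next() {
			if err := emit(iter.Key(), iter.Value()); err != nil {
				_ = iter.Close()
				return err
			}
			n++
		}
		var resume []byte
		if valid {
			// More keys remain; remember where to resume before closing.
			resume = append([]byte(nil), iter.Key()...)
		}
		if err := iter.Close(); err != nil {
			return err
		}
		if resume == nil {
			return nil // reached the end of [start, end)
		}
		next = resume
	}
}
```

Most of the short-lived iterators here would be created after the EFOS has transitioned to file-only, so they wouldn't pin memtables at all, which is the difference called out above.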
On the drt-large cluster, node 12 was killed due to an Out of Memory (OOM) error. Prior to the OOM event, there was a noticeable increase in cgo memory usage. At the time of this event, 100 changefeeds were running and experiencing lag. Memory and CPU profiles from around the time of the OOM are attached: memory-analysis.zip
Related discussion in slack thread.
Jira issue: CRDB-43801