
kv/rangefeed: use snapshots instead of iterators for rangefeed catchup scans #133851

Open shailendra-patel opened 1 month ago

shailendra-patel commented 1 month ago

On the drt-largecluster, node 12 was killed due to an Out of Memory (OOM) error. Prior to the OOM event, there was a noticeable increase in cgo memory usage. At the time of this event, 100 changefeeds were running and experiencing lag.

[Screenshot, 2024-10-30: cgo memory usage climbing on node 12 prior to the OOM]

Memory and CPU profiles captured around the time of the OOM are attached: memory-analysis.zip

Related discussion in Slack thread.

Jira issue: CRDB-43801

blathers-crl[bot] commented 1 month ago

Hi @shailendra-patel, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 month ago

Hi @shailendra-patel, please add branch-* labels to identify which branch(es) this GA-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/cdc

asg0451 commented 1 month ago

Just had a look from the CDC side. Some observations:

Looking at the (Go) memory profiles, we have a lot more in-use memory in pebble compactions during the bad period, up to 25%. That points to compaction issues.

Changefeeds were having trouble catching up -- 2 of them never finished before we canceled them -- which is odd because they were created with initial_scan=no :joy: Here's what I think was happening:

I'm not sure changefeeds are entirely to blame for the OOM kill, given the first point about pebble compactions. Even if that turns out to be a non-issue, I think there is probably room for improvement here, but I don't see evidence of a GA-blocking bug.

@wenyihu6 / @rharding6373 do you concur?

rharding6373 commented 1 month ago

We're sending this to storage -- assigned @itsbilal since you're oncall.

The changefeed activity seems suboptimal (we have some takeaways here for future recommendations), but given the memory usage from compactions and the fact that changefeeds appear to be stuck in catch-up scans (which read directly from the storage layer), storage seems like the right place for the next steps of the investigation.

itsbilal commented 1 month ago

Sounds good - thanks for the reassign @rharding6373. There's a lot more memory usage in memtables on this node than there should ever be:

      |                             |       |       |   ingested   |     moved    |    written   |       |    amp   |     multilevel
level | tables  size val-bl vtables | score |   in  | tables  size | tables  size | tables  size |  read |   r   w  |    top   in  read
------+-----------------------------+-------+-------+--------------+--------------+--------------+-------+----------+------------------
    0 |     0     0B     0B       0 |  0.00 | 188GB |  1.1K  970KB |     0     0B |   16K   19GB |    0B |   0  0.1 |    0B    0B    0B
    1 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0 |    0B    0B    0B
    2 |    18   52MB  3.3MB       0 |  0.96 | 8.9GB |   125  111KB |  3.0K  9.2GB |   53K  162GB | 162GB |   1 18.2 |    0B    0B    0B
    3 |    53  261MB   34MB       0 |  0.92 |  11GB |   327  290KB |  1.9K  4.8GB |  7.8K   28GB |  29GB |   1  2.5 |    0B    0B    0B
    4 |   161  1.6GB  356MB       1 |  0.84 |  17GB |   550  498KB |   382 1023MB |   17K  160GB | 161GB |   1  9.6 | 2.0GB 6.0GB  21GB
    5 |   634  2.3GB  554MB       2 |  1.00 |  15GB |  3.8K  4.9MB |   158  368MB |  133K  497GB | 499GB |   1 34.0 | 356MB 2.7GB 8.5GB
    6 |  6.2K  167GB  2.5GB      57 |     - |  11GB |   20K  224GB |   294   44MB |   24K 1009GB | 1.0TB |   1 93.7 | 767KB  26MB  58MB
total |  7.0K  171GB  3.4GB      60 |     - | 412GB |   26K  224GB |  5.8K   15GB |  252K  2.2TB | 1.9TB |   5  5.6 | 2.3GB 8.7GB  30GB
---------------------------------------------------------------------------------------------------------------------------------------
WAL: 1 files (48MB)  in: 186GB  written: 188GB (1% overhead) failover: (switches: 6, primary: ‹22h19m21.694006978s›, secondary: ‹59.301058386s›)
Flushes: 5296
Compactions: 48680  estimated debt: 0B  in progress: 0 (0B)
             default: 17201  delete: 2590  elision: 6356  move: 5835  read: 45  tombstone-density: 16653  rewrite: 0  copy: 0  multi-level: 1044
MemTables: 1 (64MB)  zombie: 50 (3.1GB)
Zombie tables: 2410 (29GB, local: 29GB)
Backing tables: 38 (1.8GB)
Virtual tables: 60 (1.4GB)
Local tables size: 172GB
Compression types: snappy: 6975 unknown: 60
Block cache: 0 entries (0B)  hit rate: 77.1%
Table cache: 78K entries (60MB)  hit rate: 99.9%
Secondary cache: 0 entries (0B)  hit rate: 0.0%
Snapshots: 3  earliest seq num: 1019853557
Table iters: 621
Filter utility: 72.8%
Ingestions: 3605  as flushable: 928 (53GB in 6493 tables)
Cgo memory usage: 24GB  block cache: 541MB (data: 392MB, maps: 148MB, entries: 5.4KB)  memtables: 23GB

itsbilal commented 1 month ago

On closer inspection: we're at ~3-4 GB in zombie memtables per store, which adds up to ~23-24 GB of excess memtables across the 8 stores combined. That crowds out any other block cache usage and significantly slows down all disk reads. It showed up to us as slow compactions, which in turn resulted in more concurrent compactions. But the compactions themselves aren't the issue (other than the fact that we should probably limit compaction concurrency on this cluster to 2 or 1 per store, down from the default of 3; I remember that being in the drt-large runbook, but the env var was not set on the cluster during this incident).

It's likely that whatever was happening on the cluster at a higher level at that time was not adhering to good iterator lifecycle semantics. If we keep an iterator open for a long time, we accrue memory in zombie memtables. If we want to observe a consistent Pebble state for a longer period of time, the right primitive for that is a snapshot, not an iterator.
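
To make that concrete, here's a minimal sketch against the pebble library API directly (not CockroachDB's storage wrappers). The key bounds and flow are made up for illustration, and it assumes a recent pebble version where NewIter returns an error:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	lo, hi := []byte("a"), []byte("z")

	// Long-lived iterator: it pins the memtables that were live when it was
	// created, so if the scan is slow, flushed memtables linger as zombies
	// for as long as the iterator stays open.
	iter, err := db.NewIter(&pebble.IterOptions{LowerBound: lo, UpperBound: hi})
	if err != nil {
		log.Fatal(err)
	}
	for valid := iter.First(); valid; valid = iter.Next() {
		_ = iter.Value() // slow per-key processing keeps those memtables pinned
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}

	// Snapshot: it pins only a sequence number (keeping compactions from
	// dropping versions it can still see), not memtables, so it is cheap to
	// hold open. Iterators created from it can be closed and reopened at any
	// point and still observe the same consistent state.
	snap := db.NewSnapshot()
	defer snap.Close()

	snapIter, err := snap.NewIter(&pebble.IterOptions{LowerBound: lo, UpperBound: hi})
	if err != nil {
		log.Fatal(err)
	}
	_ = snapIter.First()
	if err := snapIter.Close(); err != nil { // releases memtable references early
		log.Fatal(err)
	}

	// ...later: reopen from the same snapshot and see the exact same state.
	snapIter, err = snap.NewIter(&pebble.IterOptions{LowerBound: lo, UpperBound: hi})
	if err != nil {
		log.Fatal(err)
	}
	for valid := snapIter.First(); valid; valid = snapIter.Next() {
		_ = snapIter.Value()
	}
	if err := snapIter.Close(); err != nil {
		log.Fatal(err)
	}
}
```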

wenyihu6 commented 1 month ago

In these experiments, we limited catch-up iterators to 64 per store, since that is what we recommended here https://github.com/cockroachdb/replicator/wiki/Changefeed-Performance#low-latency-configuration and in the changefeed best practices for large clusters guide https://docs.google.com/document/d/1AxLmwEvs68G6ClEq2Klso6ZsTSA-tmN6IMwOFPzvJuM/edit?tab=t.0#heading=h.wp6vsgdab22a. It is possible that we are opening too many at once. In addition, some of the catch-up scans were observed taking a long time to complete, which could be a contributing factor.

rharding6373 commented 1 month ago

Thank you @itsbilal, @wenyihu6, and @asg0451 for your investigation! To summarize the conclusion: 64 iterators x 8 stores is too many when there are 300k ranges and lots of chaos. This behavior already exists in older versions, and the default iterator limit is 16, so we don't believe it's a GA blocker.

There is still a scalability issue to be fixed here: as the number of iterators, changefeeds, and ranges goes up, the node can OOM. @itsbilal suggested using snapshots instead of iterators for catch-up scans in this thread: https://cockroachlabs.slack.com/archives/C05FHJJ0MD0/p1730322221146769?thread_ts=1730271321.570269&cid=C05FHJJ0MD0

Sending to KV to evaluate a fix, since KV owns rangefeeds, which is where the iterator is created and managed as part of catch-up scans.

itsbilal commented 1 month ago

Thanks for cleaning up the issue, @rharding6373! In particular, it'd be good if the rangefeed could create a snapshot where it currently creates an iterator, and then spawn shorter-lived iterators off of that snapshot, closing and reopening an iterator every time it scans N keys. That would avoid the excessive pinning of memtables that we saw in this issue, as most of the iterators created off of the snapshot would not be pinning any memtables. The difference would be most evident in scans that take longer, such as when initial_scan=true.

See uses of Pebble.NewEventuallyFileOnlySnapshot(), such as in the consistency checker, for how this could work:

https://github.com/cockroachdb/cockroach/blob/0769dd9188f65dce3c4fb4b5b66f906782b3923d/pkg/kv/kvserver/replica_consistency.go#L691
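
As a rough sketch of how that could look at the pebble level (illustrative only: catchUpScan, emit, and batchSize are made-up names, the real rangefeed plumbing goes through pkg/storage, and this assumes pebble's NewEventuallyFileOnlySnapshot takes a slice of pebble.KeyRange with a NewIter that mirrors Snapshot.NewIter):

```go
package rangefeedsketch

import "github.com/cockroachdb/pebble"

// catchUpScan reads [start, end) at a fixed point in time without holding any
// single iterator open for the whole scan. The EFOS is scoped to the range's
// span and stops depending on memtables once that span has been flushed, so
// the long-lived handle is cheap; only the short-lived iterators opened every
// batchSize keys briefly pin memtables.
func catchUpScan(db *pebble.DB, start, end []byte, batchSize int, emit func(k, v []byte)) error {
	efos := db.NewEventuallyFileOnlySnapshot([]pebble.KeyRange{{Start: start, End: end}})
	defer efos.Close()

	resume := append([]byte(nil), start...)
	for {
		iter, err := efos.NewIter(&pebble.IterOptions{LowerBound: resume, UpperBound: end})
		if err != nil {
			return err
		}
		n := 0
		var last []byte
		for valid := iter.First(); valid && n < batchSize; valid = iter.Next() {
			last = append(last[:0], iter.Key()...)
			emit(iter.Key(), iter.Value()) // emit must copy if it retains the slices
			n++
		}
		if err := iter.Close(); err != nil {
			return err
		}
		if n < batchSize {
			return nil // catch-up span exhausted
		}
		// Close and reopen the iterator every batchSize keys; the snapshot
		// keeps the view consistent across reopens.
		resume = append(last, 0)
	}
}
```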