celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0

OOM on Light and Bridge nodes #2905

Closed Wondertan closed 11 months ago

Wondertan commented 1 year ago

I found my mainnet LN getting OOMed a day ago on an 8GB DO instance. @renaynay found a BN OOMing on a 16GB instance. There is definitely a leak somewhere, and it's likely to exist in the FN as well.

At this point, there is not much info on why it happens, nor on why we haven't seen this on other networks like Arabica, Mocha, and Robusta.

Wondertan commented 1 year ago

Right now, I am collecting profiles for further investigation.

Wondertan commented 12 months ago

I have been running it with pprof for a day. No suspicious memory growth was noticed, nor anything leaky in the profiles.

The only concerning thing is that Badger eats 80% of RAM in various forms.

Wondertan commented 12 months ago

Running for another 24 hours with continuous profiling. RAM usage is stable at 1.22 GB with a single spike to 1.91 GB.

Clearly, the leak does not happen under normal conditions, and some edge case has to be triggered to see it. It's unclear what exactly one has to do to trigger it, and I don't see any better idea than to continue profiling and observing.

Wondertan commented 11 months ago

New update. My node OOMed again during Badger compaction, which bursts RAM usage. The node uses around 3GB of RAM right before the compaction, and most of the profiles lead to Badger allocations. We have yet to investigate whether this is a Badger v4 bug or whether we are using it incorrectly.

Talking to @Jorropo, he said they observe similar behavior with badger.

boobcactus commented 11 months ago

Just wanted to add that I am experiencing this issue on Arabica using rc6.1. My server has 16GB of memory, and the celestia light process slowly consumes it until the process is OOM-killed.

Wondertan commented 11 months ago

I am running the binary again with GOMEMLIMIT=2GiB and GOGC=25 to make the runtime more conservative about RAM usage.
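
For reference, a minimal sketch (not a change to celestia-node) of setting the same knobs programmatically from inside a Go process, assuming Go 1.19+ for the memory-limit API; exporting GOMEMLIMIT=2GiB and GOGC=25 in the environment has the same effect:

```go
package main

import "runtime/debug"

func main() {
	// Soft memory limit of 2 GiB: the runtime garbage-collects more
	// aggressively as runtime-managed memory approaches this value.
	debug.SetMemoryLimit(2 << 30)

	// GC target of 25%: collect once the live heap grows by 25% instead of
	// the default 100%, trading CPU time for a smaller resident set.
	debug.SetGCPercent(25)
}
```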

Jorropo commented 11 months ago

Such a low GOMEMLIMIT is equal to:

```go
for {
	runtime.GC()
	time.Sleep(time.Second)
}
```

Wondertan commented 11 months ago

@Jorropo, I meant GiB; updated the comment.

Wondertan commented 11 months ago

[pprof heap profile screenshot]

This so far seems to be the leakiest of them all. It looks like Badger transactions are not cleaned up after Commit. In particular, the trace comes from the datastore-backed peerstore's flush.

So it seems like there might be a bug in the DS-backed store, or it's a bug in Badger that was fixed in main but not in the release we use.
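
For context, a minimal sketch of the normal Badger v4 write-transaction lifecycle, assuming the github.com/dgraph-io/badger/v4 API; this is not celestia-node's or the adapter's actual code, just the pattern the profile suggests is not being completed somewhere on this path:

```go
package txnsketch

import (
	badger "github.com/dgraph-io/badger/v4"
)

func putOne(db *badger.DB, key, val []byte) error {
	txn := db.NewTransaction(true) // read-write transaction
	// Discard is a no-op after a successful Commit, but it guarantees the
	// transaction's resources are released on every other exit path.
	defer txn.Discard()

	if err := txn.Set(key, val); err != nil {
		return err
	}
	return txn.Commit()
}
```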

Wondertan commented 11 months ago

boobcactus commented 11 months ago

Did you notice any instability after switching to pebble? I'd be interested in building rc6.1 with pebble swapped in, if that's possible for an end user (like with a go.mod dependency swap).

(I need to keep my LN running 24/7 for Dymension and it keeps going down, so I'm excited for a solution)
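
For what it's worth, a go.mod swap only works when the replacement module keeps the same import path and a compatible API, e.g. a patched fork of the same datastore adapter; replacing Badger with Pebble would need a code change in celestia-node itself, since they are different modules with different APIs. A sketch of the replace-directive form, with placeholder module paths and version:

```
// go.mod sketch; both module paths and the version below are placeholders
replace github.com/example/go-ds-badger4 => github.com/your-fork/go-ds-badger4 v0.0.0-20240101000000-abcdef123456
```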

Wondertan commented 11 months ago

Did you notice any instability after switching to pebble

I think it's too early to draw any conclusions on stability. I need to run it for at least a week to be sure we don't have any side effects after switching.

Wondertan commented 11 months ago

Update on Badger: the code leaks in places where we write a single value to the Badger datastore directly with Put. In places where we use a Batch with multiple writes, it works well.

As an experiment, I tried using Batch inside of Put in the go-ds-badger4 adapter, but it didn't yield any result. Profiles show that Badger still leaks transactions inside the batches used.

[pprof profile screenshot]

At this point, the only clue we have is single-entry writes vs. multiple entries written at the same time, and this is likely to be a bug inside Badger.
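
To make the comparison concrete, a sketch of the two write shapes expressed directly against the Badger v4 API (assumed; this is not the go-ds-badger4 adapter's actual implementation, which maps datastore Put and Batch onto roughly these patterns):

```go
package writepaths

import (
	badger "github.com/dgraph-io/badger/v4"
)

// A single value written through its own transaction, as a lone Put would be.
func putSingle(db *badger.DB, key, val []byte) error {
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, val)
	})
}

// Multiple values written through a WriteBatch, as a datastore Batch would be.
func putBatched(db *badger.DB, kvs map[string][]byte) error {
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for k, v := range kvs {
		if err := wb.Set([]byte(k), v); err != nil {
			return err
		}
	}
	return wb.Flush()
}
```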

Wondertan commented 11 months ago

On pebble:

Simply migrating to Pebble made our LN consume a constant ~300MiB, even during compactions. This is a good result, but we should not rush the migration, as it is very disruptive. Also, considering that LNs won't have a high write load, we may also consider a different KV store engine that is not LSM-tree based and much simpler, likely a B+ tree.
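
For comparison, a minimal sketch of the Pebble API, assuming github.com/cockroachdb/pebble; this is not celestia-node's integration, which would still go through a datastore adapter:

```go
package pebblesketch

import "github.com/cockroachdb/pebble"

func writeOne(path string, key, val []byte) error {
	db, err := pebble.Open(path, &pebble.Options{})
	if err != nil {
		return err
	}
	defer db.Close()

	// pebble.Sync makes the write durable before returning.
	return db.Set(key, val, pebble.Sync)
}
```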

For now, I think we should focus on figuring out a simple patch for Badger and consider a KV store migration in a separate issue.

Wondertan commented 11 months ago

New update on badger:

We were able to rule out the hypothesis below by writing an additional entry every time we did a single Put: even when we write two entries, it still leaks.

At this point, the only clue we have is single-entry writes vs. multiple entries written at the same time, and this is likely to be a bug inside Badger.

However, there is good news: we were able to accelerate the memory growth by rewriting the value for the same key over and over again. This also aligns with the leaking code paths, where we update the state of something under a constant key. So there is enough proof at this point that the bug is in Badger.

The next step is to write a unit test for Badger that demonstrates this and open an issue on their side.
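
A sketch of the accelerated reproduction described above: rewrite the value under one constant key in a tight loop and watch heap growth. It assumes a Badger-backed implementation of the github.com/ipfs/go-datastore interface (such as the go-ds-badger4 adapter); the wiring of the store itself is left out:

```go
package leakrepro

import (
	"context"
	"fmt"
	"runtime"

	ds "github.com/ipfs/go-datastore"
)

func hammerSameKey(ctx context.Context, store ds.Datastore, rounds int) error {
	key := ds.NewKey("/leak-check/constant-key")
	for i := 0; i < rounds; i++ {
		// Each Put overwrites the previous value under the same key.
		if err := store.Put(ctx, key, []byte(fmt.Sprintf("value-%d", i))); err != nil {
			return err
		}
		if i%10_000 == 0 {
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			fmt.Printf("round %d: heap in use: %d MiB\n", i, m.HeapInuse>>20)
		}
	}
	return nil
}
```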

Wondertan commented 11 months ago

The bug is in the DetectConflicts feature of Badger. With it disabled, I have had stable RAM usage for over an hour already. 🎉

This aligns with the observation that rewriting the same key causes issues.
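
For reference, a sketch of turning the option off when opening Badger directly, assuming the github.com/dgraph-io/badger/v4 options API; how the option is threaded through the go-ds-badger4 adapter is not shown here:

```go
package conflictsoff

import (
	badger "github.com/dgraph-io/badger/v4"
)

func openWithoutConflictDetection(path string) (*badger.DB, error) {
	// DetectConflicts tracks the keys each transaction reads so that
	// concurrent transactions can fail with ErrConflict; workloads that
	// don't rely on that check can disable the tracking.
	opts := badger.DefaultOptions(path).WithDetectConflicts(false)
	return badger.Open(opts)
}
```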

Wondertan commented 11 months ago

New final update:

The bug is not in DetectConflicts as I thought. Disabling DetectConflicts just removes one of the leaks caused by the bug, and the leak still required some other condition, because everything was fine when I worked with Badger directly: a pure Badger test that overwrites the same key didn't show any leaks.

But right after the DetectConflicts finding, I kept experimenting with Badger to the point where I reduced RAM usage during sync by up to 10x (more on that later). This allowed me to detect another, more minor leak in Badger that I had seen before but hadn't recognized.

After a day of pure debugging, I found the true culprit of all the problems: an edge case in Badger. Check the patch for it here. Basically, when we flush an empty Batch, the transaction inside of it is never marked as done reading. Flushing an empty Batch happens in the DHT's provider manager: when the manager loads the provider set via ds.Query, the autobatch wrapper internally does a Flush. As this query happens on startup, the batch is empty, hence the issue. (cc @Jorropo, this should explain why IPFS observes the leak with Badger as well.)

Badger internals of the bug: when we flush a datastore Batch with no data, Badger does not notify its internal synchronization primitive (the WaterMark) that the read side of the transaction has ended. Surprisingly, the bug does not leak the transaction or batch directly, but it does not free up resources inside the WaterMark and does not advance the WaterMark's internal monotonic counter. The leak inside the WaterMark is the minor leak I mentioned in the previous paragraph, and the stalled counter causes the DetectConflicts feature to leak, which is the original leak.
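
A sketch of the trigger at the go-datastore level, assuming a Badger-backed ds.Batching implementation: obtain a batch, write nothing into it, and commit; with the unpatched combination described above, this is the shape that leaves the read side of the underlying transaction unmarked in the WaterMark:

```go
package emptybatch

import (
	"context"

	ds "github.com/ipfs/go-datastore"
)

func flushEmptyBatch(ctx context.Context, store ds.Batching) error {
	b, err := store.Batch(ctx)
	if err != nil {
		return err
	}
	// No Put/Delete calls: the batch stays empty, mirroring what the DHT
	// provider manager's startup query ends up doing via autobatch.
	return b.Commit(ctx)
}
```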

The mystery is solved, and now we need to land the patch and apply all the optimizations I've collected along the way.

P.S. Debug branches:

Wondertan commented 11 months ago

This bug has existed since the start of celestia-node, literally. We faced it multiple times whenever we played with Badger:

distractedm1nd commented 11 months ago

Yes, the WaterMark has to do with GC, and we have seen the WaterMark show up in Pyroscope before. We fixed the leak, but then we changed the Badger parameters again recently, so it's no longer fixed, I assume.

Wondertan commented 11 months ago

We fixed the leak, but then we changed the Badger parameters again recently, so it's no longer fixed, I assume.

The leak has always been there (see above); various Badger params only make it more or less severe.