Right now, I am collecting profiles for further investigation
Running it with pprof for a day. No suspicious memory growth was noticed, nor anything leaky in the profiles.
The only concerning thing is that Badger eats 80% of RAM in various forms.
Running for another 24 hours with continuous profiling. RAM usage is stable at 1.22 GB with a single spike to 1.91 GB.
Clearly, the leak does not happen under normal conditions, and some edge case has to be triggered to see it. It's unclear what exactly one has to do to trigger it, and I don't see any better idea than to continue profiling and observing.
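For context, a minimal sketch of how this kind of continuous pprof observation can be wired up in a Go process; the address is an arbitrary placeholder and this is not necessarily how celestia-node exposes it:

```go
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Heap, goroutine, and allocation profiles can then be pulled periodically,
    // e.g. `go tool pprof http://localhost:6060/debug/pprof/heap`.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}
```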
New update. My node OOMed again during Badger compaction, which bursts RAM usage. The node uses around 3GB of RAM right before the compaction, and most of the profiles lead to Badger allocations. We have yet to investigate whether this is a Badger v4 bug or whether we use it incorrectly.
Talking to @Jorropo, he said they observe similar behavior with badger.
Just wanted to add that I am experiencing this issue on Arabica using rc6.1. My server has 16GB of memory, and the celestia light process slowly consumes it until it's OOM-killed.
I am running the binary again with GOMEMLIMIT=2GiB and GOGC=25 to make the runtime more conservative about RAM usage.
Such a low GOMEMLIMIT is effectively equivalent to:

```go
for {
    runtime.GC()
    time.Sleep(time.Second)
}
```
@Jorropo, I meant GiB; updated the comment.
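For reference, the same limits can also be set from inside the process via the runtime/debug API; a minimal sketch equivalent to GOMEMLIMIT=2GiB and GOGC=25, not the actual celestia-node configuration path:

```go
package main

import "runtime/debug"

func main() {
    // Equivalent of GOMEMLIMIT=2GiB: a soft limit the runtime tries to stay
    // under by running GC more aggressively as the limit is approached.
    debug.SetMemoryLimit(2 << 30)
    // Equivalent of GOGC=25: trigger GC once the heap grows 25% over the live set.
    debug.SetGCPercent(25)
    // ... start the node as usual.
}
```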
This so far seems to be the leakiest of them all. It looks like Badger transactions are not cleaned up after Commit. In particular, the trace comes from peerstoreds's flush.
So it seems there might be a bug in the DS-backed peerstore, or it's a bug in Badger that was fixed in main but not in the release we use.
Did you notice any instability after switching to pebble? I'd be interested in building rc6.1 with pebble swapped in, if that's possible for an end user (like with a go.mod dependency swap).
(I need to keep my LN running 24/7 for Dymension and it keeps going down, so I'm excited for a solution)
> Did you notice any instability after switching to pebble?
I think it's too early to make any conclusions on stability. I need to run it for at least a week to be sure we don't have any side-effects after switching.
Update on the badger:
The code leaks in places where we write a single value to the Badger datastore directly with Put. In places where we use Batch with multiple entries written, it works well.
As an experiment, I tried using Batch inside of Put in the go-ds-badger4 adapter, but it didn't yield any result. Profiles show that Badger still leaks transactions inside of the used batches.
At this point, the only clue we have is a single entry written vs. multiple at the same time, and this is likely to be a bug inside of Badger.
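For clarity, a rough sketch of what that experiment looks like at the datastore level; the helper name is hypothetical and this is not the actual go-ds-badger4 change:

```go
package sketch

import (
    "context"

    ds "github.com/ipfs/go-datastore"
)

// putViaBatch is a hypothetical stand-in for the experiment above: instead of
// a direct single-value Put, route the write through a one-entry Batch. In the
// real experiment this logic lived inside the go-ds-badger4 adapter's Put.
func putViaBatch(ctx context.Context, d ds.Batching, key ds.Key, value []byte) error {
    b, err := d.Batch(ctx)
    if err != nil {
        return err
    }
    if err := b.Put(ctx, key, value); err != nil {
        return err
    }
    // Commit flushes the single-entry batch; profiles still showed Badger
    // leaking transactions inside these batches, so this changed nothing.
    return b.Commit(ctx)
}
```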
On pebble:
Simply migrating to pebble made our node consume a steady ~300MiB for the LN, even during compactions. This is a good result, but we should not rush the migration as it's very disruptive. Also, considering that LNs won't have a high write load, we may also consider a different KV store engine that is not LSM-tree based and is much simpler, likely a B+ tree.
For now, I think we should focus on figuring out a simple patch for Badger and consider the KV store migration in a separate issue.
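For reference, the pebble experiment above boils down to swapping the storage engine underneath the datastore; a minimal sketch of talking to pebble directly, with placeholder paths and options rather than the actual node wiring:

```go
package main

import (
    "log"

    "github.com/cockroachdb/pebble"
)

func main() {
    // Open (or create) a pebble store; default options here are purely for illustration.
    db, err := pebble.Open("/tmp/pebble-datastore", &pebble.Options{})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // A single synced write, analogous to a datastore Put.
    if err := db.Set([]byte("key"), []byte("value"), pebble.Sync); err != nil {
        log.Fatal(err)
    }
}
```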
New update on badger:
We were able to break the following hypothesis by writing an additional entry every time we do a single Put; even when we write two entries, it still leaks:
> At this point, the only clue we have is a single entry written vs. multiple at the same time, and this is likely to be a bug inside of Badger.
However, there is good news. We were able to accelerate the memory growth by rewriting the value for the same key over and over again. This also aligns with leaking code paths, where we update the state of something with a constant key. So there is enough proof at this point that the bug is in Badger.
The next step is to write a unit test for Badger that demonstrates this and to open an issue on their side.
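A sketch of the kind of unit test I have in mind, hammering a constant key against Badger v4 directly; iteration counts and value sizes are arbitrary, and the memory assertion is left as a comment:

```go
package badgerleak

import (
    "testing"

    badger "github.com/dgraph-io/badger/v4"
)

// TestOverwriteConstantKey repeatedly rewrites the same key, mirroring the
// access pattern that accelerates the memory growth in celestia-node.
func TestOverwriteConstantKey(t *testing.T) {
    db, err := badger.Open(badger.DefaultOptions(t.TempDir()))
    if err != nil {
        t.Fatal(err)
    }
    defer db.Close()

    key := []byte("constant-key")
    val := make([]byte, 128)
    for i := 0; i < 100_000; i++ {
        err := db.Update(func(txn *badger.Txn) error {
            return txn.Set(key, val)
        })
        if err != nil {
            t.Fatal(err)
        }
    }
    // A real test would sample runtime.MemStats or Badger's internal metrics
    // here and assert they stay bounded.
}
```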
The bug is in the DetectConflicts feature of Badger. By disabling it, I could get stable RAM usage for over an hour already.
🎉
This aligns with the observation that rewriting the same key causes issues.
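For reference, disabling the feature is a single option on Badger v4 (the path is a placeholder):

```go
package sketch

import badger "github.com/dgraph-io/badger/v4"

// openWithoutConflictDetection opens Badger with conflict detection disabled,
// the knob that kept RAM stable in the experiment above.
func openWithoutConflictDetection(path string) (*badger.DB, error) {
    opts := badger.DefaultOptions(path).WithDetectConflicts(false)
    return badger.Open(opts)
}
```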
New final update:
The bug is not in DetectConflicts as I thought. Disabling DetectConflicts just removes one of the leaks caused by the bug, and it still has to be accompanied by some other case, because everything was fine when I worked with Badger directly: a pure Badger test for overwriting the same key didn't show any leaks.
But right after the DetectConflicts finding, I kept playing with Badger to the point where I reduced RAM usage during sync by up to 10 times (more on that later). This allowed me to detect another, more minor leak in Badger that I hadn't realized was there but had seen before.
After a day of pure debugging, I found the true culprit of all the problems: an edge case in Badger. Check the patch for it here. Basically, when we flush an empty Batch, the transaction inside of it is not marked as read. Flushing of an empty Batch happens in the DHT's provider manager: when the manager loads the provider set via ds.Query, the autobatch wrapper internally does the Flush. As this query happens on startup, the batch is empty, and thus the issue. (cc @Jorropo, this should explain why IPFS observes the leak with Badger as well)
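A sketch of the triggering shape, assuming the go-datastore autobatch wrapper; the batch size is an arbitrary placeholder and the real call site is the DHT provider manager:

```go
package sketch

import (
    "context"

    ds "github.com/ipfs/go-datastore"
    "github.com/ipfs/go-datastore/autobatch"
)

// flushEmptyBatch mirrors what happens on startup: the provider manager's
// autobatch wrapper flushes before anything has been buffered, so the
// underlying Badger batch is empty and, with the buggy Badger, its transaction
// is never marked as read.
func flushEmptyBatch(ctx context.Context, backing ds.Batching) error {
    ab := autobatch.NewAutoBatching(backing, 256)
    // Nothing has been Put yet; this Flush commits an empty batch.
    return ab.Flush(ctx)
}
```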
Badger internals of the bug: when we flush a datastore Batch with no data, Badger does not notify the internal synchronization primitive (Watermark) that the read side of the transaction has ended. Surprisingly, the bug does not leak the transaction or batch directly, but it never frees up resources inside of the Watermark and never progresses the Watermark's internal monotonic counter. The leak inside of the Watermark is the minor leak I mentioned in the previous paragraph, while the stalled counter causes the DetectConflicts feature to leak, which is the original leak.
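To make the mechanism concrete, here is an illustrative sketch of the shape of the bug using Badger's y.WaterMark Done API; the transaction type and commit function are invented stand-ins, not Badger's actual source:

```go
package sketch

import "github.com/dgraph-io/badger/v4/y"

// fakeTxn is an invented stand-in for Badger's internal transaction, carrying
// only what the sketch needs.
type fakeTxn struct {
    readTs        uint64
    pendingWrites map[string][]byte
}

// commitSketch shows the problematic shape: an empty commit returns early
// without ever calling Done on the read WaterMark.
func commitSketch(txn *fakeTxn, readMark *y.WaterMark) error {
    if len(txn.pendingWrites) == 0 {
        // Bug: skipping readMark.Done(txn.readTs) here means the WaterMark's
        // bookkeeping for this read timestamp is never released (the minor
        // leak) and its monotonic "done until" counter stalls, which in turn
        // keeps conflict-detection state alive (the original leak).
        return nil
    }
    // Happy path: the read side is marked done once the commit is processed.
    defer readMark.Done(txn.readTs)
    // ... write the pending entries, wait for acknowledgement, etc.
    return nil
}
```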
The mystery is solved, and now we need to land the patch and apply all the optimizations I've collected along the way.
P.S. Debug branches:
This bug has existed since the start of celestia-node, literally. We faced it multiple times whenever we played with Badger: y.Watermark mem usage shows up in various profiles over the life span of celestia-node, but I never really analyzed it nor realized it's an issue. I just thought it was a Badger thing. In fact, y.Watermark should not be visible in profiles at all.

Yes, the watermark has to do with GC, and we explored seeing the watermark in Pyroscope before. We fixed the leak, but then we changed the Badger parameters again recently, so it's no longer fixed, I assume.
> We fixed the leak, but then we changed the Badger parameters again recently, so it's no longer fixed, I assume.
The leak has always been there (see above); various Badger params only make it more or less severe.
I found my mainnet LN getting OOMed a day ago on an 8GB DO instance. @renaynay found her BN OOMing on a 16GB instance. There is definitely a leak somewhere, and it's likely to exist in the FN as well.
At this point, there is not much info on why it happens, nor on why we haven't seen this on other networks like Arabica, Mocha, and Robusta.