eigerco / lumina

Wasm friendly Celestia light node implementation in Rust
Apache License 2.0

bug: synchronization sometimes hangs while still reporting connected peers #256

Open zvolin opened 4 months ago

zvolin commented 2 months ago

I've identified 3 different issues causing this.

panic in libp2p-kad

The kbucket uses std::time::Instant, which sometimes triggers a runtime panic. After that the node is left in a more or less undefined state: I saw it either stop logging completely, or stay active only on libp2p-gossipsub without syncing, updating peers, etc. We got the fix for this in as part of https://github.com/libp2p/rust-libp2p/pull/5347. We'll either have to wait for the 0.54 release of libp2p or try to get a backport.

IndexedDb hanging on committing in append_single_unchecked

This happened to me on Firefox. Switching from .commit() to .done() didn't solve it; the call still hung indefinitely. This makes the syncer hang too, along with our UI updates, because we await node.syncer_stats(), which never resolves. The only thing that helped was clearing the whole IndexedDB store. I haven't seen it happen since, and I'm not sure what caused it.

header-ex request never resolved on libp2p level

With Sessions we make a few requests in parallel (currently 8) and retry the ones that errored out or returned incomplete ranges of headers. It sometimes happens that for a single request we never get any event from the request-response behavior. It should time out, finish, or error, but in this case we just don't get any event. Session.run() then hangs on

        while self.ongoing > 0 {
            let (height, requested_amount, res) = self.recv_response().await;

with self.ongoing == 1, and the syncer waits for it to finish. We could solve this by re-introducing timeouts by hand in our header-ex behavior; however, it'd be good to know what's going on in libp2p. I spotted one place that could lead to this bug, but it should emit a log line, and none was present in my reproduction. Additional debugging is needed here.
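To make the "timeouts by hand" idea concrete, here is a minimal sketch of tracking a deadline per in-flight request, so that a request which never produces an event can still be expired and retried. Everything in it is hypothetical (RequestId, PendingRequests and its methods are not lumina's or libp2p's real types). It assumes the caller supplies the current time as a Duration from some monotonic clock, which keeps the sketch off std::time::Instant.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical request identifier; not the real libp2p request id type.
type RequestId = u64;

/// Tracks in-flight requests together with the time by which each must finish.
/// Time is a `Duration` since an arbitrary start, supplied by the caller
/// (an assumption of this sketch).
#[derive(Debug, Default)]
struct PendingRequests {
    deadlines: HashMap<RequestId, Duration>,
}

impl PendingRequests {
    fn new() -> Self {
        Self::default()
    }

    /// Registers a request sent at `now` that must complete within `timeout`.
    fn insert(&mut self, id: RequestId, now: Duration, timeout: Duration) {
        self.deadlines.insert(id, now + timeout);
    }

    /// Marks a request as answered. Returns `false` if the request is unknown
    /// or has already been expired.
    fn resolve(&mut self, id: RequestId) -> bool {
        self.deadlines.remove(&id).is_some()
    }

    /// Removes and returns, in ascending order, the ids of all requests whose
    /// deadline has passed. The caller would treat these like errored requests.
    fn expire(&mut self, now: Duration) -> Vec<RequestId> {
        let mut expired: Vec<RequestId> = self
            .deadlines
            .iter()
            .filter(|(_, deadline)| **deadline <= now)
            .map(|(id, _)| *id)
            .collect();
        expired.sort_unstable();
        for id in &expired {
            self.deadlines.remove(id);
        }
        expired
    }

    /// Number of requests still waiting for a response.
    fn ongoing(&self) -> usize {
        self.deadlines.len()
    }
}
```

In a loop like the one above, expire() would be called periodically and each returned id fed back as a timed-out response, so ongoing can always reach 0 even when the request-response behavior stays silent. This only works around the symptom; it doesn't explain the missing event in libp2p.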

zvolin commented 2 months ago

I have a branch for debugging this here. It uses my fork of libp2p, where I've already backported the kbucket fix, but it's better to clone the fork locally and switch the patches to path-based ones so that you can add logs to libp2p by hand. The branch also has the bulk inserts into IndexedDB implemented.

When debugging, run the node in Chromium. There will be a lot of logs because it's on trace level. Wait for 2-3 syncer batches to see whether it reproduced. If it did, wait half a minute or more, toggle the preserve logs button in dev tools, and refresh the page so that new logs stop appearing and don't flood you. You can then check the logs; they should survive the refresh. If it didn't reproduce, just refresh and repeat. I found it hard to debug with more logs than 2-3 batches' worth, because there are a lot of them. Firefox also has a nice feature for saving all logs to a file.

oblique commented 1 month ago

Blockers: