babylonlabs-io / finality-provider

FP submission loop stuck when poller `blockInfoChan` is full #53

Open bap2pecs opened 1 month ago

bap2pecs commented 1 month ago

Problem

We deployed our own OP L2 devnet with a private Babylon network. Last week the consumer FP ran out of BBN tokens. We found and resolved that after ~12 hours by topping up the balance and restarting the FP.

After the restart, the FP began fast syncing to catch up on signature submission. The log shows that fast sync finished:

2024-09-08T16:38:03.271817Z info    fast sync is finished   {"pk": "...", "synced_height": 46346, "last_processed_height": 46346}

After that, the FP stopped submitting new signatures, starting from block 46347. There are no "the finality-provider received a new block, start processing" logs (emitted in finalitySigSubmissionLoop()), so something seems to be wrong with the chain poller.

We then noticed that fp.poller.SkipToHeight() is called after fp.logger.Info("fast sync is finished",…).

Oddly, we don't see any "the poller has skipped height(s)" logs.

Root cause analysis

We found that the code is stuck here:

func (cp *ChainPoller) SkipToHeight(height uint64) error {
    ...
    select {
    case <-cp.quit:
        return fmt.Errorf("the chain poller is stopped")
    case cp.skipHeightChan <- &skipHeightRequest{height: height, resp: respChan}:
    }

    // stuck here
    select {
    case <-cp.quit:
        return fmt.Errorf("the chain poller is stopped")
    case resp := <-respChan:
        return resp.err
    }
}

This is what happened:
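A minimal sketch of how a hang like this can arise, going by the issue title: the poller blocks on a send to a full `blockInfoChan`, so it never gets back to servicing the skip request, while the only consumer of `blockInfoChan` is itself waiting inside `SkipToHeight`. Everything here is a toy model and not the real `ChainPoller`: `toyPoller`, its loop shape, the buffered `skipHeightChan`, and the timeout used to detect the hang are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errStuck = errors.New("timed out waiting for the poller to respond")

type skipHeightRequest struct {
	height uint64
	resp   chan error
}

// toyPoller is a hypothetical, stripped-down stand-in for ChainPoller.
type toyPoller struct {
	blockInfoChan  chan uint64
	skipHeightChan chan *skipHeightRequest
	quit           chan struct{}
}

func newToyPoller(bufSize int) *toyPoller {
	return &toyPoller{
		blockInfoChan:  make(chan uint64, bufSize),
		skipHeightChan: make(chan *skipHeightRequest, 1), // buffered: an assumption
		quit:           make(chan struct{}),
	}
}

// run services skip requests only between block sends (assumed loop shape).
func (p *toyPoller) run(next uint64) {
	for {
		select {
		case <-p.quit:
			return
		case req := <-p.skipHeightChan:
			next = req.height
			req.resp <- nil
			continue
		default:
		}
		// This send does not watch skipHeightChan. Once the buffer is full and
		// nobody reads it, the loop never returns to the select above.
		select {
		case p.blockInfoChan <- next:
			next++
		case <-p.quit:
			return
		}
	}
}

// skipToHeight follows the two-step shape of SkipToHeight, plus a timeout
// (not in the original) so the demo reports the hang instead of blocking.
func (p *toyPoller) skipToHeight(height uint64, timeout time.Duration) error {
	respChan := make(chan error, 1)
	select {
	case <-p.quit:
		return fmt.Errorf("the chain poller is stopped")
	case p.skipHeightChan <- &skipHeightRequest{height: height, resp: respChan}:
	}
	select {
	case <-p.quit:
		return fmt.Errorf("the chain poller is stopped")
	case err := <-respChan:
		return err
	case <-time.After(timeout):
		return errStuck
	}
}

// demoStuck: nobody drains blockInfoChan, so the skip request is never served.
func demoStuck() error {
	p := newToyPoller(2)
	go p.run(1)
	time.Sleep(50 * time.Millisecond) // let the buffer fill up
	err := p.skipToHeight(100, 200*time.Millisecond)
	close(p.quit)
	return err
}

// demoDrained: a consumer keeps reading, so the poller can serve the request.
func demoDrained() error {
	p := newToyPoller(2)
	go p.run(1)
	go func() {
		for {
			select {
			case <-p.blockInfoChan:
			case <-p.quit:
				return
			}
		}
	}()
	err := p.skipToHeight(100, 2*time.Second)
	close(p.quit)
	return err
}

func main() {
	fmt.Println("no consumer:", demoStuck())
	fmt.Println("with consumer:", demoDrained())
}
```

In this model the request is accepted by the first `select` and the caller then waits on `respChan` indefinitely, which matches the "stuck here" position in the snippet above.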

Proposed fix

When the FP starts, wait until fast sync has finished before starting the poller.
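A rough sketch of that ordering, under stated assumptions: `fastSync`, `startPoller`, and `startFP` are hypothetical helpers, not functions from the finality-provider code base, and the heights other than 46346 and 46347 are made up. The point is only that a poller started after fast sync, at `synced_height + 1`, has nothing buffered to skip over, so no `SkipToHeight` call is needed at startup.

```go
package main

import "fmt"

// fastSync is a hypothetical stand-in for the FP's catch-up routine.
// It "processes" heights from..tip and returns the last height processed.
func fastSync(from, tip uint64) uint64 {
	last := from - 1 // assumes from >= 1
	for h := from; h <= tip; h++ {
		last = h // signature submission for h is elided
	}
	return last
}

// startPoller is a toy poller that emits heights startHeight..endHeight
// on a buffered channel and then closes it.
func startPoller(startHeight, endHeight uint64, bufSize int) <-chan uint64 {
	ch := make(chan uint64, bufSize)
	go func() {
		defer close(ch)
		for h := startHeight; h <= endHeight; h++ {
			ch <- h
		}
	}()
	return ch
}

// startFP runs fast sync to completion first and only then starts the
// poller, at the height right after the last synced one.
func startFP(lastVoted, tip, newTip uint64) []uint64 {
	synced := fastSync(lastVoted+1, tip)
	blocks := startPoller(synced+1, newTip, 2)

	var processed []uint64
	for h := range blocks { // stand-in for the submission loop
		processed = append(processed, h)
	}
	return processed
}

func main() {
	// 46346 is the synced height from the log above; the rest is illustrative.
	fmt.Println(startFP(46000, 46346, 46350))
}
```

One trade-off of this ordering is that blocks produced while fast sync runs are only picked up once the poller starts, so the poller's start height has to come from the fast sync result rather than from the height known before syncing.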