Problem
We deployed our own L2 OP devnet with a private Babylon network. Last week the consumer FP ran out of BBN tokens. We topped up the balance and restarted the FP; the issue was found and resolved after ~12 hours.
After the restart, the FP started fast sync to catch up on submitting signatures. The log shows that fast sync finished:
2024-09-08T16:38:03.271817Z info fast sync is finished {"pk": "...", "synced_height": 46346, "last_processed_height": 46346}
After that, the FP stopped submitting new signatures, starting from block 46347. We found no logs of “the finality-provider received a new block, start processing” (emitted in finalitySigSubmissionLoop()), so it seems something was wrong with the chain poller.
We then noticed that fp.poller.SkipToHeight() is called right after fp.logger.Info("fast sync is finished",…)
Oddly, though, we don't see any logs of “the poller has skipped height(s)”.
Root cause analysis
We found that the code is stuck here:
func (cp *ChainPoller) SkipToHeight(height uint64) error {
	...
	select {
	case <-cp.quit:
		return fmt.Errorf("the chain poller is stopped")
	case cp.skipHeightChan <- &skipHeightRequest{height: height, resp: respChan}:
	}
	// stuck here
	select {
	case <-cp.quit:
		return fmt.Errorf("the chain poller is stopped")
	case resp := <-respChan:
		return resp.err
	}
}
This is what happened:
After the funds were topped up, finalitySigSubmissionLoop() entered fast sync mode to catch up to the latest block.
At the same time, pollChain() was running in a separate goroutine, polling new blocks and putting them into the blockInfoChan channel.
Before fast sync finished, blockInfoChan filled up, so pollChain() blocked while sending to it.
As a result, the case req := <-cp.skipHeightChan: branch was never reached.
When fast sync finished, the FP called the poller's SkipToHeight().
That call is stuck at the last select statement because nothing is ever sent on respChan.
respChan is only written to from that branch of the poller loop:
case req := <-cp.skipHeightChan:
	// no need to skip heights if the target height is not higher
	// than the next height to retrieve
	targetHeight := req.height
	if targetHeight <= cp.nextHeight {
		resp := &skipHeightResponse{
			err: fmt.Errorf(
				"the target height %d is not higher than the next height %d to retrieve",
				targetHeight, cp.nextHeight)}
		req.resp <- resp
		continue
	}
Proposed fix
When the FP starts, wait until fast sync has finished before starting the poller.