Open bap2pecs opened 1 month ago
not sure why CI failed even on lint. make lint
runs fine locally
also seems babylon-private access in CI is still broken
Thanks for the proposed fix! @gitferry what do you think about starting poller only after fast sync? actually I remember we had a discussion on it but somehow did not push it further
Also, if this idea is approved, how about merging it to main
rather than the base branch?
In the previous implementation, the poller is started after
bootstrap
is finished which involvesfast sync
, so I don't understand what specific case does this PR want to address. Did I miss anything?
this is to fix an edge case where FP runs out of BBN token to vote and couldn't recover. see details in https://github.com/babylonlabs-io/finality-provider/issues/53
@gitferry. ok we just took another look at the logs. so what happened was:
after topping up BBN and restarting the FP, bootstrap
will only proceed to the block that has pub rand. when it detects a block that doesn't have PR, it will return without error:
func (fp *FinalityProviderInstance) hasRandomness(b *types.BlockInfo) (bool, error) {
lastCommittedHeight, err := fp.GetLastCommittedHeight()
if err != nil {
return false, err
}
if b.Height > lastCommittedHeight {
fp.logger.Debug(
"the finality provider has not committed public randomness for the height",
zap.String("pk", fp.GetBtcPkHex()),
zap.Uint64("block_height", b.Height),
zap.Uint64("last_committed_height", lastCommittedHeight),
)
return false, nil
}
return true, nil
}
then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped"
see relevant logs
2024-09-08T13:06:49.101779Z info Starting finality-provider instance {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-08T13:06:49.104649Z debug the finality-provider is entering fast sync {"start_height": 25444, "target_block_height": 46340}
2024-09-08T13:07:53.076012Z debug the finality provider has not committed public randomness for the height {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 46334, "last_committed_height": 25500}
2024-09-08T13:07:53.090461Z info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 25501}
cc @SebastianElvis
@gitferry. ok we just took another look at the logs. so what happened was:
after topping up BBN and restarting the FP,
bootstrap
will only proceed to the block that has pub rand. when it detects a block that doesn't have PR, it will return without error:func (fp *FinalityProviderInstance) hasRandomness(b *types.BlockInfo) (bool, error) { lastCommittedHeight, err := fp.GetLastCommittedHeight() if err != nil { return false, err } if b.Height > lastCommittedHeight { fp.logger.Debug( "the finality provider has not committed public randomness for the height", zap.String("pk", fp.GetBtcPkHex()), zap.Uint64("block_height", b.Height), zap.Uint64("last_committed_height", lastCommittedHeight), ) return false, nil } return true, nil }
then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped"
see relevant logs
2024-09-08T13:06:49.101779Z info Starting finality-provider instance {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"} 2024-09-08T13:06:49.104649Z debug the finality-provider is entering fast sync {"start_height": 25444, "target_block_height": 46340} 2024-09-08T13:07:53.076012Z debug the finality provider has not committed public randomness for the height {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 46334, "last_committed_height": 25500} 2024-09-08T13:07:53.090461Z info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 25501}
cc @SebastianElvis
It seems that the issue is that fast sync reaches the latest height that pub rand is committed but still far from the tip height, so the fast sync got triggered frequently causing the poller stuck. If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.
If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.
@lesterli could you help take a look later today? i will be mostly in events today
If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.
@lesterli could you help take a look later today? i will be mostly in events today
after checking the FP codes, the defaultRandomInterval
is 30s
and defaultFastSyncInterval
is 10s
, when fast sync was running and there were many lagging blocks(25444-46340
), it is possible that some blocks' pub rand had not been submitted.
There is a similar case when fp was just started:
2024-09-06T14:43:05.123360Z info Starting finality-provider instance {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-06T14:43:05.128529Z debug the finality-provider is entering fast sync {"start_height": 1, "target_block_height": 27}
2024-09-06T14:43:05.132508Z debug Successfully batch query blocks {"start_height": 1, "end_height": 27, "limit": 10, "last_block_hash": "20859a1e111efde38dcdc70037c60d8eba302ab635753d7dd031c8dd180a0064"}
2024-09-06T14:43:05.139480Z debug the finality provider has not committed public randomness for the height {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 1, "last_committed_height": 0}
2024-09-06T14:43:05.169614Z info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 1}
2024-09-06T14:43:05.169695Z info starting the chain poller
2024-09-06T14:43:05.173325Z info the chain poller is successfully started
bootstrap
is finished which involvesfast sync
@gitferry in the bootstrap()
, should we first check FastSyncInterval
? If the fast sync is disabled, we should not trigger fast sync, right?
@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful.
so it seems we will still need this change
@gitferry in the bootstrap(), should we first check FastSyncInterval? If the fast sync is disabled, we should not trigger fast sync, right?
Yes, we have the check at the beginning of checkLaggingLoop
@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful.
so it seems we will still need this change
we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height
we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height
right but the problem is bootstrap () will not wait for it. thus the edge case i.e. poller channel gets full still exists. and this PR tries to fix that
ok we just realized that this PR still has some problems. i.e. pubrand is lagging a lot so fast sync exits early
https://github.com/babylonlabs-io/finality-provider/compare/base/consumer-chain-support...fix/tmp-fix-poller?expand=1 this might be an easier fix to avoid all the headaches but still resolves #53
basically there are 3 ways I can think of:
bootstrap()
until fast sync of both pub rand and sig submission completedcc @gitferry @SebastianElvis
another discussion topic is whether we should fast sync pub rand
discussed w @gitferry.
- @lesterli will cherry-pick feat: batch processing non-finalised blocks #57 to base
created https://github.com/babylonlabs-io/finality-provider/pull/77
Summary
this is to fix https://github.com/babylonlabs-io/finality-provider/issues/53
it also makes sense b/c fast sync will fetch the blocks anyway. so it's a waste of resources as well if there is another poller routine doing the same thing
Test Plan
tested this on GCP, waiting for the FP to run out of BBN. Then, after restarting FP, the sigs submission will work fine. https://gist.github.com/lesterli/830f6ec7cac7c8a9d43b251039d20c8b