babylonlabs-io / finality-provider

Other
6 stars 10 forks source link

Start poller after fast sync is completed #54

Open bap2pecs opened 1 month ago

bap2pecs commented 1 month ago

Summary

this is to fix https://github.com/babylonlabs-io/finality-provider/issues/53

it also makes sense b/c fast sync will fetch the blocks anyway. so it's a waste of resources as well if there is another poller routine doing the same thing

Test Plan

make test-e2e-op
make test
make test-e2e-babylon --run ^TestFastSync$

tested this on GCP, waiting for the FP to run out of BBN. Then, after restarting FP, the sigs submission will work fine. https://gist.github.com/lesterli/830f6ec7cac7c8a9d43b251039d20c8b

bap2pecs commented 1 month ago

not sure why CI failed even on lint. make lint runs fine locally

image
bap2pecs commented 1 month ago

also seems babylon-private access in CI is still broken

image
SebastianElvis commented 1 month ago

Thanks for the proposed fix! @gitferry what do you think about starting poller only after fast sync? actually I remember we had a discussion on it but somehow did not push it further

Also, if this idea is approved, how about merging it to main rather than the base branch?

bap2pecs commented 1 month ago

In the previous implementation, the poller is started after bootstrap is finished which involves fast sync, so I don't understand what specific case does this PR want to address. Did I miss anything?

this is to fix an edge case where FP runs out of BBN token to vote and couldn't recover. see details in https://github.com/babylonlabs-io/finality-provider/issues/53

bap2pecs commented 1 month ago

@gitferry. ok we just took another look at the logs. so what happened was:

after topping up BBN and restarting the FP, bootstrap will only proceed to the block that has pub rand. when it detects a block that doesn't have PR, it will return without error:

func (fp *FinalityProviderInstance) hasRandomness(b *types.BlockInfo) (bool, error) {
    lastCommittedHeight, err := fp.GetLastCommittedHeight()
    if err != nil {
        return false, err
    }
    if b.Height > lastCommittedHeight {
        fp.logger.Debug(
            "the finality provider has not committed public randomness for the height",
            zap.String("pk", fp.GetBtcPkHex()),
            zap.Uint64("block_height", b.Height),
            zap.Uint64("last_committed_height", lastCommittedHeight),
        )
        return false, nil
    }

    return true, nil
}

then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped"

see relevant logs

2024-09-08T13:06:49.101779Z info    Starting finality-provider instance {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-08T13:06:49.104649Z debug   the finality-provider is entering fast sync {"start_height": 25444, "target_block_height": 46340}
2024-09-08T13:07:53.076012Z debug   the finality provider has not committed public randomness for the height    {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 46334, "last_committed_height": 25500}
2024-09-08T13:07:53.090461Z info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 25501}

cc @SebastianElvis

gitferry commented 1 month ago

@gitferry. ok we just took another look at the logs. so what happened was:

after topping up BBN and restarting the FP, bootstrap will only proceed to the block that has pub rand. when it detects a block that doesn't have PR, it will return without error:

func (fp *FinalityProviderInstance) hasRandomness(b *types.BlockInfo) (bool, error) {
  lastCommittedHeight, err := fp.GetLastCommittedHeight()
  if err != nil {
      return false, err
  }
  if b.Height > lastCommittedHeight {
      fp.logger.Debug(
          "the finality provider has not committed public randomness for the height",
          zap.String("pk", fp.GetBtcPkHex()),
          zap.Uint64("block_height", b.Height),
          zap.Uint64("last_committed_height", lastCommittedHeight),
      )
      return false, nil
  }

  return true, nil
}

then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped"

see relevant logs

2024-09-08T13:06:49.101779Z   info    Starting finality-provider instance {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-08T13:06:49.104649Z   debug   the finality-provider is entering fast sync {"start_height": 25444, "target_block_height": 46340}
2024-09-08T13:07:53.076012Z   debug   the finality provider has not committed public randomness for the height    {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 46334, "last_committed_height": 25500}
2024-09-08T13:07:53.090461Z   info the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 25501}

cc @SebastianElvis

It seems that the issue is that fast sync reaches the latest height that pub rand is committed but still far from the tip height, so the fast sync got triggered frequently causing the poller stuck. If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.

bap2pecs commented 1 month ago

If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.

@lesterli could you help take a look later today? i will be mostly in events today

lesterli commented 1 month ago

If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously.

@lesterli could you help take a look later today? i will be mostly in events today

after checking the FP codes, the defaultRandomInterval is 30s and defaultFastSyncInterval is 10s, when fast sync was running and there were many lagging blocks(25444-46340), it is possible that some blocks' pub rand had not been submitted.

There is a similar case when fp was just started:

2024-09-06T14:43:05.123360Z info    Starting finality-provider instance {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e"}
2024-09-06T14:43:05.128529Z debug   the finality-provider is entering fast sync {"start_height": 1, "target_block_height": 27}
2024-09-06T14:43:05.132508Z debug   Successfully batch query blocks {"start_height": 1, "end_height": 27, "limit": 10, "last_block_hash": "20859a1e111efde38dcdc70037c60d8eba302ab635753d7dd031c8dd180a0064"}
2024-09-06T14:43:05.139480Z debug   the finality provider has not committed public randomness for the height    {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "block_height": 1, "last_committed_height": 0}
2024-09-06T14:43:05.169614Z info    the finality-provider has been bootstrapped {"pk": "e1241d1fa4087b853bf8f92734241e52afddd629fdb5620e91c393a907425e5e", "height": 1}
2024-09-06T14:43:05.169695Z info    starting the chain poller
2024-09-06T14:43:05.173325Z info    the chain poller is successfully started
lesterli commented 1 month ago

bootstrap is finished which involves fast sync

@gitferry in the bootstrap(), should we first check FastSyncInterval? If the fast sync is disabled, we should not trigger fast sync, right?

bap2pecs commented 1 month ago

@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful.

so it seems we will still need this change

gitferry commented 1 month ago

@gitferry in the bootstrap(), should we first check FastSyncInterval? If the fast sync is disabled, we should not trigger fast sync, right?

Yes, we have the check at the beginning of checkLaggingLoop

gitferry commented 1 month ago

@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful.

so it seems we will still need this change

we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height

bap2pecs commented 1 month ago

we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height

right but the problem is bootstrap () will not wait for it. thus the edge case i.e. poller channel gets full still exists. and this PR tries to fix that

bap2pecs commented 1 month ago

ok we just realized that this PR still has some problems. i.e. pubrand is lagging a lot so fast sync exits early

https://github.com/babylonlabs-io/finality-provider/compare/base/consumer-chain-support...fix/tmp-fix-poller?expand=1 this might be an easier fix to avoid all the headaches but still resolves #53

basically there are 3 ways I can think of:

  1. sth similar to this PR
  2. increase buffer size (feels this is the easiest path)
  3. block bootstrap() until fast sync of both pub rand and sig submission completed

cc @gitferry @SebastianElvis

bap2pecs commented 1 month ago

another discussion topic is whether we should fast sync pub rand

bap2pecs commented 1 month ago

discussed w @gitferry.

lesterli commented 1 month ago

created https://github.com/babylonlabs-io/finality-provider/pull/77