babylonchain / finality-provider

A peripheral program run by the finality providers
Other
16 stars 28 forks source link

flaky tests #439

Open bap2pecs opened 4 months ago

bap2pecs commented 4 months ago

example

image

this happens a bit too frequently. just see it again cc @SebastianElvis

SebastianElvis commented 4 months ago

i guess it's simply that we have too many e2e tests so that it sometimes time out? https://app.circleci.com/pipelines/github/babylonchain/finality-provider/1589/workflows/9e581924-d3e1-404a-89a8-83cd29a16658/jobs/3657/parallel-runs/0/steps/0-115?invite=true#step-115-2974_33

If yes, how about we break down the CI pipeline a bit, e.g., 1 workflow for unit tests, 1 workflow for Babylon e2e, 1 workflow for wasm e2e, and 1 workflow for OP e2e?

bap2pecs commented 4 months ago

@SebastianElvis this is before the make test-e2e step. so I don't think the timeout is caused by e2e. if you see this one, you can see normal make test only takes 1 minute but this one takes 10 minutes so it timed out

image

so I think we need to investigate into the unit test to understand why it failed sometimes. maybe it's related to the fuzzing tests but I am not sure

1 workflow for Babylon e2e, 1 workflow for wasm e2e, and 1 workflow for OP e2e

but this is a good idea. we can do it. 18 minutes is indeed too long.

https://circleci.com/docs/parallelism-faster-jobs/ seems sth worth exploring

SebastianElvis commented 4 months ago

Hmm I think you are right.

Digging into the error stack and I see

goroutine 222 [chan receive, 9 minutes]:
testing.(*F).Fuzz.func1({0x3c05ca0?, 0xc0017ec5a0}, {{0x0, 0x0}, {0xc000e0b7d8, 0x6}, {0x0, 0x0, 0x0}, {0xc000a42b10, ...}, ...})
        /usr/local/go/src/testing/fuzz.go:337 +0x5aa
testing.(*F).Fuzz(0xc0017ec5a0, {0x2d51ee0, 0x367be20})
        /usr/local/go/src/testing/fuzz.go:402 +0xa37
github.com/babylonchain/finality-provider/finality-provider/service_test.FuzzChainPoller_SkipHeight(0x4803e0?)
        /home/circleci/project/finality-provider/service/chain_poller_test.go:73 +0x35
testing.fRunner(0xc0017ec5a0, 0x367bcf8)
        /usr/local/go/src/testing/fuzz.go:722 +0xbe
created by testing.runFuzzTests in goroutine 1
        /usr/local/go/src/testing/fuzz.go:527 +0x84b

which is the test case FuzzChainPoller_SkipHeight. Does that mean sometimes this test case runs indefinitely?

bap2pecs commented 4 months ago

Does that mean sometimes this test case runs indefinitely

maybe yes. i remember seeing this "SkipHeight" method failing sometimes

bap2pecs commented 4 months ago

@SebastianElvis Another way to do parallelism is to set up sth similar to OP's e2e tests:

https://github.com/babylonchain/optimism/blob/develop/op-e2e/system_test.go#L96 https://github.com/babylonchain/optimism/blob/develop/op-e2e/helper.go#L14-L23

bap2pecs commented 4 months ago

the FuzzChainPoller_SkipHeight failure happened again in #452

Update: another instance on Jul 7: https://app.circleci.com/pipelines/github/babylonchain/finality-provider/1793/workflows/842fe5b0-1e5f-452f-9b89-ba3617c151dd/jobs/4701

bap2pecs commented 4 months ago

I found another flaky task https://github.com/babylonchain/finality-provider/pull/451#issuecomment-2200177000

not sure why those flaky tasks become more frequent lately cc @gitferry @maurolacy @gusin13 do you guys also feel this way?

bap2pecs commented 4 months ago
$ make test-e2e-op                                                                                                                          [12:25:19]
No processes to kill
rm -rf ~/.babylond ~/.wasmd ~/.bcd
cd tools; \
    go install -trimpath github.com/babylonchain/babylon/cmd/babylond
go test -mod=readonly -timeout=25m -v github.com/babylonchain/finality-provider/itest/opstackl2 -count=1 --tags=e2e_op --run ^TestOpMultipleFinalityProviders
=== RUN   TestOpMultipleFinalityProviders
    babylon_node_handler.go:148: babylon log file: /var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/zBabylonTest518499858/babylon.log
Starting babylond with command:  /Users/zidong/go/bin/babylond start --home=/var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/zBabylonTest518499858/node0/babylond --log_level=trace --trace
    op_test_manager.go:86: Fp home dir: /var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/fpe2etest1914071625/fp-home
    setup.go:490: Generating L2 genesis l2_allocs_mode delta
    op_test_manager.go:151: Register consumer op-stack-l2-901 to Babylon
    op_test_manager.go:171: 
            Error Trace:    /Users/zidong/Documents/Projects/babylon-finality-provider/itest/opstackl2/op_test_manager.go:171
                                        /Users/zidong/Documents/Projects/babylon-finality-provider/itest/opstackl2/op_e2e_test.go:119
            Error:          Received unexpected error:
                            the EOTS manager server is not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:12582: connect: connection refused"
            Test:           TestOpMultipleFinalityProviders
--- FAIL: TestOpMultipleFinalityProviders (21.89s)
FAIL
FAIL    github.com/babylonchain/finality-provider/itest/opstackl2   23.492s
FAIL
make: *** [test-e2e-op] Error 1

saw this again. probably due to EOTS was not ready yet?

bap2pecs commented 4 months ago

another flaky test in test_e2e_babylon job

https://github.com/babylonchain/finality-provider/pull/466#issuecomment-2207362361

=== RUN   TestMultipleFinalityProviders
    babylon_node_handler.go:148: babylon log file: /tmp/zBabylonTest1947099691/babylon.log
service injective.evm.v1beta1.Msg does not have cosmos.msg.v1.service proto annotation
service injective.evm.v1beta1.Msg does not have cosmos.msg.v1.service proto annotation
    babylon_test_manager.go:113: Babylon node is started
    babylon_test_manager.go:169: The test manager is running with 3 finality-provider(s)
fatal error: concurrent map writes
fatal error: concurrent map writes
bap2pecs commented 4 months ago

another flaky test when we enable fast sync in our e2e test TestOpSubmitFinalitySignature

2024-07-03T18:13:42.481-0400    INFO    service/fp_instance.go:337  successfully committed public randomness to the consumer chain  {"consumer_id": "op-stack-l2-901", "pk": "7f740f3eaeba7220cf0536ad711ab6949f9d6d75c6242789824d47c2ae1ae118", "tx_hash": "9E73C2029927120D285DBBFE6327E65A78FB5FF1FA2DFB2CE55274718E783FFD"}
2024-07-03T18:13:42.515-0400    INFO    cosmos/log.go:131   Successful transaction  {"chain_id": "chain-test", "gas_used": 1240653, "fees": "3665ubbn", "fee_payer": "l\ufffd\ufffd}\u0019\ufffd\u000bR\ufffd\ufffd+k{\ufffd\ufffd?\ufffd,f\ufffd", "height": 8, "msg_types": ["/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract", "/cosmwasm.wasm.v1.MsgExecuteContract"], "tx_hash": "672C886AD9ECEB62C1496024F776ADC46B42767B84F0C0936ADEE097BA5787F4"}
2024-07-03T18:13:42.535-0400    DEBUG   service/fastsync.go:91  the finality-provider is catching up by sending finality signatures in a batch  {"pk": "7f740f3eaeba7220cf0536ad711ab6949f9d6d75c6242789824d47c2ae1ae118", "start_height": 2, "synced_height": 11}
unexpected fault address 0x152b50b1e
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x2 addr=0x152b50b1e pc=0x100efc2b4]

goroutine 619708 gp=0x140031d7dc0 m=4 mp=0x14000100008 [running]:
runtime.throw({0x103de88a5?, 0xa0?})
    /opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/panic.go:1023 +0x40 fp=0x14008879360 sp=0x14008879330 pc=0x10098cc10
runtime.sigpanic()
    /opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/signal_unix.go:895 +0x22c fp=0x140088793c0 sp=0x14008879360 pc=0x1009a7f5c
github.com/cometbft/cometbft/proto/tendermint/crypto.(*Proof).Unmarshal(0x14008879608, {0x152b50b1e, 0xf2, 0xf2})
    /Users/zidong/go/pkg/mod/github.com/cometbft/cometbft@v0.38.6/proto/tendermint/crypto/proof.pb.go:698 +0x184 fp=0x140088794a0 sp=0x140088793d0 pc=0x100efc2b4
github.com/babylonchain/finality-provider/clientcontroller/opstackl2.(*OPStackL2ConsumerController).SubmitBatchFinalitySigs(0x140073e0ea0, 0x140077f3f40, {0x140067cb590, 0xa, 0x14003c54470?}, {0x14004e355e0, 0xa, 0x103248870?}, {0x140052c4d80, 0xa, ...}, ...)
    /Users/zidong/Documents/Projects/babylon-finality-provider/clientcontroller/opstackl2/consumer.go:194 +0x188 fp=0x140088796f0 sp=0x140088794a0 pc=0x1032390c8
github.com/babylonchain/finality-provider/finality-provider/service.(*FinalityProviderInstance).SubmitBatchFinalitySignatures(0x14007093d40, {0x140067cb590, 0xa, 0xa})
    /Users/zidong/Documents/Projects/babylon-finality-provider/finality-provider/service/fp_instance.go:856 +0x2c8 fp=0x14008879800 sp=0x140088796f0 pc=0x10324e598
github.com/babylonchain/finality-provider/finality-provider/service.(*FinalityProviderInstance).FastSync(0x14007093d40, 0x103eb3d0d?, 0x1a)
    /Users/zidong/Documents/Projects/babylon-finality-provider/finality-provider/service/fastsync.go:83 +0x260 fp=0x140088799f0 sp=0x14008879800 pc=0x1032484c0
github.com/babylonchain/finality-provider/finality-provider/service.(*FinalityProviderInstance).tryFastSync(0x14007093d40, 0x1a)
    /Users/zidong/Documents/Projects/babylon-finality-provider/finality-provider/service/fp_instance.go:429 +0x11c fp=0x14008879b00 sp=0x140088799f0 pc=0x10324b7bc
github.com/babylonchain/finality-provider/finality-provider/service.(*FinalityProviderInstance).finalitySigSubmissionLoop(0x14007093d40)
    /Users/zidong/Documents/Projects/babylon-finality-provider/finality-provider/service/fp_instance.go:246 +0xe0 fp=0x14008879fb0 sp=0x14008879b00 pc=0x103249ae0
github.com/babylonchain/finality-provider/finality-provider/service.(*FinalityProviderInstance).Start.gowrap1()
    /Users/zidong/Documents/Projects/babylon-finality-provider/finality-provider/service/fp_instance.go:133 +0x28 fp=0x14008879fd0 sp=0x14008879fb0 pc=0x103249618
runtime.goexit({})
    /opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/asm_arm64.s:1222 +0x4 fp=0x14008879fd0 sp=0x14008879fd0 pc=0x1009cb474
created by github.com/babylonchain/finality-provider/finality-provider/service.(*FinalityProviderInstance).Start in goroutine 4246
    /Users/zidong/Documents/Projects/babylon-finality-provider/finality-provider/service/fp_instance.go:133 +0x580

added some debug logs: https://gist.github.com/bap2pecs/f75f3134ad7d71865aa011d548c46056

bap2pecs commented 4 months ago

EOTS manager related flaky failure. not sure if there is some race condition

    setup.go:490: Generating L2 genesis l2_allocs_mode delta
2024-07-04T20:43:51.039-0400    INFO    cosmos/log.go:131   Successful transaction  {"chain_id": "chain-test", "gas_used": 71372, "fees": "2086ubbn", "fee_payer": "Iۙ\u0001ڌYP\"\ufffd\u001e\ufffd\ufffdFC\ufffd\ufffdS<-", "height": 4, "msg_types": ["/babylon.btcstkconsumer.v1.MsgRegisterConsumer"], "tx_hash": "A1ADFA98B2566478825BB6E2C29D8B2BD315BB2328A9A000A21431018C561B15"}
    op_test_manager.go:150: Register consumer op-stack-l2-901 to Babylon
    op_test_manager.go:170: 
            Error Trace:    /Users/zidong/Documents/Projects/babylon-finality-provider/itest/opstackl2/op_test_manager.go:170
                                        /Users/zidong/Documents/Projects/babylon-finality-provider/itest/opstackl2/op_e2e_test.go:119
            Error:          Received unexpected error:
                            the EOTS manager server is not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:12582: connect: connection refused"
            Test:           TestOpMultipleFinalityProviders
--- FAIL: TestOpMultipleFinalityProviders (21.87s)

this throws at L170 in tm:

    eotsHomeDir := filepath.Join(testDir, "eots-home")
    eotsCfg := eotsconfig.DefaultConfigWithHomePath(eotsHomeDir)
    eh := e2eutils.NewEOTSServerHandler(t, eotsCfg, eotsHomeDir)
    eh.Start()
    eotsCli, err := client.NewEOTSManagerGRpcClient(cfg.EOTSManagerAddress)
    require.NoError(t, err) // Line 170

which throws at eotsmanager/client/rpcclient.go

    if err := gClient.Ping(); err != nil {
        return nil, fmt.Errorf("the EOTS manager server is not responding: %w", err)
    }
lesterli commented 4 months ago

After changing the value of defaultPollingInterval to 2s, the e2e test case TestDoubleSigning was failed.

babylon_test_manager.go:254: the block is finalized at 8
    babylon_e2e_test.go:99: the equivocation attack is successful
    babylon_test_manager.go:260: 
                Error Trace:    /Users/lyy/babylon/finality-provider/itest/babylon/babylon_test_manager.go:260
                                                        /Users/lyy/babylon/finality-provider/itest/babylon/babylon_e2e_test.go:101
                Error:          Condition never satisfied
                Test:           TestDoubleSigning
FAIL    github.com/babylonchain/finality-provider/itest/babylon 597.114s
bap2pecs commented 3 months ago

docker image build sometimes can also fail:

image