babylonchain / finality-provider

A peripheral program run by the finality providers

When e2e failed, background daemons are still running #397

Open bap2pecs opened 4 months ago

bap2pecs commented 4 months ago

this gave us lots of trouble when debugging a failed test

$ make test-e2e-op                                                                             [18:13:35]
cd tools; \
        go install -trimpath github.com/babylonchain/babylon/cmd/babylond
go test -mod=readonly -timeout=25m -v github.com/babylonchain/finality-provider/itest github.com/babylonchain/finality-provider/itest/opstackl2 -count=1 --tags=e2e_op
?       github.com/babylonchain/finality-provider/itest [no test files]
=== RUN   TestSubmitFinalitySignature
service injective.evm.v1beta1.Msg does not have cosmos.msg.v1.service proto annotation
service injective.evm.v1beta1.Msg does not have cosmos.msg.v1.service proto annotation
    test_manager.go:105: Babylon node is started
2024/06/21 18:17:56 Cannot remove dir 1
2024/06/21 18:17:56 Cannot remove dir 2
    test_manager.go:110: 
                Error Trace:    /Users/<redacted>/Documents/Projects/babylon-finality-provider/itest/opstackl2/test_manager.go:110
                                                        /opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/panic.go:770
                                                        /Users/<redacted>/Documents/Projects/babylon-finality-provider/cosmwasmclient/client/keys.go:18
                                                        /Users/<redacted>/Documents/Projects/babylon-finality-provider/itest/opstackl2/e2e_test.go:42
                                                        /Users/<redacted>/Documents/Projects/babylon-finality-provider/itest/opstackl2/e2e_test.go:77
                Error:          Received unexpected error:
                                exit status 1
                Test:           TestSubmitFinalitySignature
--- FAIL: TestSubmitFinalitySignature (1.88s)
FAIL
FAIL    github.com/babylonchain/finality-provider/itest/opstackl2       2.833s
FAIL
make: *** [test-e2e-op] Error 1

then we realized it's b/c some processes from the earlier run were still running:

$ ps                                                                                           [18:17:56]
  PID TTY           TIME CMD
 8321 ttys001    2:12.87 babylond start --home=/var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/zBabylonTest2191261572/node0/babyl
 8329 ttys001    2:17.42 wasmd start --home /var/folders/9_/q4wsdnh14_s60_74cd2rbztm0000gp/T/ZWasmdTest3482039778 --rpc.laddr tcp:
92472 ttys001    0:00.29 /bin/zsh -il
99267 ttys033    0:00.21 -zsh

we found out the panic happened inside

func (n *babylonNode) stop() (err error) {
    if n.cmd == nil || n.cmd.Process == nil {
        // return if not properly initialized
        // or error starting the process
        return nil
    }

    defer func() {
        err = n.cmd.Wait()
    }()

    if runtime.GOOS == "windows" {
        return n.cmd.Process.Signal(os.Kill)
    }
    return n.cmd.Process.Signal(os.Interrupt)
}

we should have a better way to deal with it here

bap2pecs commented 4 months ago

today we found it's due to code like this:

func fatal(err error) {
    fmt.Fprintf(os.Stderr, "[fpd] %v\n", err)
    os.Exit(1)
}

so when os.Exit() is called, the process terminates immediately without running deferred functions. os.Exit never returns to its caller, so no function on the stack gets to unwind and the defer mechanism is bypassed entirely.
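A common way around this, sketched here under assumptions rather than taken from the repo, is to return errors up to a single exit point instead of calling os.Exit deep in the call stack. The function `run` and its `cleanup` parameter are hypothetical names used for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// run (hypothetical) holds the real program logic. It reports failure by
// returning an error instead of calling os.Exit, so its deferred cleanup
// always executes before control gets back to the caller.
func run(cleanup func()) error {
	defer cleanup()
	return errors.New("simulated startup failure")
}

func main() {
	cleaned := false
	err := run(func() { cleaned = true })
	fmt.Println("cleanup ran:", cleaned, "error:", err)
	// A real main would call os.Exit(1) here when err != nil. That is safe
	// at this point because run has already returned and its defers are done.
}
```

With this shape, a helper like fatal would only ever be called from main, after all cleanup has finished.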

so cleanup code like this is never executed:

func (ctm *OpL2ConsumerTestManager) Stop(t *testing.T) {
    var err error
    err = ctm.FpApp.Stop()
    require.NoError(t, err)
    err = ctm.BabylonHandler.Stop()
    require.NoError(t, err)
    ctm.EOTSServerHandler.Stop()
}

thus leaving some processes dangling

cc @SebastianElvis

SebastianElvis commented 4 months ago

Yeah, we are aware of this issue, and great work finding the root cause! Looks like we need a more graceful way to terminate the program than os.Exit(1).