cosmos / cosmos-sdk


[Bug]: binary should exit with code 1 upon fatal or unrecoverable errors #20745

Open shupcode opened 4 months ago

shupcode commented 4 months ago


What happened?

@MSalopek - from https://github.com/cosmos/gaia/issues/3138#issuecomment-2180533897

Key observations during the gaia v17 chain halt (https://forum.cosmos.network/t/cosmos-hub-v17-1-chain-halt-post-mortem/13899):

Many container orchestration tools (like Docker Compose or Kubernetes) and monitoring systems treat non-zero exit codes as failure indicators, triggering alerts or automated recovery processes.

Allowing a node to remain active despite irrecoverable errors introduces numerous complex edge cases that need careful handling.

When a node in an error state continues to connect to peers and respond to API calls, it can produce misleading status indicators. For example, it may inaccurately report catching_up: false.

Monitoring latest_block_time is also challenging: distinguishing a legitimate chain halt (e.g., during an upgrade) from a fatal error that leaves latest_block_time stale can be problematic.

Maintaining peer connections during a fatal error serves little purpose and may exacerbate issues.

Conclusion: There is a crucial need for improved error identification to differentiate between errors requiring immediate node termination (triggering monitoring system alerts) and those that are recoverable. This distinction is essential for maintaining system integrity and reliability.
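
To make the distinction concrete, here is a minimal Go sketch of how a node entry point could map unrecoverable errors to exit code 1 while letting recoverable ones follow the normal path. The `ErrUnrecoverable` sentinel and `runNode` helper are hypothetical and only illustrate the classification; this is not SDK code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ErrUnrecoverable marks failures that should terminate the process,
// e.g. a consensus failure or a required upgrade at the current height.
// The sentinel is hypothetical and only illustrates the classification.
var ErrUnrecoverable = errors.New("unrecoverable node error")

// runNode stands in for the real node start-up / block-processing loop.
func runNode() error {
	// ... start services, process blocks ...
	return fmt.Errorf("UPGRADE \"v5.0.0\" NEEDED at height 17560000: %w", ErrUnrecoverable)
}

func main() {
	if err := runNode(); err != nil {
		fmt.Fprintln(os.Stderr, "node stopped:", err)
		if errors.Is(err, ErrUnrecoverable) {
			// A non-zero exit code lets Docker Compose / Kubernetes restart
			// policies and monitoring systems treat this as a failure.
			os.Exit(1)
		}
		// Recoverable errors: exit 0 or hand off to retry logic instead.
	}
}
```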

Cosmos SDK Version

0.47

How to reproduce?

No response

tac0turtle commented 4 months ago

The issue related to the Hub was in a transaction, from my understanding; correct me if I'm wrong. Panics are recovered in the case of transactions, as a panic can have many causes and usually means that the tx failed and execution should move on to the next transaction.
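
For illustration, a generic sketch of that recovery pattern (not the actual baseapp implementation; `execTx` is a hypothetical stand-in): a panic raised while executing a single transaction is converted into a failed tx result and execution moves on to the next transaction, while panics raised outside per-tx execution are not caught by this handler.

```go
package main

import "fmt"

// execTx stands in for executing a single transaction. The signature and
// behaviour are hypothetical; this only illustrates the recovery pattern.
func execTx(tx string) (result string, err error) {
	// Recover panics raised while executing this transaction: the tx is
	// marked as failed and execution moves on to the next one.
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("tx panicked: %v", r)
		}
	}()
	if tx == "bad" {
		panic("out of gas, invalid state access, etc.")
	}
	return "ok: " + tx, nil
}

func main() {
	for _, tx := range []string{"good-1", "bad", "good-2"} {
		res, err := execTx(tx)
		if err != nil {
			fmt.Println("tx failed:", err) // block processing continues
			continue
		}
		fmt.Println(res)
	}
	// A panic raised here, outside tx execution (e.g. in Begin/EndBlocker),
	// would not be recovered by the per-tx handler above.
}
```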

We do exit in many cases, but it seems this was due to an error that the state machine could not have known was unrecoverable. I agree that conditions like the app hash or last results hash differing from the rest of the network should cause the node to exit, but this would need to happen on the Comet side, since we have no knowledge of the mismatch: the hash has not yet reached us in the consensus process.

I'd like to understand what went wrong and how to prevent it, but it seems the framework acted as intended.

tac0turtle commented 4 months ago

Talking with Matija, I believe I have a better understanding of the issue. The chain halts, but the process does not exit. This should be a simple fix once we find where it needs to go 😄. We will look into it. Thanks for opening the issue.

MSalopek commented 4 months ago

Adding some logs for posterity.

The issue occurs at upgrade height, not when a Tx panics. Panics in Tx are handled correctly.

When the upgrade height is reached, a panic is raised stating that an upgrade is needed (expected). The node continues peering and does not shut down, which can lead to issues (unexpected).

The expected behaviour is for the node to fully shut down when a panic occurs in Begin/EndBlocker.

3:39PM INF indexed block events height=17559999 module=txindex
3:39PM INF Timed out dur=420.808051 height=17560000 module=consensus round=0 step=RoundStepNewHeight
3:39PM INF received proposal module=consensus proposal="Proposal{17560000/0 (37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE:1:000170E3AD91, -1) 9461678612CB @ 2024-06-06T15:39:44.733735052Z}" proposer=F243682DFC3A5AF6BE1934A51777DB33E2CB575C
3:39PM INF received complete proposal block hash=37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE height=17560000 module=consensus
3:39PM INF finalizing commit of block hash=37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE height=17560000 module=consensus num_txs=5 root=A4933AC4DAA9C2331A5B557B3094BAF48A69CABBC0ED3BA11A440D1AD7A6C620
3:39PM ERR UPGRADE "v5.0.0" NEEDED at height: 17560000: { "binaries": { "linux/amd64":"https://github.com/dydxprotocol/v4-chain/releases/download/protocol%2Fv5.0.0/dydxprotocold-v5.0.0-linux-amd64.tar.gz", "linux/arm64":"https://github.com/dydxprotocol/v4-chain/releases/download/protocol%2Fv5.0.0/dydxprotocold-v5.0.0-linux-arm64.tar.gz"}} module=x/upgrade
3:39PM ERR CONSENSUS FAILURE!!! err="runtime error: invalid memory address or nil pointer dereference" module=consensus stack="goroutine 1069 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:800 +0x46\npanic({0x47077a0?, 0x75ae2c0?})\n\truntime/panic.go:914 +0x21f\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock.func1()\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/baseapp/abci.go:893 +0x16a\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock(0xc0012866c0, 0xc1a0cba000)\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/baseapp/abci.go:925 +0x24c\ngithub.com/cosmos/cosmos-sdk/server.cometABCIWrapper.FinalizeBlock(...)\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/server/cmt_abci.go:44\ngithub.com/cometbft/cometbft/abci/client.(*unsyncLocalClient).FinalizeBlock(0xc057a78488?, {0x5627a30?, 0x77393c0?}, 0xc0?)\n\tgithub.com/cometbft/cometbft@v0.38.5/abci/client/unsync_local_client.go:137 +0x26\ngithub.com/cometbft/cometbft/proxy.(*appConnConsensus).FinalizeBlock(0xc00dae1110, {0x5627a30, 0x77393c0}, 0x1?)\n\tgithub.com/cometbft/cometbft@v0.38.5/proxy/app_conn.go:104 +0x170\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc00bb4b90a, 0x6}}, {0xc00bb4b9a0, 0xe}, 0x1, 0x10bf1bf, {{0xc2f987ada0, ...}, ...}, ...}, ...)\n\tgithub.com/cometbft/cometbft@v0.38.5/state/execution.go:226 +0x837\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc001353180, 0x10bf1c0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1833 +0xb7d\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc001353180, 0x10bf1c0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1744 +0x349\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1679 +0x9c\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc001353180, 0x10bf1c0, 0x0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1717 +0xcd0\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc001353180, 0xc2be783ee0, {0xc002fa8150, 0x28})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:2371 +0x1be5\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc001353180, 0xc2be783ee0, {0xc002fa8150?, 0xc057a79c08?})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:2123 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc001353180, {{0x55f3740, 0xc1a08187c8}, {0xc002fa8150, 0x28}})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:928 +0x3ce\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc001353180, 0x0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:835 +0x3d1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 69\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:397 +0x10c\n"

// node is expected to shut down after printing 'CONSENSUS FAILURE'
// instead, the node continues maintaining network connections 

3:39PM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/mnt/storage/dydx-live/.dydxprotocol/data/cs.wal/wal
3:39PM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/mnt/storage/dydx-live/.dydxprotocol/data/cs.wal/wal
3:39PM INF service stop impl=MConn{176.9.29.248:23856} module=p2p msg="Stopping MConnection service" peer=84995de5103a49513ab84f1b4f3355bc85626a37@176.9.29.248:23856
3:39PM ERR Stopping peer for error err="read tcp 46.4.72.249:49746->176.9.29.248:23856: read: connection reset by peer" module=p2p peer="Peer{MConn{176.9.29.248:23856} 84995de5103a49513ab84f1b4f3355bc85626a37 out}"
3:39PM INF service stop impl="Peer{MConn{176.9.29.248:23856} 84995de5103a49513ab84f1b4f3355bc85626a37 out}" module=p2p msg="Stopping Peer service" peer=84995de5103a49513ab84f1b4f3355bc85626a37@176.9.29.248:23856
3:39PM INF service stop impl=MConn{85.190.134.37:26656} module=p2p msg="Stopping MConnection service" peer=3cd3146398b56a55cfda61eca427f6bb2cf60601@85.190.134.37:26656
3:39PM ERR Stopping peer for error err="read tcp 46.4.72.249:41384->85.190.134.37:26656: read: connection reset by peer" module=p2p peer="Peer{MConn{85.190.134.37:26656} 3cd3146398b56a55cfda61eca427f6bb2cf60601 out}"
3:39PM INF service stop impl="Peer{MConn{85.190.134.37:26656} 3cd3146398b56a55cfda61eca427f6bb2cf60601 out}" module=p2p msg="Stopping Peer service" peer=3cd3146398b56a55cfda61eca427f6bb2cf60601@85.190.134.37:26656
3:39PM INF service stop impl=MConn{[2a01:4f9:3a:11e0::5]:26656} module=p2p msg="Stopping MConnection service" peer=58ea81df95728fafae81c011462432a72d18ad25@[2a01:4f9:3a:11e0::5]:26656
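
Until the process exits on its own after a failure like the one logged above, an external liveness check is one possible workaround. Below is a minimal Go sketch (the RPC address, the 5-minute threshold, and the approach itself are assumptions, not part of this report) that queries the CometBFT /status endpoint and turns a stale latest_block_time into a non-zero exit code. As noted earlier, a scheduled halt such as an upgrade looks the same, so the threshold must account for planned downtime.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// statusResponse models only the fields of the CometBFT /status RPC
// response that this check needs.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockTime time.Time `json:"latest_block_time"`
			CatchingUp      bool      `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

func main() {
	// RPC address and staleness threshold are assumptions for this sketch.
	resp, err := http.Get("http://localhost:26657/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "rpc unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var st statusResponse
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		fmt.Fprintln(os.Stderr, "bad status response:", err)
		os.Exit(1)
	}

	// A node stuck after CONSENSUS FAILURE keeps answering /status, often
	// with catching_up=false, so block-time staleness is the signal here.
	if age := time.Since(st.Result.SyncInfo.LatestBlockTime); age > 5*time.Minute {
		fmt.Fprintf(os.Stderr, "latest block is %s old; treating node as down\n", age)
		os.Exit(1)
	}
	fmt.Println("node is producing blocks")
}
```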
MSalopek commented 3 months ago

> The issue related to the Hub was in a transaction, from my understanding; correct me if I'm wrong. Panics are recovered in the case of transactions, as a panic can have many causes and usually means that the tx failed and execution should move on to the next transaction.
>
> We do exit in many cases, but it seems this was due to an error that the state machine could not have known was unrecoverable. I agree that conditions like the app hash or last results hash differing from the rest of the network should cause the node to exit, but this would need to happen on the Comet side, since we have no knowledge of the mismatch: the hash has not yet reached us in the consensus process.
>
> I'd like to understand what went wrong and how to prevent it, but it seems the framework acted as intended.

Clarifying some of the assumptions from above: the panics are not Tx related; they occur during FinalizeBlock operations.

Adding more findings from gaia testnets: the same behaviour was observed when using cosmos-sdk v0.50:

julienrbrt commented 1 month ago

Blocked on https://github.com/cometbft/cometbft/issues/4054