ExocoreNetwork / exocore


Consensus Validator Stopped and Started Again Fails to Rejoin Consensus #51

Closed cloud8little closed 1 month ago

cloud8little commented 2 months ago

Summary of Bug

Three validators are running consensus. Stop one node for 2~3 seconds and start it again; it fails to rejoin consensus with the other two nodes.

Version

https://github.com/ExocoreNetwork/exocore/pull/49/commits/1d0ac52afd28e24a10d6973af518af2e3a9f633d from pr https://github.com/ExocoreNetwork/exocore/pull/49

Steps to Reproduce

  1. Export the testnet genesis file here.
  2. Copy the genesis file to the three nodes.
  3. Start node1~node3.
  4. Stop node1 (kill the exocored process).
  5. Start node1 (use `exocored start` to start the node again).

Screenshots

[NOTE] Node2 and node3 continue to generate new blocks, but node1 is not able to rejoin consensus.

Node1 Log:

8:24AM INF commit is for a block we do not know about; set ProposalBlock=nil commit=C4D8207B248E7528537C09A84884E998413C4CB9170B04D4FB7F9D37DA6938B1 commit_round=0 height=3516111 module=consensus proposal= server=node
8:24AM INF received complete proposal block hash=C4D8207B248E7528537C09A84884E998413C4CB9170B04D4FB7F9D37DA6938B1 height=3516111 module=consensus server=node
8:24AM ERR CONSENSUS FAILURE!!! err="+2/3 committed an invalid block: wrong Block.Header.AppHash.  Expected 3E40D4654A1A9443A54C2FAB00DBED6C424D29EE26663C70AFDA11082888112A, got 5D4A868F0E2F1D3F39F8DFA1A90F0DBF8248D5C9DC07E0AEA53BA149CEAD24D4" module=consensus server=node stack="goroutine 139 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:732 +0x46\npanic({0x2d5d580?, 0xc00181aac0?})\n\truntime/panic.go:914 +0x21f\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc001803180, 0x35a6cf)\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:1640 +0xfb3\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc001803180, 0x35a6cf)\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:1609 +0x2f6\ngithub.com/cometbft/cometbft/consensus.(*State).handleCompleteProposal(0xc001803180, 0xc001a46700?)\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:1995 +0x37d\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc001803180, {{0x40f01c0, 0xc00281c780}, {0xc00281a8d0, 0x28}})\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:842 +0x1a5\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc001803180, 0x0)\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:768 +0x3d1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 121\n\tgithub.com/cometbft/cometbft@v0.37.2/consensus/state.go:379 +0x10c\n"


MaxMustermann2 commented 2 months ago

Thank you for pointing out this issue. I can reproduce it by following your instructions, but it is very difficult to debug. I have mapped out a list of possible strategies and will try again tomorrow.

MaxMustermann2 commented 1 month ago

In the x/oracle module, the function GetAggregatorContext is called at the end of each block. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/module.go#L156-L160

The function is designed to short-circuit if agc != nil. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/single.go#L27-L30

When a node is restarted, agc starts out nil (it lives only in process memory), so it is reinitialized via initAggregatorContext. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/single.go#L34-L38
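The pattern above can be sketched as a process-local memoized singleton. This is a simplified illustration (the type and function names here are made up, not the module's actual API): the expensive initialization runs only when agc is nil, which is exactly the state a freshly restarted process is in.

```go
package main

import "fmt"

// aggregatorContext is a stand-in for the oracle module's in-memory
// aggregator state (names are illustrative only).
type aggregatorContext struct {
	initialized bool
}

// agc lives in process memory, so a node restart resets it to nil.
var agc *aggregatorContext

// getAggregatorContext mirrors the short-circuit pattern: it returns
// the cached instance if one exists, otherwise it initializes one.
func getAggregatorContext() *aggregatorContext {
	if agc != nil {
		return agc // fast path taken on every block after the first
	}
	agc = &aggregatorContext{initialized: true}
	return agc
}

func main() {
	a := getAggregatorContext()
	b := getAggregatorContext()
	fmt.Println(a == b) // same instance within one process lifetime

	// Simulate a node restart: process memory is gone, agc is nil,
	// so initialization re-runs -- unlike on nodes that kept running.
	agc = nil
	c := getAggregatorContext()
	fmt.Println(c == a)
}
```

The key point is that the re-initialization path executes on the restarted node only, so any side effects inside it are not replayed on the nodes that stayed up.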

Within this function, the validators are added to the cache. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/single.go#L132

This results in cache.validators.update (as well as cache.params.update) becoming true. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/cache/caches.go#L95-L111

Meanwhile, on the other nodes the same boolean is false. The boolean gates whether the cache (or rather, the height at which the cache was modified) is committed to disk by CommitCache within the EndBlock function. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/cache/caches.go#L199-L202

This mismatch in the boolean value results in the restarted node saving a different block height for the validator update than the other nodes, which in turn produces the app hash (state root) mismatch. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/cache/caches.go#L113-L116
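The divergence can be condensed into a small sketch (again with invented names, not the module's real types): a flag that is set only by the restart-time re-initialization causes EndBlock to write state on one node but not the others, so the two nodes finish the block with different committed state.

```go
package main

import "fmt"

// cacheState models one node's oracle cache: the update flag and the
// height that gets persisted to state (names are illustrative only).
type cacheState struct {
	validatorsUpdated bool
	committedHeight   int64
}

// endBlock mimics the CommitCache call in EndBlock: the current height
// is written to state only when the update flag is set.
func endBlock(c *cacheState, height int64) {
	if c.validatorsUpdated {
		c.committedHeight = height
		c.validatorsUpdated = false
	}
}

func main() {
	// Node that kept running: the flag was never set.
	running := &cacheState{validatorsUpdated: false, committedHeight: 100}
	// Restarted node: re-initialization set the flag as a side effect.
	restarted := &cacheState{validatorsUpdated: true, committedHeight: 100}

	// Both nodes process the same block...
	height := int64(3516111)
	endBlock(running, height)
	endBlock(restarted, height)

	// ...but commit different state, so their app hashes diverge.
	fmt.Println(running.committedHeight, restarted.committedHeight)
}
```

Since the committed height feeds into the state root, any single-node difference here is enough to trigger the "wrong Block.Header.AppHash" consensus failure seen in the log above.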

cloud8little commented 1 month ago

Closing, as it's been resolved.