Closed: cloud8little closed this issue 1 month ago
Thank you for pointing out this issue. I can reproduce it by following your instructions, but it is very difficult to debug. I have mapped out a list of possible strategies and will try again tomorrow.
In the `x/oracle` module, the function `GetAggregatorContext` is called at the end of each block.
https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/module.go#L156-L160
The function is designed to short circuit if `agc != nil`.
https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/single.go#L27-L30
If a node is restarted, `agc` becomes `nil` and is therefore reinitialized via `initAggregatorContext`.
https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/single.go#L34-L38
Within this function, the validators are added to the cache. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/single.go#L132
This results in `cache.validators.update` (as well as `cache.params.update`) becoming `true`.
https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/cache/caches.go#L95-L111
Meanwhile, on the other nodes the same boolean is `false`. This flag determines whether the cache (or rather, the height at which the cache was modified) is committed to disk by calling `CommitCache` within the `EndBlock` function.
https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/cache/caches.go#L199-L202
This mismatch in the boolean value results in the restarted node saving a different block height for the validator update block than the other nodes, which, of course, results in the app hash (state root) mismatch. https://github.com/ExocoreNetwork/exocore/blob/9f1b6a8d97d37596c8dda99f362d69c5799547c2/x/oracle/keeper/cache/caches.go#L113-L116
Closing as it's been resolved.
Summary of Bug
Start consensus with three validators, stop one node for 2~3 seconds, and start it again; the restarted node fails to rejoin consensus with the other two nodes.
Version
https://github.com/ExocoreNetwork/exocore/pull/49/commits/1d0ac52afd28e24a10d6973af518af2e3a9f633d from pr https://github.com/ExocoreNetwork/exocore/pull/49
Steps to Reproduce
Screenshots
[NOTE] Node2 and Node3 continue to generate new blocks, but Node1 is not able to join the consensus.
Node1 Log: