cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.08k stars 3.5k forks source link

[Question]: Cosmos added module upgrade encountered consensus issues #20746

Open lmkdbd opened 2 weeks ago

lmkdbd commented 2 weeks ago

Is there an existing issue for this?

What happened?

We encountered a strange issue when upgrading with cosmovisor. When attempting to add a module on multiple validator nodes, everything runs fine if all validators remain running post-upgrade. However, manually stopping and restarting a validator node results in a consensus issue. The specific error is:

ERR prevote step: consensus deems this block invalid; prevoting nil err="wrong Block.Header.AppHash.  Expected ..., got ..."

We spent several weeks troubleshooting this, including:

  1. Reviewing code for random values, time, maps, etc.
  2. Using iaviewer to inspect data across nodes and comparing exported leveldb data
  3. Incrementally reducing code complexity and repeatedly upgrading to pinpoint the issue
  4. Migrating the module to simapp to check if the issue could be reproduced

We ultimately found that having multiple structs in the keeper and modifying one of the struct's values in EndBlocker causes the consensus failure. We created a minimal module based on v0.50.6 to test the upgrade, adding an almost empty testmodule with only the UpdateParams method and the data structure to be tested.

The changes based on v0.50.6 are here: https://github.com/cosmos/cosmos-sdk/compare/v0.50.6...lmkdbd:cosmos-sdk:test

In the following cases, the upgrade version is fully compiled from v0.50.6 code:

  1. When the keeper only declares the Params struct and attempts to modify Params in EndBlocker, stopping and restarting a validator node post-upgrade works fine. Successful upgrade commit: https://github.com/cosmos/cosmos-sdk/commit/3ab5d101c50afbe10406456b990c22d9bbc8bfe7
  2. When the keeper declares both Params and TestStruct, and attempts to modify TestStruct in EndBlocker, stopping and restarting a validator node post-upgrade results in consensus failure. The difference from the successful upgrade commit is here: https://github.com/cosmos/cosmos-sdk/commit/d97a912f536279a0b021bfadfb70f75850f44040

We hope someone can help identify the issue.

Our software versions are:

cosmovisor: v0.50.0 cosmos-sdk: v0.50.6

Cosmos SDK Version

0.50.6

How to reproduce?

  1. Compile simapp based on v0.50.6

  2. Start two validator nodes using cosmovisor

  3. Compile the upgraded simapp version using the [test](https://github.com/lmkdbd/cosmos-sdk/tree/test) branch

  4. Place the upgraded simapp version in cosmovisor under the directory named test-module

  5. Submit an upgrade proposal named test-module

  6. After the upgrade completes, let the nodes produce a few more blocks

  7. Stop one of the validator nodes and restart

beer-1 commented 1 week ago

Maybe related with https://github.com/cosmos/iavl/pull/880

iavl had problem for adding new store(module) and we patched it but release does not contain the patch yet. So we are also waiting the release which includes the patch.

lmkdbd commented 1 week ago

Maybe related with cosmos/iavl#880

iavl had problem for adding new store(module) and we patched it but release does not contain the patch yet. So we are also waiting the release which includes the patch.

Thank you very much! We haven't looked into the deeper code logic and could only observe the behavior at the Cosmos-SDK level. Based on the issue you provided, we finally understand the possible cause of the problem. Once the code is merged into Cosmos-SDK, we will also try to see if it resolves the issue.

zhi-lu commented 1 week ago

@beer-1 Good brother, thank you very much for solving our big problem.

tac0turtle commented 1 week ago

what version of iavl are you using?

also are you writing the structs to storage or keeping them in memory?

looking at the linked code im not sure what the issue is?

the issue linked from iavl is very old. No one else is having this issue i dont think its related to what beer-1 is saying

lmkdbd commented 1 week ago

what version of iavl are you using?

also are you writing the structs to storage or keeping them in memory?

looking at the linked code im not sure what the issue is?

the issue linked from iavl is very old. No one else is having this issue i dont think its related to what beer-1 is saying

Currently, we are testing by directly writing the code mentioned in the issue https://github.com/cosmos/iavl/pull/880 into the iavl package.

I checked the source code of iavl, and although the code was merged, it was removed in this commit: https://github.com/cosmos/iavl/commit/2894221156b29321c8bea3ffab8c48e4748cd741. You can search for the following code in this commit to confirm the deletion:

tree.version = int64(tree.ndb.opts.InitialVersion)

I also checked versions v1.1.1 and v1.1.2 of iavl, which are currently used by cosmos-sdk up to v0.50.7. In the v1.1.1 version, found here: https://github.com/cosmos/iavl/blob/v1.1.1/mutable_tree.go, the mentioned code is not present.

However, in v1.1.3, it seems that a consensus issue was fixed, which might be another approach to solving our problem. The commit is here: https://github.com/cosmos/iavl/commit/86c273df26984df19bc1b548a95d98d768b50918. I'm not sure if this fix will resolve our issue.

But what I can confirm is that neither v1.1.1 nor v1.1.2 of iavl contains either of these commits.

tac0turtle commented 1 week ago

please use the latest version of iavl 1.1.x as 1.1.1 and 1.1.2 had issues. We retracted those versions.

lmkdbd commented 1 week ago

please use the latest version of iavl 1.1.x as 1.1.1 and 1.1.2 had issues. We retracted those versions.

We noticed that the new version of iavl has fixed a consensus error. The main issue now is that the latest release version of cosmos-sdk is v0.50.7, which depends on iavl version v1.1.2. We need to confirm whether future releases of cosmos-sdk will update the iavl dependency to v1.1.x so that we can update our modules by referencing cosmos-sdk v0.50.x. Currently, the cosmos-sdk v0.50.x branch still references iavl version v1.1.2.

scirner22 commented 1 day ago

https://github.com/cosmos/iavl/blob/master/CHANGELOG.md

@tac0turtle Is https://github.com/cosmos/iavl/pull/943 (1.1.4) the fix for this issue?

tac0turtle commented 8 hours ago

yes this should fix the issue