cosmos / gaia

Cosmos Hub
https://hub.cosmos.network
Apache License 2.0

[Bug]: Node AppHash'ing after v19.x.x upgrade #3344

Open a26nine opened 1 month ago

a26nine commented 1 month ago

What happened?

Our cosmoshub-4 archive nodes stopped progressing after the v19 upgrade, so we downloaded the archive snapshot from QuickSync. The nodes progressed smoothly for a while, but then they AppHash'd. We waited a few days and downloaded another snapshot from the same source, with the same result: the node AppHash'd after some time. Once more, we waited a few days for a new snapshot, downloaded it, and got AppHash'd again.

The most recent AppHash happened on v19.2.0:

Sep 18 08:42:30 cosmovisor[7778]: 8:42AM ERR Error in validation err="wrong Block.Header.AppHash.  Expected FD8867280ABAD4DF361A7285B14822FC3ABE20CB4BDF8B87DDDF152C9A772450, got A4E3A67B70A3FF6C5C6EA6C9B2BDA694E886C6D2B1710F51182843C5AA5A887B" module=blocksync
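When comparing failures across several resyncs, it can help to pull the two hashes out of each error line mechanically. The following is a minimal sketch in Python, not part of gaiad or CometBFT; the function name `parse_apphash_error` and the regular expression are assumptions based only on the message format shown above.

```python
import re

# Pattern for the validation message quoted above (an assumption based on
# that one sample). \s+ tolerates one or two spaces after "AppHash.".
APPHASH_RE = re.compile(
    r"wrong Block\.Header\.AppHash\.\s+Expected (?P<expected>[0-9A-Fa-f]{64}), "
    r"got (?P<got>[0-9A-Fa-f]{64})"
)


def parse_apphash_error(line):
    """Return (expected, got) as upper-case hex strings, or None if no match."""
    match = APPHASH_RE.search(line)
    if match is None:
        return None
    return match.group("expected").upper(), match.group("got").upper()


# The log line from this report, split across string literals for readability.
line = (
    'Sep 18 08:42:30 cosmovisor[7778]: 8:42AM ERR Error in validation '
    'err="wrong Block.Header.AppHash.  Expected '
    'FD8867280ABAD4DF361A7285B14822FC3ABE20CB4BDF8B87DDDF152C9A772450, got '
    'A4E3A67B70A3FF6C5C6EA6C9B2BDA694E886C6D2B1710F51182843C5AA5A887B" '
    'module=blocksync'
)

print(parse_apphash_error(line))
```

Collecting these pairs from every failed attempt makes it easy to see whether the nodes diverge with the same hash each time or with a different one.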

I am not sure who/what is the culprit here—the snapshot, the binary, or something else?

We rolled back a few times and cleared the wasm directory before starting gaiad. We also tried running with the pre-built binaries supplied in the Releases section. None of it helped.

Our build process:

Long Version:

./gaiad version --long
build_deps:
- cloud.google.com/go@v0.114.0
- cloud.google.com/go/auth@v0.5.1
- cloud.google.com/go/auth/oauth2adapt@v0.2.2
- cloud.google.com/go/compute/metadata@v0.3.0
- cloud.google.com/go/iam@v1.1.8
- cloud.google.com/go/storage@v1.41.0
- cosmossdk.io/api@v0.7.5 => github.com/informalsystems/cosmos-sdk/api@v0.7.5-lsm
- cosmossdk.io/client/v2@v2.0.0-beta.3
- cosmossdk.io/collections@v0.4.0
- cosmossdk.io/core@v0.11.1
- cosmossdk.io/depinject@v1.0.0
- cosmossdk.io/errors@v1.0.1
- cosmossdk.io/log@v1.3.1
- cosmossdk.io/math@v1.3.0
- cosmossdk.io/simapp@v0.0.0-20240118210941-3897926e722e
- cosmossdk.io/store@v1.1.0
- cosmossdk.io/tools/confix@v0.1.1
- cosmossdk.io/tools/rosetta@v0.2.1-0.20230613133644-0a778132a60f
- cosmossdk.io/x/circuit@v0.1.1
- cosmossdk.io/x/evidence@v0.1.1
- cosmossdk.io/x/feegrant@v0.1.1
- cosmossdk.io/x/nft@v0.1.1
- cosmossdk.io/x/tx@v0.13.4
- cosmossdk.io/x/upgrade@v0.1.4
- filippo.io/edwards25519@v1.1.0
- github.com/99designs/keyring@v1.2.2 => github.com/cosmos/keyring@v1.2.0
- github.com/CosmWasm/wasmd@v0.51.0
- github.com/CosmWasm/wasmvm/v2@v2.0.0
- github.com/DataDog/datadog-go@v3.2.0+incompatible
- github.com/aws/aws-sdk-go@v1.44.224
- github.com/beorn7/perks@v1.0.1
- github.com/bgentry/go-netrc@v0.0.0-20140422174119-9fd32a8b3d3d
- github.com/bgentry/speakeasy@v0.1.1-0.20220910012023-760eaf8b6816
- github.com/bits-and-blooms/bitset@v1.8.0
- github.com/btcsuite/btcd/btcec/v2@v2.3.2
- github.com/cenkalti/backoff/v4@v4.1.3
- github.com/cespare/xxhash/v2@v2.3.0
- github.com/chzyer/readline@v1.5.1
- github.com/cockroachdb/apd/v2@v2.0.2
- github.com/cockroachdb/errors@v1.11.1
- github.com/cockroachdb/logtags@v0.0.0-20230118201751-21c54148d20b
- github.com/cockroachdb/redact@v1.1.5
- github.com/coinbase/rosetta-sdk-go@v0.7.9
- github.com/cometbft/cometbft@v0.38.11
- github.com/cometbft/cometbft-db@v0.12.0
- github.com/cosmos/btcutil@v1.0.5
- github.com/cosmos/cosmos-db@v1.0.2
- github.com/cosmos/cosmos-proto@v1.0.0-beta.5
- github.com/cosmos/cosmos-sdk@v0.50.9 => github.com/cosmos/cosmos-sdk@v0.50.9-lsm
- github.com/cosmos/go-bip39@v1.0.0
- github.com/cosmos/gogogateway@v1.2.0
- github.com/cosmos/gogoproto@v1.6.0
- github.com/cosmos/iavl@v1.1.2
- github.com/cosmos/ibc-apps/middleware/packet-forward-middleware/v8@v8.0.2
- github.com/cosmos/ibc-apps/modules/rate-limiting/v8@v8.0.0
- github.com/cosmos/ibc-go/modules/capability@v1.0.0
- github.com/cosmos/ibc-go/v8@v8.4.0
- github.com/cosmos/ics23/go@v0.10.0
- github.com/cosmos/interchain-security/v5@v5.2.0
- github.com/cosmos/rosetta-sdk-go@v0.10.0
- github.com/creachadair/atomicfile@v0.3.3
- github.com/creachadair/tomledit@v0.0.26
- github.com/davecgh/go-spew@v1.1.2-0.20180830191138-d8f796af33cc
- github.com/decred/dcrd/dcrec/secp256k1/v4@v4.2.0
- github.com/desertbit/timer@v0.0.0-20180107155436-c41aec40b27f
- github.com/distribution/reference@v0.5.0
- github.com/dvsekhvalnov/jose2go@v1.7.0
- github.com/emicklei/dot@v1.6.2
- github.com/fatih/color@v1.17.0
- github.com/felixge/httpsnoop@v1.0.4
- github.com/fsnotify/fsnotify@v1.7.0
- github.com/getsentry/sentry-go@v0.27.0
- github.com/go-kit/kit@v0.12.0
- github.com/go-kit/log@v0.2.1
- github.com/go-logfmt/logfmt@v0.6.0
- github.com/go-logr/logr@v1.4.2
- github.com/go-logr/stdr@v1.2.2
- github.com/godbus/dbus@v0.0.0-20190726142602-4481cbc300e2
- github.com/gogo/googleapis@v1.4.1
- github.com/gogo/protobuf@v1.3.2
- github.com/golang/groupcache@v0.0.0-20210331224755-41bb18bfe9da
- github.com/golang/mock@v1.6.0
- github.com/golang/protobuf@v1.5.4
- github.com/golang/snappy@v0.0.5-0.20220116011046-fa5810519dcb
- github.com/google/btree@v1.1.2
- github.com/google/go-cmp@v0.6.0
- github.com/google/gofuzz@v1.2.0
- github.com/google/orderedcode@v0.0.1
- github.com/google/s2a-go@v0.1.7
- github.com/google/uuid@v1.6.0
- github.com/googleapis/enterprise-certificate-proxy@v0.3.2
- github.com/googleapis/gax-go/v2@v2.12.4
- github.com/gorilla/handlers@v1.5.2
- github.com/gorilla/mux@v1.8.1
- github.com/gorilla/websocket@v1.5.0
- github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0
- github.com/grpc-ecosystem/grpc-gateway@v1.16.0
- github.com/gsterjov/go-libsecret@v0.0.0-20161001094733-a6f4afe4910c
- github.com/hashicorp/go-cleanhttp@v0.5.2
- github.com/hashicorp/go-getter@v1.7.5
- github.com/hashicorp/go-hclog@v1.5.0
- github.com/hashicorp/go-immutable-radix@v1.3.1
- github.com/hashicorp/go-metrics@v0.5.3
- github.com/hashicorp/go-plugin@v1.5.2
- github.com/hashicorp/go-safetemp@v1.0.0
- github.com/hashicorp/go-version@v1.7.0
- github.com/hashicorp/golang-lru@v1.0.2
- github.com/hashicorp/golang-lru/v2@v2.0.7
- github.com/hashicorp/hcl@v1.0.0
- github.com/hashicorp/yamux@v0.1.1
- github.com/hdevalence/ed25519consensus@v0.1.0
- github.com/huandu/skiplist@v1.2.0
- github.com/iancoleman/orderedmap@v0.3.0
- github.com/iancoleman/strcase@v0.3.0
- github.com/improbable-eng/grpc-web@v0.15.0
- github.com/jmespath/go-jmespath@v0.4.0
- github.com/klauspost/compress@v1.17.7
- github.com/kr/pretty@v0.3.1
- github.com/kr/text@v0.2.0
- github.com/lib/pq@v1.10.9
- github.com/magiconair/properties@v1.8.7
- github.com/manifoldco/promptui@v0.9.0
- github.com/mattn/go-colorable@v0.1.13
- github.com/mattn/go-isatty@v0.0.20
- github.com/minio/highwayhash@v1.0.2
- github.com/mitchellh/go-homedir@v1.1.0
- github.com/mitchellh/go-testing-interface@v1.14.1
- github.com/mitchellh/mapstructure@v1.5.0
- github.com/mtibben/percent@v0.2.1
- github.com/oasisprotocol/curve25519-voi@v0.0.0-20230904125328-1f23a7beb09a
- github.com/oklog/run@v1.1.0
- github.com/opencontainers/go-digest@v1.0.0
- github.com/pelletier/go-toml/v2@v2.2.2
- github.com/pkg/errors@v0.9.1
- github.com/pmezard/go-difflib@v1.0.1-0.20181226105442-5d4384ee4fb2
- github.com/prometheus/client_golang@v1.19.0
- github.com/prometheus/client_model@v0.6.1
- github.com/prometheus/common@v0.52.2
- github.com/prometheus/procfs@v0.13.0
- github.com/rakyll/statik@v0.1.7
- github.com/rcrowley/go-metrics@v0.0.0-20201227073835-cf1acfcdf475
- github.com/rogpeppe/go-internal@v1.12.0
- github.com/rs/cors@v1.8.3
- github.com/rs/zerolog@v1.33.0
- github.com/sagikazarmark/slog-shim@v0.1.0
- github.com/skip-mev/feemarket@v1.1.0
- github.com/spf13/afero@v1.11.0
- github.com/spf13/cast@v1.6.0
- github.com/spf13/cobra@v1.8.1
- github.com/spf13/pflag@v1.0.5
- github.com/spf13/viper@v1.19.0
- github.com/stretchr/testify@v1.9.0
- github.com/subosito/gotenv@v1.6.0
- github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d => github.com/syndtr/goleveldb@v1.0.1-0.20210819022825-2ae1ddf74ef7
- github.com/tendermint/go-amino@v0.16.0
- github.com/tidwall/btree@v1.7.0
- github.com/ulikunitz/xz@v0.5.11
- go.opencensus.io@v0.24.0
- go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.49.0
- go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0
- go.opentelemetry.io/otel@v1.24.0
- go.opentelemetry.io/otel/metric@v1.24.0
- go.opentelemetry.io/otel/trace@v1.24.0
- golang.org/x/crypto@v0.26.0
- golang.org/x/exp@v0.0.0-20240404231335-c0f41cb1a7a0
- golang.org/x/net@v0.28.0
- golang.org/x/oauth2@v0.20.0
- golang.org/x/sync@v0.8.0
- golang.org/x/sys@v0.23.0
- golang.org/x/term@v0.23.0
- golang.org/x/text@v0.17.0
- golang.org/x/time@v0.5.0
- google.golang.org/api@v0.180.0
- google.golang.org/genproto@v0.0.0-20240401170217-c3f982113cda
- google.golang.org/genproto/googleapis/api@v0.0.0-20240610135401-a8a62080eff3
- google.golang.org/genproto/googleapis/rpc@v0.0.0-20240709173604-40e1e62336c5
- google.golang.org/grpc@v1.65.0
- google.golang.org/protobuf@v1.34.2
- gopkg.in/ini.v1@v1.67.0
- gopkg.in/yaml.v2@v2.4.0
- gopkg.in/yaml.v3@v3.0.1
- gotest.tools/v3@v3.5.1
- nhooyr.io/websocket@v1.8.6
- pgregory.net/rapid@v1.1.0
- sigs.k8s.io/yaml@v1.4.0
build_tags: netgo
commit: 1d52b3d434d5d78561c3628ef351b88890ec7f7f
cosmos_sdk_version: v0.50.9-lsm
go: go version go1.22.6 linux/amd64
name: gaia
server_name: gaiad
version: v19.2.0

Gaia Version

v19.2.0

How to reproduce?

The node will AppHash after some time.

MSalopek commented 1 month ago

Thank you for raising this concern. I'm sorry you are facing issues.

Have you tried any other node (default, pruned) from quicksync?

We can try and get in contact with quicksync and help them debug.

It is odd if this is happening frequently, but we have had reports of apphashes tied to wasm directories ever since we introduced CosmWasm.

If possible, it would be beneficial to try and replicate this on a smaller node (to reduce debugging times). If this issue happens with other quicksync snapshots but not on polkachu or nodestake it could point to a slight misconfiguration in quicksync's export procedure that can be mitigated.

a26nine commented 1 month ago

I forgot to mention, we downloaded Polkachu's pruned snapshot, and it's running fine with the same binary without any issues.

a26nine commented 1 month ago

@MSalopek, did you get a chance to check with the QuickSync team?

mayank-daga commented 1 month ago

@MSalopek, I am also seeing similar issues.

MSalopek commented 1 month ago

ChainLayer has been contacted. Updates will be posted as they reach us.

MSalopek commented 3 weeks ago

The issue seems to be solved on Quicksync's end.

Feel free to resync from the newest snapshot.

@mayank-daga @a26nine

a26nine commented 3 weeks ago

No, it's not resolved. We are running pruned nodes for now.

jgrebowicz-ledger commented 1 week ago

I can confirm that it's not resolved yet. I re-downloaded the archive for 2 of our RPC nodes after the message that it had been fixed on Quicksync's end, but it keeps failing. We've run a pruned node as a backup, but it tends to fail too after a while...

MSalopek commented 1 week ago

Sorry to hear that this is still persisting.

We could provide instructions for a stop-gap solution that you could execute. The solution would require syncing an old gaia node instance and performing upgrades at designated block heights.

Unfortunately, there is no other action we can take here beyond checking in with quicksync to help troubleshoot.

I will keep this issue open and close all other related issues.

jgrebowicz-ledger commented 1 week ago

I'll reach out to you if we decide to follow the stop-gap solution. Unfortunately, we're having problems with apphash no matter which snapshot we use. We still have one node that has been running for a long time on the default snapshot and it runs fine, but if we spin up a new node with the same config and download a new snapshot, it fails after a while. (Actually, it's the same with archival.)

Yesterday we once again downloaded the latest archival snapshot, but it failed after a while:

12:07AM INF finalized block block_app_hash=24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75 height=22838854 module=state num_txs_res=4 num_val_updates=1
12:07AM INF executed block app_hash=24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75 height=22838854 module=state
12:07AM INF updates to validators module=state updates=5A59DC8746FD727FDDD5CBF5CBB90C6F616CCF9B:3596564
12:07AM INF committed state block_app_hash=0261BEC7EC8EFFF3ABB850402C54B78A81D0B4ABAC9418D2DE3E2D495E09AEA6 height=22838854 module=state
12:07AM ERR Error in validation err="wrong Block.Header.AppHash. Expected 24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75, got 05012E467D0657717BD073AE4A25E3F71B3C85BDF1C1FC8AE35B6AE9391CB372" module=blocksync
12:07AM ERR Stopping peer for error err="reactor validation error: wrong Block.Header.AppHash. Expected 24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75, got 05012E467D0657717BD073AE4A25E3F71B3C85BDF1C1FC8AE35B6AE9391CB372" module=p2p peer="Peer{MConn{141.94.73.39:37656} 2bda8bff758a39916a528c6b70eefad9148d09ce out}"
12:07AM
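To find the last height a node processed before it stopped, the key=value fields in these log lines can be extracted with a few lines of Python. This is only a sketch: the helper `parse_fields` and its regular expression are hypothetical, written against the log format shown in this comment, and are not part of any Cosmos tooling.

```python
import re

# A value is either a double-quoted string or a run of non-space characters.
KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_fields(line):
    """Return the key=value fields of one log line as a dict (quotes stripped)."""
    return {key: value.strip('"') for key, value in KV_RE.findall(line)}


# Two of the lines from the log above, copied verbatim.
log_lines = [
    "12:07AM INF finalized block block_app_hash=24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75 height=22838854 module=state num_txs_res=4 num_val_updates=1",
    "12:07AM INF committed state block_app_hash=0261BEC7EC8EFFF3ABB850402C54B78A81D0B4ABAC9418D2DE3E2D495E09AEA6 height=22838854 module=state",
]

for entry in log_lines:
    fields = parse_fields(entry)
    print(fields["height"], fields["module"], fields["block_app_hash"])
```

Recording the height alongside the snapshot that was used gives maintainers a concrete block to start from when trying to reproduce the divergence.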