cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

Improve debuggability of chain halts #13404

Open facundomedica opened 2 years ago

facundomedica commented 2 years ago

Summary

When a chain halts, we need to get as much information as possible, as easily as possible.

Problem Definition

Currently, the usual way to debug a chain halt is not that easy. We should provide tools and guides on how to debug and pinpoint the root cause of the most common failures.

Proposal

Provide tools and guides for solving AppHash, LastResultsHash, and ConsensusHash mismatches.

I think those are the most common errors for chain halts (besides panics).

Some stuff that comes to mind:

alexanderbez commented 2 years ago

Idea I have for debugging LastResultsHash:

Take two data directories and inspect the last block. For both blocks (from both data sets), compare all fields, especially gas used.
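A minimal sketch of that comparison, assuming the per-transaction results of the last block have already been exported from each data directory as a list of dicts. The export step and the field names `code`, `gas_used`, and `data` are assumptions for illustration, not the format of an existing tool:

```python
def diff_tx_results(results_a, results_b):
    """Return (tx_index, field, value_a, value_b) for every field that differs."""
    diffs = []
    if len(results_a) != len(results_b):
        diffs.append((None, "tx_count", len(results_a), len(results_b)))
    for i, (a, b) in enumerate(zip(results_a, results_b)):
        # Compare the union of fields so a field missing on one side is reported too.
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                diffs.append((i, field, a.get(field), b.get(field)))
    return diffs


# Made-up results for the same block as seen by two nodes.
node_a = [
    {"code": 0, "gas_used": 61234, "data": "0a"},
    {"code": 0, "gas_used": 80001, "data": ""},
]
node_b = [
    {"code": 0, "gas_used": 61234, "data": "0a"},
    {"code": 0, "gas_used": 80417, "data": ""},
]

for d in diff_tx_results(node_a, node_b):
    print(d)
```

With this made-up input, the only reported difference is `gas_used` on the second transaction, which is where the investigation would start.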

For the other two types, it's a bit more difficult. Perhaps we create a single tool that encapsulates all of these.

julienrbrt commented 1 year ago

For the IAVL Viewer, we've already created a tracking issue here: https://github.com/cosmos/iavl/issues/567, as it was also requested in our DEV UX calls.

yihuang commented 1 year ago

About app hash mismatch, one idea is to save the change stream of the last few blocks (based on adr-038) and compare it with that of a good node; this can pinpoint which tx caused the mismatch.
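A rough sketch of the comparison step, assuming each node's change stream has been decoded into an ordered list of tuples. The tuple layout `(height, tx_index, store, key_hex, value_hex)` is hypothetical and is not the adr-038 wire format:

```python
from itertools import zip_longest


def first_divergence(good, bad):
    """Return (position, good_entry, bad_entry) of the first differing change, or None."""
    # zip_longest pads the shorter stream with None, so a missing write is reported too.
    for pos, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return pos, g, b
    return None


# Each entry: (height, tx_index, store, key_hex, value_hex). All values are made up.
good = [
    (100, 0, "bank", "0201", "0a"),
    (100, 1, "bank", "0202", "0b"),
    (100, 1, "evm", "0301", "ff"),
]
bad = [
    (100, 0, "bank", "0201", "0a"),
    (100, 1, "bank", "0202", "0c"),
    (100, 1, "evm", "0301", "ff"),
]

print(first_divergence(good, bad))
```

Because the streams are ordered, the first differing entry points at the block and tx where the two nodes started to disagree; later differences may only be consequences of that one.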

alexanderbez commented 1 year ago

What I was thinking is that we have a tool/binary that is given two data directories and returns debugging output for each of the three types of mismatches.

This is to give the developer or operator a high-level overview of where to start looking. In the end, further in-depth analysis will still be required, especially in the case of AppHash mismatch.
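As an illustration of the first step such a tool could take, here is a minimal sketch that reports which of the three header hashes differ between two nodes. It assumes the block headers have already been loaded as dicts using the `app_hash`, `last_results_hash`, and `consensus_hash` field names from the Tendermint header JSON; loading them from a data directory is out of scope here, and the hash values are made up:

```python
HASH_FIELDS = ("app_hash", "last_results_hash", "consensus_hash")


def classify_mismatch(header_a, header_b):
    """Return the names of the header hash fields that differ between two nodes."""
    return [f for f in HASH_FIELDS if header_a.get(f) != header_b.get(f)]


# Made-up headers for the same height on two nodes.
header_a = {"app_hash": "AA11", "last_results_hash": "BB22", "consensus_hash": "CC33"}
header_b = {"app_hash": "AA99", "last_results_hash": "BB22", "consensus_hash": "CC33"}

print(classify_mismatch(header_a, header_b))
```

The result only says where to start looking; each mismatch type still needs its own follow-up analysis.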

eliasnaur commented 1 year ago

@alexanderbez I'm looking into this. Do you have a way to reproduce or fetch two (or more) representative data directories, to demonstrate the usefulness of the tool?

elias-orijtech commented 1 year ago

The initial release of the chdbg tool can now help diagnose ConsensusHash differences:

go run github.com/orijtech/chdbg bns-a.db bns-b.db
chdbg: hash mismatch: 96AAD58DBDF2BA87D90BE1F620E80AC3D1662B5113A7667B51303596163A5969 != 56E581EBD9C0A3D726A91579839F7FF8A9251BEB063FDF0FA0415A0B3429DF6E
chdbg: key _i.bchnft_owner:4C97A7423B1782D7C8CAB362247B848DEC96B1EC: key proofs differ
chdbg: key _i.bchnft_owner:E28AE9A6EB94FC88B73EB7CBD6B87BF93EB9BEF0: key proofs differ
chdbg: key _i.tkrnft_owner:E28AE9A6EB94FC88B73EB7CBD6B87BF93EB9BEF0: key proofs differ
chdbg: key _i.usrnft_chainaddr:1152542575310734325L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:12256717727036376470L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:14285752342776807606L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:177168082075485743L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:2980033962229439650L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:3070406526139113375L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:8565302995323734695L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdgb: ... (additional diffs omitted)
chdbg: database mismatch at version 190258 with 88 differences
exit status 2

I'm looking forward to seeing examples of LastResultsHash or AppHash mismatches to develop the tool further.

tac0turtle commented 1 year ago

@elias-orijtech are you able to open a PR adding the tool to the tools directory?

elias-orijtech commented 1 year ago

I can, but I suggest keeping it out of tools until we're happy with its UX and feature set. Personally, I'd like to see more chain halts that I can analyze and use to refine the tool.

tac0turtle commented 1 year ago

Would love others to chime in, but personally I think adding it to sdk or a tools repo so users can see it and analyse it will get you feedback. Right now it's hard to get feedback when users don't know about it.

The first feedback is that this is only for iavl, but the issue talks about three issues: lastresulthash, consensus hash and app hash. The tool seems to only work with app hash issues, but it's unclear how a user will identify which module the issue comes from. It's not hard to recreate chain halts and test that way too.

elias-orijtech commented 1 year ago

Then perhaps a good time to put this into wider use is when the tool can debug all three types of chain halts? That is, when this issue can be closed in favor of specific issues in the tool.

If you have the time, please sketch a way to achieve realistic chain halts of the 3 interesting variants mentioned.

ValarDragon commented 1 year ago

I don't think I've ever seen consensus hash issues.

tomtau commented 1 year ago

One tool @yihuang worked on and used: https://github.com/crypto-com/python-iavl. It's a step up from iavl-viewer.

tomtau commented 1 year ago

And a howto tutorial on probing app hash mismatch issues with iavl-viewer, written by @mmsqe and @JayT106 :

Probe app.hash mismatch issue

The current Cosmos SDK stores the app data in an iavl tree structure. Therefore, we need to use the iavl tooling to retrieve the application data.

Pre-requisites

  1. Install iaviewer.
  2. Check the backend db type, e.g. rocksdb, leveldb, or others.
  3. The default iaviewer uses leveldb as the backend db. To build with rocksdb support, use the customized branch https://github.com/JayT106/iavl/tree/rocksdb-support (TODO: clean up and submit PR to the upstream repo) and run
    CGO_CFLAGS="-I/usr/local/include/rocksdb" CGO_LDFLAGS="-L/usr/local/lib/rocksdb -lrocksdb -lstdc++ -lm -lz -lbz2 -lsnappy -llz4 -lzstd" \
    go build -tags rocksdb ./cmd/iaviewer/

    This requires rocksdb to be installed on the system. Download rocksdb and run

    make install

    macOS

  4. Install rocksdb 6.29.3
    curl -O https://raw.githubusercontent.com/Homebrew/homebrew-core/5a6d7658c8686b3326f69c5dd11d08800586ad9c/Formula/rocksdb.rb && brew install rocksdb.rb
  5. Build with -tags rocksdb
    CGO_CFLAGS="-I/usr/local/Cellar/rocksdb/6.29.3/include" \
    CGO_LDFLAGS="-L/usr/local/Cellar/rocksdb/6.29.3/lib -lrocksdb -lstdc++ -lm -lz -lbz2 -lsnappy -llz4 -lzstd -L/usr/local/Cellar/snappy/1.1.9/lib -L/usr/local/Cellar/lz4/1.9.3/lib/ -L/usr/local/Cellar/zstd/1.5.2/lib/" \
    go build -tags rocksdb ./cmd/iaviewer/

Load application.db with iaviewer (built with rocksdb support)

./iaviewer [data/shape/versions/balance/nonce] [application.db path] [s/k:module name/] <version> <addr>

e.g.
export VER=1600000
export ADDR=57B4B1d6ecC292910840CEdeDE87884b254d4738
./iaviewer balance /chain/.cronosd/data/application.db/ "s/k:bank/" $VER $ADDR

Arguments details:

Modules:

If you are not sure which store key a module uses, you can usually find it in x/[module]/key.go in the project or in the Cosmos SDK. These are some of the modules whose data status we can check:

"s/k:bank/" "s/k:evm/" "s/k:acc/" "s/k:ibc/"

Compare data set

Usually, you need two data sets to compare: one with a normal state and one suspected of having the incorrect state that causes the apphash mismatch. Probe both data sets and diff the output to find which keys differ. You can also cross-check against a chain explorer, which can query the chain through rpc calls.

./iaviewer data /data2/data/application.db/ $MOD $VER > data_ibc_control ; ./iaviewer data /chain/.cronosd/data/application.db/ $MOD $VER > data_ibc_normal

diff data_ibc_control data_ibc_normal
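When the dumps are large, it can help to reduce them to the set of keys that differ. The sketch below assumes each dump line has the form `<key_hex> <value_hash_hex>`. That line format is an assumption for illustration, so `parse_dump` has to be adapted to the actual iaviewer output:

```python
def parse_dump(text):
    """Parse lines of '<key_hex> <value_hash_hex>' into a dict (assumed format)."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or unrelated lines
            entries[parts[0].upper()] = parts[1].upper()
    return entries


def diff_dumps(control, normal):
    """Return keys only in control, keys only in normal, and keys whose value hash differs."""
    only_control = sorted(control.keys() - normal.keys())
    only_normal = sorted(normal.keys() - control.keys())
    changed = sorted(k for k in control.keys() & normal.keys() if control[k] != normal[k])
    return only_control, only_normal, changed


# Made-up dump contents standing in for data_ibc_control and data_ibc_normal.
control = parse_dump("0201 AAAA\n0202 BBBB\n0203 CCCC\n")
normal = parse_dump("0201 AAAA\n0202 BBBC\n0204 DDDD\n")

print(diff_dumps(control, normal))
```

Separating keys that exist on only one side from keys whose value hash changed makes it easier to tell a missing write apart from a write with a different value.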

We might get a diff like this. It shows that some accounts have different balances at height 1603101; these are the differing keys:

02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F
02143B368AF83F84A63E7A1E56715EBAAA9351A6DABD6261736563726F
02145C7F8A570D578ED84E63FDFA7B1EE72DEAE1AE236261736563726F
0214F1829676DB577682E944FC3493D451B67FF3E29F6261736563726F

We ignore the values because the output only shows a hash of each value, so we cannot tell what the real value is. The key 02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F is composed of a prefix, an address, and a denom. You need to check the implementation to know how it is stored. 0214 - prefix, 3299D5EEE1934480072E21D6747DABE7B4D4A73D - account, 6261736563726F - denom (in this case basecro)
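The split described above can be sketched as follows. Reading the second byte (0x14 = 20) as the address length is an assumption that matches the 20-byte addresses in these keys; check the bank module's key layout for the SDK version in use:

```python
def decode_balance_key(key_hex):
    """Split a bank balance key into (prefix, address_hex, denom)."""
    raw = bytes.fromhex(key_hex)
    prefix = raw[0]      # 0x02 in the keys shown above
    addr_len = raw[1]    # 0x14 == 20, treated here as a length byte (assumption)
    address = raw[2:2 + addr_len]
    denom = raw[2 + addr_len:].decode("ascii")
    return prefix, address.hex().upper(), denom


print(decode_balance_key("02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F"))
```

Applied to the third key in the list, the decoded address is 5C7F8A570D578ED84E63FDFA7B1EE72DEAE1AE23, which is the Wrapped CRO (WCRO) contract address that also shows up in the evm comparison.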

If you compare the evm module, you can see that some keys are different, e.g.

WBTC
0x062E66477Faf219F25D27dCED647BF57C3107d52
Crona LPs (Crona-LP)
0x285a569EDD6210a0410883d2E29471A6B0c7790d
Wrapped CRO (WCRO)
0x5C7F8A570d578ED84E63fdFA7b1eE72dEae1AE23
Crona LPs (Crona-LP)
0x5cc953f278bf6908B2632c65D6a202D6fd1370f9
Crona LPs (Crona-LP)
0xb4684F52867dC0dDe6F931fBf6eA66Ce94666860
USD Coin (USDC, CronosCRC20)
0xc21223249CA28397B4B6541dfFaEcC539BfF0c59
Wrapped Ether (WETH)
0xe44Fd7fCb2b1581822D0c862B68222998a0c299a

So you can guess which transaction might relate to these accounts.

Query data from the node through rpc

The current Tendermint (v0.34.x) panics when it detects an apphash mismatch, so it is difficult to start a node service, load the problem data set, and then use rpc calls to check the data status, even though that would show more details. For example, you could call eth_getTransactionReceipt to get the transaction result from the problem data set. For now, we hacked tendermint to bypass the handshake step. You can build the application with this

Tendermint v0.35.x has another way to probe the data details, but it has not yet been integrated into the Cosmos SDK (v0.46). Ref (TODO: update how to use inspect CLI in the Cosmos SDK)