Open facundomedica opened 2 years ago
Idea I have for debugging LastResultsHash:
Take two data directories and inspect the last block. For both blocks (from both data sets), compare all fields, especially gas used.
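A minimal sketch of that comparison, assuming the per-transaction results of the last block have already been exported from each data directory into plain dictionaries. The field names (code, gas_wanted, gas_used) and the diff_tx_results helper are illustrative assumptions for this sketch, not an existing API:

```python
from itertools import zip_longest

# Fields compared for each transaction result. These names are hypothetical;
# adjust them to whatever your export of the block results produces.
FIELDS = ("code", "gas_wanted", "gas_used")


def diff_tx_results(results_a, results_b, fields=FIELDS):
    """Return (tx_index, field, value_a, value_b) for every differing field."""
    diffs = []
    for i, (a, b) in enumerate(zip_longest(results_a, results_b)):
        if a is None or b is None:
            # One data set has more transaction results than the other.
            diffs.append((i, "<missing>", a, b))
            continue
        for field in fields:
            if a.get(field) != b.get(field):
                diffs.append((i, field, a.get(field), b.get(field)))
    return diffs
```

An empty result means the compared fields agree for every transaction, so the mismatch has to come from a field that was not exported.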
For the other two types, it's a bit more difficult. Perhaps we create a tool that encapsulates all of these tools into a single tool.
For the IAVL Viewer, we've already created a tracking issue here: https://github.com/cosmos/iavl/issues/567 since it was also requested in our Dev UX calls.
About app hash mismatch: one idea is to save the change stream of the last few blocks (based on ADR-038) and compare it with a good node's, which can pinpoint which tx caused the mismatch.
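To illustrate the change-stream idea, here is a rough sketch that walks two recorded streams in order and stops at the first entry that differs. The entry layout (height, tx_index, store_key, key, value) is a made-up representation for this example and is not the ADR-038 wire format:

```python
from itertools import zip_longest


def first_divergence(good_stream, bad_stream):
    """Return (position, good_entry, bad_entry) for the first mismatch, or None.

    Each stream is a list of state writes in execution order. An entry is a
    tuple such as (height, tx_index, store_key, key, value); that layout is
    hypothetical and only needs to be the same on both sides.
    """
    for pos, (good, bad) in enumerate(zip_longest(good_stream, bad_stream)):
        if good != bad:
            # A missing entry on one side shows up as None here.
            return pos, good, bad
    return None
```

Because writes are compared in execution order, the tx_index of the first differing entry points at the earliest transaction whose effects diverged, rather than at later writes that only differ as a consequence.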
What I was thinking is that we have a tool/binary that is provided two data directories and returns to you debugging output for each of the three types of mismatches.
LastResultsHash: it inspects the gas used between the two states and reports a diff if any
ConsensusHash: it inspects the entire structures between the two states and reports a diff if any
AppHash: it reports the module hashes of each module between the two states and execution order of txs and begin/end block
This is to give the developer or operator a high-level overview for where to start looking. In the end, further in depth analysis will still be required, especially in the case of AppHash mismatch.
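As a rough illustration of that first triage step, the sketch below compares the three header hashes from two nodes and reports which ones differ, together with the next step suggested above. The headers are assumed to be plain dictionaries, and the field names and the mismatched_hashes helper are assumptions for this example:

```python
# Header fields to compare, in the order discussed above. The snake_case
# names are an assumption about how the headers were exported.
HASH_FIELDS = ("last_results_hash", "consensus_hash", "app_hash")

# Where to look next for each kind of mismatch, paraphrasing the list above.
NEXT_STEP = {
    "last_results_hash": "compare the gas used between the two states",
    "consensus_hash": "diff the entire structures between the two states",
    "app_hash": "compare module hashes and the execution order of txs",
}


def mismatched_hashes(header_a, header_b):
    """Return the names of the hash fields that differ between two headers."""
    return [f for f in HASH_FIELDS if header_a.get(f) != header_b.get(f)]


def report(header_a, header_b):
    """Build one human-readable line per mismatching hash."""
    return [f"{f}: {NEXT_STEP[f]}" for f in mismatched_hashes(header_a, header_b)]
```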
@alexanderbez I'm looking into this. Do you have a way to reproduce or fetch two (or more) representative data directories, to demonstrate the usefulness of the tool?
The initial release of the chdbg tool can now help diagnose ConsensusHash differences:
go run github.com/orijtech/chdbg bns-a.db bns-b.db
chdbg: hash mismatch: 96AAD58DBDF2BA87D90BE1F620E80AC3D1662B5113A7667B51303596163A5969 != 56E581EBD9C0A3D726A91579839F7FF8A9251BEB063FDF0FA0415A0B3429DF6E
chdbg: key _i.bchnft_owner:4C97A7423B1782D7C8CAB362247B848DEC96B1EC: key proofs differ
chdbg: key _i.bchnft_owner:E28AE9A6EB94FC88B73EB7CBD6B87BF93EB9BEF0: key proofs differ
chdbg: key _i.tkrnft_owner:E28AE9A6EB94FC88B73EB7CBD6B87BF93EB9BEF0: key proofs differ
chdbg: key _i.usrnft_chainaddr:1152542575310734325L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:12256717727036376470L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:14285752342776807606L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:177168082075485743L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:2980033962229439650L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:3070406526139113375L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdbg: key _i.usrnft_chainaddr:8565302995323734695L;da3ed6a45429278bac2666961289ca17ad86595d33b31037615d4b8e8f158bba: key proofs differ
chdgb: ... (additional diffs omitted)
chdbg: database mismatch at version 190258 with 88 differences
exit status 2
I'm looking forward to seeing examples of LastResultsHash or AppHash mismatches to develop the tool further.
@elias-orijtech are you able to open a PR adding the tool to the tools directory?
I can, but I suggest keeping it out of tools until we're happy with its UX and feature set. Personally, I'd like to see more chain halts that I can analyze and use to refine the tool.
Would love others to chime in, but personally I think adding it to the SDK or a tools repo, where users can see it and analyse it, will get you feedback. Right now it's hard to get feedback when users don't know about it.
The first feedback is that this is only for iavl, but the issue talks about three issues: LastResultsHash, ConsensusHash and AppHash. The tool seems to only work with app hash issues, and it's unclear how a user will identify which module the issue comes from. It's not hard to recreate chain halts and test that way too.
Then perhaps a good time to put this into wider use is when the tool can debug all three types of chain halts? That is, when this issue can be closed in favor of specific issues in the tool.
If you have the time, please sketch a way to achieve realistic chain halts of the 3 interesting variants mentioned.
I don't think I've ever seen consensus hash issues.
One tool @yihuang worked on and used: https://github.com/crypto-com/python-iavl (it's a step up from iavl-viewer).
And a howto tutorial on probing app hash mismatch issues with iavl-viewer, written by @mmsqe and @JayT106 :
The current Cosmos SDK stores the app data with the iavl tree structure. Therefore, we need to use the iavl tooling to retrieve the application data.
CGO_CFLAGS="-I/usr/local/include/rocksdb" CGO_LDFLAGS="-L/usr/local/lib/rocksdb -lrocksdb -lstdc++ -lm -lz -lbz2 -lsnappy -llz4 -lzstd"\
go build -tags rocksdb ./cmd/iaviewer/
This requires rocksdb to be installed on the system: download rocksdb and run
make install
curl -O https://raw.githubusercontent.com/Homebrew/homebrew-core/5a6d7658c8686b3326f69c5dd11d08800586ad9c/Formula/rocksdb.rb && brew install rocksdb.rb
CGO_CFLAGS="-I/usr/local/Cellar/rocksdb/6.29.3/include" \
CGO_LDFLAGS="-L/usr/local/Cellar/rocksdb/6.29.3/lib -lrocksdb -lstdc++ -lm -lz -lbz2 -lsnappy -llz4 -lzstd -L/usr/local/Cellar/snappy/1.1.9/lib -L/usr/local/Cellar/lz4/1.9.3/lib/ -L /usr/local/Cellar/zstd/1.5.2/lib/" \
go build -tags rocksdb ./cmd/iaviewer/
./iaviewer [data/shape/versions/balance/nonce] [application.db path] [s/k:module name/] <version> <addr>
e.g.
export VER=1600000
export ADDR=57B4B1d6ecC292910840CEdeDE87884b254d4738
./iaviewer balance /chain/.cronosd/data/application.db/ "s/k:bank/" $VER $ADDR
The version and addr arguments are required depending on the subcommand: in the examples here, balance takes both version and addr, while data takes only version.
If you are not sure which storekey to use for a module in the project or the Cosmos SDK, you can usually find it in x/[module]/key.go in the project or the Cosmos SDK.
These are some of the module store keys whose data we can check:
"s/k:bank/" "s/k:evm/" "s/k:acc/" "s/k:ibc/"
Usually you need two data sets to compare: one with a normal state and one suspected of having an incorrect state, which causes the apphash mismatch. Probe both data sets, then diff the output to find which keys are different. You can also compare the results with a chain explorer, which can query the chain through RPC calls.
./iaviewer data /data2/data/application.db/ $MOD $VER > data_ibc_control ; ./iaviewer data /chain/.cronosd/data/application.db/ $MOD $VER > data_ibc_normal
diff data_ibc_control data_ibc_normal
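When the plain diff output gets long, the same comparison can be done per key. The sketch below assumes each dump has already been parsed into a mapping from key to value hash; the parsing step is left out because the dump format depends on the iaviewer build, and the diff_store_dumps name is made up for this example:

```python
def diff_store_dumps(control, normal):
    """Group keys by how they differ between two {key: value_hash} mappings."""
    control_keys = set(control)
    normal_keys = set(normal)
    return {
        # Keys that exist in only one of the two data sets.
        "only_in_control": sorted(control_keys - normal_keys),
        "only_in_normal": sorted(normal_keys - control_keys),
        # Keys present in both data sets whose value hashes differ.
        "changed": sorted(
            k for k in control_keys & normal_keys if control[k] != normal[k]
        ),
    }
```

Separating missing keys from changed ones helps tell a write that never happened apart from a write that produced a different value.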
We might get a diff like this. It shows that the accounts have different balances at height 1603101; these are the keys:
02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F
02143B368AF83F84A63E7A1E56715EBAAA9351A6DABD6261736563726F
02145C7F8A570D578ED84E63FDFA7B1EE72DEAE1AE236261736563726F
0214F1829676DB577682E944FC3493D451B67FF3E29F6261736563726F
We ignore the values because each one is shown as a hash of the actual value, so we cannot tell what the real value is.
The key 02143299D5EEE1934480072E21D6747DABE7B4D4A73D6261736563726F is a combination of prefix, address, and denom. You need to check the implementation to know how it is stored: 0214 is the prefix, 3299D5EEE1934480072E21D6747DABE7B4D4A73D is the account, and 6261736563726F is the denom (basecro in this case).
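The split described above can be sketched as a small helper. It assumes the layout of this one example, a 2-byte prefix followed by a 20-byte account and an ASCII denom; split_balance_key is a hypothetical name, and the real key layout should be checked against the bank module of the SDK version in use:

```python
PREFIX_LEN = 2  # bytes; "0214" in the example key
ADDR_LEN = 20   # bytes; the account part of the example key
# Note: 0x14 is 20 in decimal, so the second prefix byte may be an address
# length marker. That is an assumption, not something this sketch relies on.


def split_balance_key(hex_key):
    """Split a hex-encoded bank balance key into (prefix, account, denom)."""
    raw = bytes.fromhex(hex_key)
    if len(raw) <= PREFIX_LEN + ADDR_LEN:
        raise ValueError("key too short for prefix + account + denom")
    prefix = raw[:PREFIX_LEN]
    account = raw[PREFIX_LEN:PREFIX_LEN + ADDR_LEN]
    denom = raw[PREFIX_LEN + ADDR_LEN:]
    # The denom bytes are plain ASCII text in the example (basecro).
    return prefix.hex().upper(), account.hex().upper(), denom.decode("ascii")
```

Decoding the account part this way makes it easier to match a differing bank key against addresses seen in other modules or on a chain explorer.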
If you compare the evm module, you can see that some keys are different, e.g.
WBTC
0x062E66477Faf219F25D27dCED647BF57C3107d52
Crona LPs (Crona-LP)
0x285a569EDD6210a0410883d2E29471A6B0c7790d
Wrapped CRO (WCRO)
0x5C7F8A570d578ED84E63fdFA7b1eE72dEae1AE23
Crona LPs (Crona-LP)
0x5cc953f278bf6908B2632c65D6a202D6fd1370f9
Crona LPs (Crona-LP)
0xb4684F52867dC0dDe6F931fBf6eA66Ce94666860
USD Coin (USDC, CronosCRC20)
0xc21223249CA28397B4B6541dfFaEcC539BfF0c59
Wrapped Ether (WETH)
0xe44Fd7fCb2b1581822D0c862B68222998a0c299a
So you can guess which transaction might relate to these accounts.
The current Tendermint (v0.34.x) panics when it detects an apphash mismatch, so it's difficult to start a node service, load the problem data set, and then use RPC calls to check the data status, which would show more details (for example, calling eth_getTransactionReceipt to get the transaction result from the problem data set). Currently, we hacked Tendermint to bypass the handshake step. You can build the application with this
Tendermint v0.35.x has another way to probe the data details, but it hasn't been integrated into the Cosmos SDK (v0.46) yet. Ref (TODO: update how to use inspect CLI in the Cosmos SDK)
Summary
When a chain halts, we need to get as much information as possible, as easily as possible.
Problem Definition
Currently, the usual way to debug a chain halt is not that easy. We should provide tools and guides on how to debug and pinpoint the root cause of the most common failures.
Proposal
Provide tools and guides for solving:
wrong Block.Header.LastResultsHash.
wrong Block.Header.ConsensusHash.
wrong Block.Header.AppHash.
I think those are the most common errors for chain halts (besides panics).
Some stuff that comes to mind: