cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.27k stars 3.63k forks source link

[Feature]: rollback store version based off of abci handshake #15687

Closed tac0turtle closed 1 year ago

tac0turtle commented 1 year ago

Summary

With the adoption of abci 2.0, the comet team has moved apphash from commit to finalize block. This came with valid reasoning but it changes an assumption with how IAVL and potential other merkle trees may work.

Problem Definition

in the current design of iavl, we write to disk then compute the hash, this goes against what is expected by comet. The assumption made was that we would compute the hash in memory then write to disk in the commit phase. While this is a safe assumption in a minimal application, in a complex application current state size could be fairly large, putting a large burden on node operators.

Proposal

After speaking with the comet team, @sergio-mena mentioned that the store height of comet is provided to the application in the abci handshake. This provides a way for the sdk to know what height comet is at and how it compares to the height stored in iavl. If the height stored in iavl is higher than comets, we should rollback to the height which comet has. This will allow comet to replay blocks against the application in order to catchup or recover from a crash.

@sergio-mena is there anything you would like to add?

cc @cool-develope @yihuang

alexanderbez commented 1 year ago

Makes sense, but I'm missing what this has to do with getting the app hash in FinalizeBlock vs Commit?

tac0turtle commented 1 year ago

this is only needed for crash recovery, so if the node crashes before commit we already wrote to disk, so we would see our height is higher than comets and rollback for the replay

yihuang commented 1 year ago

The tricky thing with rollback is fast node index, otherwise, it's trivial.

alexanderbez commented 1 year ago

I think we might get rid of fast node, right @tac0turtle?

sergio-mena commented 1 year ago

Thanks @tac0turtle for the write-up. I agree with the reasoning in general. The one thing I'd like to mention is a quirk in ABCI. At handshake (after recovery), it is actually CometBFT the one receiving the height (and hash) from the App (and not the other way around). Hence my suggestion that you might want to:

With this, you'll be sure to never be ahead of CometBFT, and so the crash-recovery mechanism will work in all cases.

tac0turtle commented 1 year ago

I think we might get rid of fast node, right @tac0turtle?

in store v2 yes, but dont think that will land in the Eden release.

Talking with @cool-develope we may have a way around needing this

sergio-mena commented 1 year ago

Awesome work! Thanks for this!