cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

[Epic]: Rolling Upgrades #18523

Open tac0turtle opened 9 months ago

tac0turtle commented 9 months ago

Summary

The Cosmos SDK upgrade process has required validators to either use cosmovisor, be present at the time of the upgrade, or rely on a third-party tool to start the new binary while shutting down the old one. Much of the ecosystem has become accustomed to this method, but it has increased the maintenance burden on application developers.

Secondly, if you would like to sync from genesis, it is a mess to get all the right versions lined up in cosmovisor. Even then, it's unclear whether the binaries will work as intended, and that assumes no emergency binary was issued by the team. This leads to many people not being able to sync from genesis on newer chains.

Note: If node operators run archive nodes, it is not possible to query old versions through the running binary. A second or third binary needs to be provided in order to query the old state.

For the reasons listed above and those not listed, we would like to explore rolling upgrades.

A rolling upgrade is when node operators can upgrade binaries ahead of time, allowing the chain to upgrade on its own without intervention by the developers or node operators. This will simplify the operation of a node, allow node operators to sync from genesis, and allow historical versions to be run without needing to operate many different binaries.

Goals

The goals of this work are:

Problem Definition

Upgrades are cumbersome for node operators, from being awake at all hours of the day for an upgrade to making sure the upgrade happens at the correct time. Application developers carry the added burden of maintaining historical binaries and hoping that the block protocol will not change from version to version.

Work Breakdown

As we have adopted protobuf in the Cosmos SDK, there are some gotchas with how this can be done.

We should work on a few demos in different directions for how to achieve many different app versions. This will help influence the final design.

This is meant as a tracking issue and will be updated once we are ready to begin this work.

alexanderbez commented 9 months ago

Re: Backwards Compatibility, this would only be true up until the rolling upgrade switches over to the new binary, at which point the queries would no longer work. Is my understanding correct?

tac0turtle commented 9 months ago

That is incorrect. There is no switching of binaries at upgrade height. The goal is to allow a single binary to run multiple versions of the app. This would enable users to query historical values with ease.

alexanderbez commented 9 months ago

The issue description doesn't allude to how this would be done nor does it make it clear that binaries aren't switched. How do you have multiple versions of the app in the same process?

robert-zaremba commented 9 months ago

I'm happy with cosmovisor. It's also more in keeping with the Unix spirit. TBH, I don't see a case where cosmovisor doesn't work. If needed, the core team, validators, or the community can keep track of the cosmovisor directory tree and provide a script to download or build all necessary versions.

robert-zaremba commented 9 months ago

In fact a script should be enough:

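repo_dir=<repo dir>
# Clone into $repo_dir and enter it, so the checkouts below run inside the repo.
git clone <repo> "$repo_dir"
cd "$repo_dir"

# Build each release and copy the binary into its own upgrade directory
# (created next to $repo_dir).
git checkout <release/1>
make build
mkdir -p ../<upgrade1>
cp build/<app> ../<upgrade1>/

git checkout <release/2>
make build
mkdir -p ../<upgrade2>
cp build/<app> ../<upgrade2>/

....

alexanderbez commented 9 months ago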

I think it will be very challenging to do this in a single parent process without using height gates. Height gates are pretty much the only way unless you have child process management. Cosmovisor works well, I agree.

Looking forward to seeing what cool ideas @tac0turtle and others come up with.

tac0turtle commented 9 months ago

Height gates will be needed, yes. That is the only way. Cosmovisor is horrible UX, and you are hoping that the p2p block sync protocol does not break. If it does, there will be no way to resync from genesis. I mentioned this above.
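
For illustration only, a height gate inside a single binary could look roughly like the sketch below. It is not the SDK's design and nothing in this thread specifies one. The names `AppVersion`, `VersionRouter`, `NewVersionRouter`, and `VersionAt`, as well as the example heights, are hypothetical. The idea is that every app version compiled into the binary is registered with the first height at which it is active, and both block execution and historical queries look up the version by height.

```go
package main

import (
	"fmt"
	"sort"
)

// AppVersion is a hypothetical handle for one version of the application
// logic compiled into the same binary.
type AppVersion struct {
	Name string
}

// gate marks the first block height at which a version becomes active.
type gate struct {
	startHeight int64
	version     AppVersion
}

// VersionRouter maps block heights to app versions (hypothetical type).
type VersionRouter struct {
	gates []gate // sorted by startHeight, ascending
}

// NewVersionRouter builds a router from "start height -> version" entries.
func NewVersionRouter(entries map[int64]AppVersion) *VersionRouter {
	r := &VersionRouter{}
	for h, v := range entries {
		r.gates = append(r.gates, gate{startHeight: h, version: v})
	}
	sort.Slice(r.gates, func(i, j int) bool {
		return r.gates[i].startHeight < r.gates[j].startHeight
	})
	return r
}

// VersionAt returns the version active at the given height, i.e. the gate
// with the largest startHeight that is <= height.
func (r *VersionRouter) VersionAt(height int64) (AppVersion, error) {
	// Find the index of the first gate that starts after the requested height.
	i := sort.Search(len(r.gates), func(i int) bool {
		return r.gates[i].startHeight > height
	})
	if i == 0 {
		return AppVersion{}, fmt.Errorf("no app version registered for height %d", height)
	}
	return r.gates[i-1].version, nil
}

func main() {
	// Example upgrade heights are made up for the sketch.
	router := NewVersionRouter(map[int64]AppVersion{
		1:      {Name: "v1"},
		100000: {Name: "v2"},
		250000: {Name: "v3"},
	})
	for _, h := range []int64{1, 99999, 100000, 300000} {
		v, err := router.VersionAt(h)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("height %d -> %s\n", h, v.Name)
	}
}
```

With a lookup like this, a node syncing from genesis would execute early blocks with the old logic and switch at each gate without restarting, and a query for an old height would be served by the version that produced that state. How the versions themselves coexist in one process (for example, conflicting protobuf registrations) is the hard part and is not addressed by this sketch.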