cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.16k stars 3.57k forks source link

[Bug]: [Cosmovisor] Upgrade is applied immediately when using the --upgrade-height flag #19227

Closed dasanchez closed 2 weeks ago

dasanchez commented 7 months ago

Is there an existing issue for this?

What happened?

Cosmovisor restarts with the new binary immediately after running add-upgrade with the --upgrade-height flag.

Cosmos SDK Version

Cosmovisor v1.5.0

How to reproduce?

  1. Start a Gaia v13 chain with Cosmovisor v1.5.0
  2. Download a Gaia v14 binary to gaiad
  3. Run cosmovisor add-upgrade v14 gaiad --upgrade-height 300 at any point before reaching block 300.
  4. The node will start right away with the downloaded binary.

I used this script to run steps 2 and 3:

#!/bin/bash

export DAEMON_NAME=gaiad
export DAEMON_HOME=/home/gaia/.gaia
export DAEMON_DATA_BACKUP_DIR=/home/gaia/backup
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."
wget https://github.com/cosmos/gaia/releases/download/v14.0.0-rc0/gaiad-v14.0.0-rc0-linux-amd64 -q -O gaiad
chmod +x gaiad
command="cosmovisor add-upgrade v14 gaiad --upgrade-height 300"
echo "$command"
$command
echo "Waiting ten seconds..."
sleep 10
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."

Upgrade info:

cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json 
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}

Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2) Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)

julienrbrt commented 7 months ago

Can you dump the upgrade info here too? And the sdk versions as well.

dasanchez commented 7 months ago

@julienrbrt Upgrade info:

cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json 
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}

Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2) Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)

julienrbrt commented 7 months ago

Thanks, last question, what gives gaiad status just before add-upgrade ?

dasanchez commented 7 months ago

@julienrbrt gaiad status returns this before running add-upgrade:

{"NodeInfo":{"protocol_version":{"p2p":"8","block":"11","app":"0"},"id":"af3e918c3e2533ceb3be8c48c3068bf6682fb67c","listen_addr":"tcp://0.0.0.0:26656","network":"testnet","version":"0.34.29","channels":"40202122233038606100","moniker":"cosmos-node","other":{"tx_index":"on","rpc_address":"tcp://0.0.0.0:26657"}},"SyncInfo":{"latest_block_hash":"760D16D9D591C769184A2A4F659AB730C80494BD84633FC0451EFABE524307F4","latest_app_hash":"67F695FF4B37D5D12F9A64B83F32679DE289F3AC3A62427CC2EEFC84B61C1634","latest_block_height":"8","latest_block_time":"2024-01-24T19:12:39.769313361Z","earliest_block_hash":"067188F2ADF5931924FD7562506F7DF457C2329F2A1AF00BB777E12FEA273A29","earliest_app_hash":"E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855","earliest_block_height":"1","earliest_block_time":"2024-01-24T17:08:52.990568831Z","catching_up":false},"ValidatorInfo":{"Address":"51A4AFE4B661E141691029CABF023E3C976FF28E","PubKey":{"type":"tendermint/PubKeyEd25519","value":"kkk4KLWvqb8AKcJJcO5VpBvbOqfds7XqqZ6HRxzZPz0="},"VotingPower":"8000"}}
julienrbrt commented 7 months ago

Weird, then it should be parsed properly: https://github.com/cosmos/cosmos-sdk/blob/main/tools/cosmovisor/scanner.go#167

fmorency commented 3 months ago

I also got bitten by that issue!

julienrbrt commented 3 months ago

I also got bitten by that issue!

:/ I am re-prioritizing cosmovisor work. I'll re-investigate this.

steven-varga commented 1 month ago

I am observing the same behaviour with dydxv4 cosmos scheduled for Jul 25th 2024, 07:50:07+00:00 UTC (in 10 hours) Anyone happened to know if downgrade is an option?

BillyJunID commented 1 month ago

I have encountered this issue with the Warden testnet as well. command used:

DAEMON_NAME=wardend
DAEMON_HOME=/home/warden/.warden
UPGRADE_HEIGHT=1534500

rm -rf ${HOME}/upgrade
mkdir ${HOME}/upgrade
cd ${HOME}/upgrade
wget https://github.com/warden-protocol/wardenprotocol/releases/download/v0.4.0/wardend_Linux_x86_64.zip
unzip wardend_Linux_x86_64.zip
rm wardend_Linux_x86_64.zip CHANGELOG.md LICENSE README.md
chmod +x ${HOME}/upgrade/wardend
cosmovisor add-upgrade "v040" ${HOME}/upgrade/wardend --upgrade-height ${UPGRADE_HEIGHT}

After executing cosmovisor add-upgrade, the node immediately restarts and uses the new binary, resulting in a crash.

freak12techno commented 1 month ago

I faced this yesterday on Neutron emergency upgrade on 2 nodes. Might be because I used --force flag as there was already an upgrade-info.json file there.

@julienrbrt can we get this prioritised please? as this is quite dangerous

Bosco-2019 commented 1 month ago

Seeing this same issue on Cosmos Provider testnet, the issue seems to be random as I have an identical node both used with horcrux one worked fine the other tried applying the upgrade immediately.

fish2plain commented 3 weeks ago

experiencing the same issue on current theta and provider testnet.

julienrbrt commented 3 weeks ago

Looking at this again right now!

pserdiuk-everstake commented 3 weeks ago

@julienrbrt I see a potential problem here, in scanner.go

    result, err := exec.Command(fw.currentBin, "status").Output() //nolint:gosec // we want to execute the status command
    if err != nil {
        return 0, err
    }
    // file exist but too early in height
    currentHeight, _ := fw.checkHeight()
    if currentHeight != 0 && currentHeight < info.Height {
        return false
    }

Avoid launching the node binary like that as much as possible, especially without any command-line options. A more thorough way to get data from the running node is to parse config/config.toml once, find its current RPC endpoint, and send an RPC request each time you need to get the current block height. E.g. some of my shell code:

# Function to get the local HTTP RPC endpoint from the node's configuration
# TODO: This is currently limited to Neutron. Cosmos SDK exposes a different "config" command (`config get config rpc.laddr`), will need to change this function to recognize that
get_local_rpc_endpoint() { "$NODE_BINARY" config --home "$NODE_DIR" | jq -er .node | sed s/tcp/http/; }

# Function to send a call to the HTTP RPC endpoint
call_rpc_endpoint() { curl -s --connect-timeout 5 --max-time 20 --retry 5 --retry-delay 0 --retry-max-time 5 "$1/$2" | jq -er .result; }

# Function to get the latest block height via HTTP RPC
get_latest_block_height() { call_rpc_endpoint "$1" "status?" | jq -er .sync_info.latest_block_height; }

# Function to check if we have proper access to the local node
validate_local_node() {
  LOCAL_RPC_ENDPOINT=""
  # Check if the node's binary is readable and executable and if the node's directory exists and is readable
  if [ -d "$NODE_DIR" ] && [ -r "$NODE_DIR" ] && [ -x "$NODE_BINARY" ] && [ -r "$NODE_BINARY" ]; then
    LOCAL_RPC_ENDPOINT=$(get_local_rpc_endpoint); [ -z "$LOCAL_RPC_ENDPOINT" ] && print_error_and_exit "node_error"
    return 0;
  else return 1; fi
}

# Main function to compare heights between the local node and the remote RPC
compare_heights() {
  local_height=$(get_latest_block_height "$LOCAL_RPC_ENDPOINT")
  remote_height=$(get_latest_block_height "$REMOTE_RPC_ENDPOINT")
  { [ -z "$local_height" ] || [ -z "$remote_height" ]; } && print_error_and_exit "curl_error"
  delta=$((remote_height - local_height))
  [ "$delta" -le "$NODE_LAG" ] && \
  { printf "Node %s is not lagging, delta with %s is %s, local height is %s, remote height is %s\n" \
  "$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 0; } || \
  { printf "Node %s is lagging, delta with %s is %s, local height is %, remote height is %s!\n" \
  "$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 1; }
}

This is significantly faster because it avoids launching another executable, even if for a moment (I'm still doing that but you can use something like a mandatory DAEMON_RPC_ENDPOINT variable instead). It is compatible with nodes that run the RPC server on a different port, which gaiad status won't pick up by itself without --node. If there are multiple nodes on one machine, this code can end up requesting the status of the node that runs on the default RPC port and using that value of latest_block_height to check if the upgrade should be applied immediately to a different node. It also avoids possible side effects related to outputs and exit codes of the status command. For example, in some earlier versions of CosmosSDK I noticed that this command was printing its output to stderr, not stdout, as did the main logger. Finally, it could resolve your TODO problem related to test binaries :) Of course, I don't know if either of these situations is the culprit behind this specific problem, but refactoring this part of the code to use an RPC call is a good idea anyway and will aid in any further debugging.

freak12techno commented 2 weeks ago

@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?

julienrbrt commented 2 weeks ago

@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?

Hey, now: https://github.com/cosmos/cosmos-sdk/releases/tag/tools%2Fcosmovisor%2Fv1.6.0

Bosco-2019 commented 1 week ago

@julienrbrt unfortunately this has not solved the problem. I just tried this on neutron manual upgrade.

julienrbrt commented 1 week ago

@julienrbrt unfortunately this has not solved the problem. I just tried this on neutron manual upgrade.

Hi, Could you post more info? (cosmovisor version and reproducing steps) During my tests it was working fine.

Bosco-2019 commented 1 week ago

Hi, Could you post more info? (cosmovisor version and reproducing steps) During my tests it was working fine.

cosmovisor version: v1.6.0

cosmovisor add-upgrade v4.2.1 ~/go/bin/neutrond --force --upgrade-height 13711950

EDIT: Scratch that @julienrbrt I see where i messed up. While i did upgrade cosmovisor I didn't restart the process before running the add-upgrade command so I was still actually on v1.5.0 when running the command. I have just tested again with a fake upgrade using the same binaries and it did indeed stop the node at the correct height to apply the upgrade.

Thank you.