cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.22k stars 3.6k forks source link

[Bug]: Upgrade from v0.47.1 -> v0.47.2 causing issues #16027

Closed ToasterTheBrave closed 1 year ago

ToasterTheBrave commented 1 year ago

Summary of Bug

When upgrading from cosmos-sdk@v0.47.1 to cosmos-sdk@v0.47.2 and starting a new testnet, the testnet fails to start and produce blocks.

The same steps worked previously on cosmos-sdk@v0.47.1. After the upgrade, no blocks are created and no obvious error messages are produced. The best diagnostic I have is the node's status output.

# curl -S localhost:26657/status | json_pp
{
   "id" : -1,
   "jsonrpc" : "2.0",
   "result" : {
      "node_info" : {
         "channels" : "40202122233038606100",
         "id" : "01038f88c61bbfdf38c79a6c50b207fcd1ee70f3",
         "listen_addr" : "192.168.56.21:26656",
         "moniker" : "paloma-priv-testnet-m1",
         "network" : "private-mousedove-432E2A",
         "other" : {
            "rpc_address" : "tcp://0.0.0.0:26657",
            "tx_index" : "on"
         },
         "protocol_version" : {
            "app" : "0",
            "block" : "11",
            "p2p" : "8"
         },
         "version" : "0.37.1"
      },
      "sync_info" : {
         "catching_up" : false,
         "earliest_app_hash" : "",
         "earliest_block_hash" : "",
         "earliest_block_height" : "0",
         "earliest_block_time" : "1970-01-01T00:00:00Z",
         "latest_app_hash" : "",
         "latest_block_hash" : "",
         "latest_block_height" : "0",
         "latest_block_time" : "1970-01-01T00:00:00Z"
      },
      "validator_info" : {
         "address" : "51BF73BFE54368893D50076B417B86DE71FE12E6",
         "pub_key" : {
            "type" : "tendermint/PubKeyEd25519",
            "value" : "pVW282uuYLl2NH82B8heC5GzN6B21cbCY0ATqDPm0Qo="
         },
         "voting_power" : "10000"
      }
   }
}
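A stuck chain can be confirmed from `sync_info` alone: a `latest_block_height` of `"0"` together with the epoch timestamp means no block was ever produced. A minimal sketch of that check (the inline JSON here stands in for the real `curl localhost:26657/status` response; field names match the output above):

```shell
# Sample response standing in for the real /status call.
status='{"result":{"sync_info":{"latest_block_height":"0","catching_up":false}}}'

# Extract the height with sed alone, so the check needs no jq.
height=$(printf '%s' "$status" | sed -n 's/.*"latest_block_height":"\([0-9]*\)".*/\1/p')

if [ "$height" = "0" ]; then
  echo "stuck at genesis"
else
  echo "producing blocks (height $height)"
fi
```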

Occasional error messages do appear, but they don't seem concerning: they're just a host trying to add itself to its own address book.

Failed to add new address
Cannot add ourselves with address 58efc00dab37e2d74a11bf846ad1b8934922368c@192.168.56.23:26656

Version

v0.47.2

Steps to Reproduce

This was found in palomachain. To reproduce, upgrade cosmos-sdk and cometbft there and start up a new testnet. I suspect it may be reproducible in other cosmos apps, but I have not tried.

To further demonstrate the issue, you can try submitting a proposal. We are submitting as follows:

palomad tx gov submit-legacy-proposal evm propose-new-chain bnb-main 56 1000000 24532924 0x746f51c42fc732c46beda0be2f58d74512841eac93fa4ee5512f77938eec7bf5 \
      --title "Support for Binance Smartchain Mainnet on $CHAIN_ID" \
      --description "Add Binance Smart Chain mainnet as supported chain." \
      --deposit 10000000ugrain \
      --fees 400ugrain \
      --gas auto \
      --from="$ADDRESS" \
      --yes

And it returns

Error: rpc error: code = Unknown desc = paloma is not ready; please wait for first block: invalid height
tac0turtle commented 1 year ago

Thanks for opening the issue. Could you link me to your code so I can test out the changes?

tac0turtle commented 1 year ago

> When upgrading from cosmos-sdk@v0.47.1 to cosmos-sdk@v0.47.2 and starting a new testnet, the testnet fails to start and produce blocks.

I'm running a testnet and upgraded with no issue; I tried a few times in case I'd missed something, with no luck.

I started a testnet with 0.47.1, ran it for a few blocks, submitted txs, then stopped it, upgraded to 0.47.2, and submitted more txs with success.

I also tested upgrading from 0.47.0 to 0.47.2, with blocks and transactions passing.

Could you walk me through what you did here?

ToasterTheBrave commented 1 year ago

Hi @tac0turtle! Thanks for looking into this.

I am starting a fresh paloma testnet with 0.47.2 and seeing these errors. The bootstrapping of the nodes lives in an Ansible runbook, but I will write up a gist to reproduce this.

taariq commented 1 year ago

Okay. So this seems to be a setup issue on the Paloma side of our private testnet. @ToasterTheBrave will close.

ToasterTheBrave commented 1 year ago

I've narrowed down exactly what's causing this. It appears to be a specific combination of logging config when starting up a new testnet: setting log_level = "debug" together with log_format = "json" breaks startup. I've got steps to reproduce using simapp.
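For reference, the breaking combination as it appears in CometBFT's config.toml (a minimal fragment; all other settings left at their defaults):

```
log_level = "debug"
log_format = "json"
```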

Environment:

Reproduction steps:

# all hosts
KEY=simapp-$(hostname)
MONIKER=simapp-priv-testnet-$(hostname)
CHAIN=private-simapp-810226

simd init $MONIKER --chain-id $CHAIN

# This is done so we don't have to set a passphrase in the example.
# This has been confirmed not to affect the reproduction.
sed -i 's/keyring-backend = ".*"/keyring-backend = "test"/g' /root/.simapp/config/client.toml

# Here is the breaking config
sed -i 's/^log_format = ".*"/log_format = "json"/g' /root/.simapp/config/config.toml
sed -i 's/^log_level = ".*"/log_level = "debug"/g' /root/.simapp/config/config.toml

simd keys add $KEY

mkdir /root/.simapp/config/gentx

# m1
simd genesis add-genesis-account $KEY 5000000000000stake
simd genesis gentx $KEY 10000000000stake --chain-id $CHAIN
scp .simapp/config/genesis.json root@192.168.56.22:/root/.simapp/config/
scp .simapp/config/gentx/* root@192.168.56.22:/root/.simapp/config/gentx/

# m2
simd genesis add-genesis-account $KEY 5000000000000stake
simd genesis gentx $KEY 10000000000stake --chain-id $CHAIN
scp .simapp/config/genesis.json root@192.168.56.23:/root/.simapp/config/
scp .simapp/config/gentx/* root@192.168.56.23:/root/.simapp/config/gentx/

# m3
simd genesis add-genesis-account $KEY 5000000000000stake
simd genesis gentx $KEY 10000000000stake --chain-id $CHAIN
simd genesis collect-gentxs
scp .simapp/config/genesis.json root@192.168.56.21:/root/.simapp/config/
scp .simapp/config/genesis.json root@192.168.56.22:/root/.simapp/config/

# All hosts: get node id
simd tendermint show-node-id

# All hosts: Update the config.toml persistent_peers
# example values:
#   m1: de66d98c8126f3853142be5c6917a0fc59dd8802@192.168.56.22:26656,5e71fddeb6e8b1e980f97a28ccd80a2cc893f18c@192.168.56.23:26656
#   m2: 90f986726e6e9992c04c1b8715b6391928893453@192.168.56.21:26656,5e71fddeb6e8b1e980f97a28ccd80a2cc893f18c@192.168.56.23:26656
#   m3: 90f986726e6e9992c04c1b8715b6391928893453@192.168.56.21:26656,de66d98c8126f3853142be5c6917a0fc59dd8802@192.168.56.22:26656

# All hosts: start the app
simd start
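The "update persistent_peers" step above can be scripted with sed; a sketch using m1's peer list from the example values (the node IDs and IPs are the placeholder values listed above, and the real file would be /root/.simapp/config/config.toml):

```shell
# Work on a throwaway config.toml so the example is self-contained.
CONFIG=config.toml
printf 'persistent_peers = ""\n' > "$CONFIG"

# m1's peer list from the example values above.
PEERS='de66d98c8126f3853142be5c6917a0fc59dd8802@192.168.56.22:26656,5e71fddeb6e8b1e980f97a28ccd80a2cc893f18c@192.168.56.23:26656'

# Replace the whole persistent_peers line in place.
sed -i "s|^persistent_peers = .*|persistent_peers = \"$PEERS\"|" "$CONFIG"
grep '^persistent_peers' "$CONFIG"
```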

Now, checking the status of simapp shows it is not producing blocks:

$ simd status | jq
{
  "NodeInfo": {
    "protocol_version": {
      "p2p": "8",
      "block": "11",
      "app": "0"
    },
    "id": "38d0ac703c2de98a4783a3b1d8a06c9678bf41de",
    "listen_addr": "tcp://0.0.0.0:26656",
    "network": "private-simapp-810226",
    "version": "0.37.1",
    "channels": "40202122233038606100",
    "moniker": "simapp-priv-testnet-m1",
    "other": {
      "tx_index": "on",
      "rpc_address": "tcp://127.0.0.1:26657"
    }
  },
  "SyncInfo": {
    "latest_block_hash": "",
    "latest_app_hash": "",
    "latest_block_height": "0",
    "latest_block_time": "1970-01-01T00:00:00Z",
    "earliest_block_hash": "",
    "earliest_app_hash": "",
    "earliest_block_height": "0",
    "earliest_block_time": "1970-01-01T00:00:00Z",
    "catching_up": false
  },
  "ValidatorInfo": {
    "Address": "3D04865B31075A120B22C5B1911930BA4B538469",
    "PubKey": {
      "type": "tendermint/PubKeyEd25519",
      "value": "03JnSP9fy2y/cJ7d8ScklYXUuaE8hZZzYOUkrrpOFf8="
    },
    "VotingPower": "10000"
  }
}
ToasterTheBrave commented 1 year ago

@alexanderbez Should this be reopened given the above ^?

tac0turtle commented 1 year ago

@julienrbrt are you able to double check this?

tac0turtle commented 1 year ago

@julienrbrt and I were unable to reproduce this issue with a variety of configurations. Closing this.

julienrbrt commented 1 year ago

What we can do is tag a v0.47.3-rc.0 for you to try. I still do not see how this could be logger related, but we did change the logger in v0.47.3. My trials were indeed working.

ToasterTheBrave commented 1 year ago

@julienrbrt I'll test out the new version. In your trials, were you using three nodes? I'm curious what could possibly be different between our example setups.

ToasterTheBrave commented 1 year ago

@julienrbrt I tested on v0.47.3. This is still a problem. Can you share more detail on how you ran the tests that didn't produce the error, so I can give more context if needed?

mmsqe commented 1 year ago

@ToasterTheBrave I get the same error when upgrading from v0.47.2 to v0.47.3. It seems related to the logger change; the test passes with the cometbft fix.

ToasterTheBrave commented 1 year ago

@mmsqe Thank you for the comment! Nice to know I'm not going insane and others have seen the issue :slightly_smiling_face:. It looks like you've already opened a ticket with cometbft but didn't get any traction?

mmsqe commented 1 year ago

> @mmsqe Thank you for the comment! Nice to know I'm not going insane and others have seen the issue :slightly_smiling_face:. It looks like you've already opened a ticket with cometbft but didn't get any traction?

I think they see the issue, but need more testing.

julienrbrt commented 1 year ago

> In your trials, were you using three nodes?

I was using in-process nodes via a tweaked simd testnet with a debug logger, so that may explain the difference.

Possibly, the new cometbft release resolves your issue when used with v0.47.3: https://github.com/cometbft/cometbft/blob/v0.37.2/CHANGELOG.md#security-fixes
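For anyone following along, picking up the fix means bumping the CometBFT requirement in the application's go.mod to the patched release (a minimal fragment, not a complete module file; versions per the comments in this thread):

```
require (
    github.com/cosmos/cosmos-sdk v0.47.3
    github.com/cometbft/cometbft v0.37.2
)
```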

ToasterTheBrave commented 1 year ago

@julienrbrt Yes indeed, this is fixed for us now with cometbft v0.37.2 :slightly_smiling_face: