AppHash does not match after upgrading to new release of BigchainDB

charlespetchsy commented 5 years ago

Bug Report

What computer are you on (hardware)? A cluster of 4 nodes
What operating system are you using, including version (e.g. Ubuntu 14.04, Fedora 23)? Ubuntu 16.04 LTS
What version of BigchainDB software were you using? Is that the latest version? v2.0.0b5

I am running 4 instances of BigchainDB with MongoDB 3.6 and Tendermint 0.22.8. It is a stock version of BigchainDB, installed using pip3.

Every time I upgrade BigchainDB to a new release, the database has to be dropped; otherwise a hash error occurs and Tendermint will not start.

The following log is from bigchaindb.log:

[2018-08-06 21:02:08] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:11] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:14] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:17] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:20] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:23] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:26] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:29] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:32] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:35] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:38] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
[2018-08-06 21:02:41] [WARNING] (bigchaindb.event_stream) WebSocket connection failed with exception Cannot connect to host localhost:26657 ssl:None [Connect call failed ('127.0.0.1', 26657)] (bigchaindb_ws_to_tendermint - pid: 10041)
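
The repeated warnings above all report the same condition: nothing is accepting TCP connections on localhost:26657, the address BigchainDB is trying to reach Tendermint on. A minimal sketch for checking that condition independently of BigchainDB, using only the Python standard library, could look like this. The function name `tendermint_rpc_reachable` is hypothetical and not part of BigchainDB or Tendermint; host and port are taken from the log lines.

```python
import socket


def tendermint_rpc_reachable(host="localhost", port=26657, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only tests that something is listening,
    not that the listener is a healthy Tendermint node.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while the warnings are being logged, the problem is on the Tendermint side (the process is not running or has panicked) rather than in the BigchainDB WebSocket client.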

And the hash error is from Tendermint:

ABCI Replay Blocks                           module=consensus appHeight=98 storeHeight=0 stateHeight=0
panic: Tendermint state.AppHash does not match AppHash after replay. Got 33656534316538633932343064633838633532306536653864663438323137306335356166373461306638353631653164343030313637363139633530656462, expected
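
A note on reading the "Got" value in the panic: every byte pair in it falls in the ASCII ranges for the characters 0-9 and a-f, so the value is hex-encoded text rather than a raw digest. The sketch below decodes it with the standard library. Interpreting the result as the app hash that BigchainDB reported as a hex string, which Tendermint then printed in hex again, is an assumption based on the 64-character length (the size of a 256-bit digest in hex), not something stated in the log.

```python
# "Got" value copied verbatim from the Tendermint panic message.
got = (
    "3365653431653863393234306463383863353230653665386466343832313730"
    "6335356166373461306638353631653164343030313637363139633530656462"
)

# Each pair of hex digits is one byte; the bytes are plain ASCII.
decoded = bytes.fromhex(got).decode("ascii")

print(len(got) // 2, "bytes decode to", len(decoded), "characters:")
print(decoded)
```

The "expected" side of the message is cut off in the log above, so it cannot be decoded the same way.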

At this point Tendermint is unable to run, so I have to clear Tendermint, drop the database via bigchaindb -y drop, and reload the assets.

My upgrade procedure for each node in the cluster:

1) sudo -H pip3 uninstall bigchaindb==2.0.0b4
2) sudo -H pip3 install bigchaindb==2.0.0b5

kansi commented 5 years ago

ABCI Replay Blocks                           module=consensus appHeight=98 storeHeight=0 stateHeight=0

Looking at the above log, how are you running the system (natively or in Docker)? It seems that Tendermint's logs were lost somehow, since storeHeight=0 and stateHeight=0, whereas appHeight=98, which is the number of blocks committed.
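
The comparison described here can be sketched in a few lines: pull the three heights out of the "ABCI Replay Blocks" line and check whether the application is ahead of Tendermint's block store. The function names are hypothetical, and the rule "app ahead of store means Tendermint has nothing to replay from" is a simplified reading of this thread, not Tendermint's actual handshake code.

```python
import re

# Log line copied from the Tendermint output quoted above.
LINE = ("ABCI Replay Blocks                           "
        "module=consensus appHeight=98 storeHeight=0 stateHeight=0")


def parse_heights(line):
    """Extract every key ending in 'Height' and its integer value."""
    return {key: int(value)
            for key, value in re.findall(r"(\w+Height)=(\d+)", line)}


def app_ahead_of_tendermint(heights):
    """True when the app has committed more blocks than Tendermint stores."""
    return heights["appHeight"] > heights["storeHeight"]


heights = parse_heights(LINE)
print(heights, "app ahead:", app_ahead_of_tendermint(heights))
```

For the line above, the application reports 98 committed blocks while Tendermint's store and state are both at 0, which is the mismatch being pointed out.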

charlespetchsy commented 5 years ago

@kansi I had to clear the system and create a new setup. The output above is just a reproduction of the error on a fresh machine. I'm also running the system natively without docker.

ldmberman commented 5 years ago

@charlespetchsy how do you upgrade to a new BigchainDB version? Do you reset Tendermint (e.g. via tendermint_unsafe_reset_all)?

charlespetchsy commented 5 years ago

@ldmberman I upgrade BigchainDB using pip3, and yes, I reset Tendermint with tendermint_unsafe_reset_all. I only reset Tendermint when it no longer connects to the BigchainDB cluster. Upgrading currently consists of resetting everything and re-creating each transaction.

ldmberman commented 5 years ago

@charlespetchsy right now we do not support the kind of replay needed when Tendermint is behind BigchainDB. So if you reset Tendermint, the node becomes non-operational.

I am investigating whether we can introduce support for such replay, but in any case the question remains why Tendermint did not connect to BigchainDB after the upgrade. Could you provide the Tendermint and BigchainDB logs from the time they failed to connect?

ldmberman commented 5 years ago

@charlespetchsy it's actually impossible to replay the blocks if Tendermint falls behind; that is not supposed to happen.

AppHash does not match after upgrading to new release of BigchainDB #2472
