Consensys / quorum

A permissioned implementation of Ethereum supporting data privacy
https://www.goquorum.com/
GNU Lesser General Public License v3.0

Losing blockchain data after removing raft log folders #1284

Closed thpun closed 2 years ago

thpun commented 2 years ago

System information

Quorum release version: 21.7.1

OS & Version: Ubuntu 20.04, running Quorum in Docker

Expected behaviour

Blockchain data persists after removing the raft log folders (i.e. quorum-raft-state/, raft-snap/ & raft-wal/), as described in https://consensys.net/docs/goquorum/en/latest/configure-and-manage/manage/node-network-migration/#peers-need-a-new-networking-configuration

This forces Raft to refresh the cluster state based on the latest information in the static-nodes.json without losing any of the blockchain data.
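For clarity, what I remove is only the Raft bookkeeping under the node's data directory, not the chain database. A rough sketch, assuming the data directory is /qdata/dd as in the 7-node example:

    # stop the node first, then remove only the Raft state folders
    rm -rf /qdata/dd/quorum-raft-state /qdata/dd/raft-snap /qdata/dd/raft-wal
    # geth/chaindata and the keystore are left untouched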

Actual behaviour

When all 3 folders (i.e. quorum-raft-state/, raft-snap/ & raft-wal/) are removed, the block number drops to zero. When only quorum-raft-state/ & raft-snap/ are removed, all nodes panic.

Steps to reproduce the behaviour

I started off with the classic 7-node example in https://github.com/ConsenSys/quorum-examples, using the command PRIVATE_CONFIG=ignore QUORUM_CONSENSUS=raft docker-compose up -d, and then edited docker-compose.yml to avoid data loss when restarting the containers (removing the lines that modify and re-initialize /qdata and /qdata/dd).
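(For reference, the entrypoint lines I removed are the ones that recreate and re-initialize the data directory on every start; they look roughly like the following, though the exact commands and paths depend on the version of the example compose file:)

    rm -rf /qdata
    mkdir -p /qdata/dd/keystore /qdata/dd/geth
    geth --datadir /qdata/dd init /path/to/genesis.json   # path illustrative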

After running for some time and submitting some transactions, the block number grew to 119.

Following https://consensys.net/docs/goquorum/en/21.10.0/configure-and-manage/manage/node-network-migration/#peers-need-a-new-networking-configuration, I am trying to add a new peer which is not on the same machine as the original 7 nodes, communicating over the public network.

My procedure would be:

  1. Stop all 7 nodes
  2. Modify docker-compose.yml and remove the network bridge part (because I suspect the bridge prevents raft communication from outside reaching the 7 nodes), binding the 7 nodes' raft ports to the host (50400 to 50406)
  3. Modify permissioned-nodes.json and static-nodes.json
  4. Remove the raft folders
  5. Start all 7 nodes

In step 4, when all 3 folders are removed, the block number on all nodes goes back to zero once the nodes are started (this is what I refer to as losing blockchain data).
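(The before/after check is just reading eth.blockNumber from the geth console on each node; the IPC path below assumes the 7-node example layout:)

    # inspect the head block number on a node
    geth attach --exec 'eth.blockNumber' /qdata/dd/geth.ipc
    # reports 119 before stopping the nodes, and 0 after step 5 when all three folders were removed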

I then restored the blockchain from a docker volume backup and retried everything from step 1. This time, in step 4, only quorum-raft-state/ and raft-snap/ were removed, and the nodes ran into the panic below.

DEBUG[12-20|03:44:46.557] Inserted new block                       number=116 hash="820a01…fca558" uncles=0 txs=1 gas=110343  elapsed="579.889µs" root="1f7e67…199f65"
INFO [12-20|03:44:46.557] Imported new chain segment               blocks=1 txs=1 mgas=0.110 elapsed="762.451µs" mgasps=144.721  number=116 hash="820a01…fca558" dirty=1.16MiB
DEBUG[12-20|03:44:46.557] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.557] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.557] Reinjecting stale transactions           count=0
DEBUG[12-20|03:44:46.557] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.557] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.557] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.557] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.557] Reinjecting stale transactions           count=0
DEBUG[12-20|03:44:46.558] AccountExtraData root after trie commit  root="56e81f…63b421"
DEBUG[12-20|03:44:46.558] Persisted trie from memory database      nodes=0 size=0.00B     time="3.569µs"  gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
DEBUG[12-20|03:44:46.559] AccountExtraData root after trie commit  root="56e81f…63b421"
DEBUG[12-20|03:44:46.559] Inserted new block                       number=117 hash="d823a0…f7d46a" uncles=0 txs=1 gas=484985  elapsed=1.856ms     root="e19434…340644"
INFO [12-20|03:44:46.559] Imported new chain segment               blocks=1 txs=1 mgas=0.485 elapsed=1.916ms     mgasps=252.993  number=117 hash="d823a0…f7d46a" dirty=1.19MiB
DEBUG[12-20|03:44:46.559] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.559] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.559] AccountExtraData root after trie commit  root="56e81f…63b421"
DEBUG[12-20|03:44:46.560] Persisted trie from memory database      nodes=0 size=0.00B     time="1.412µs"  gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
DEBUG[12-20|03:44:46.560] AccountExtraData root after trie commit  root="56e81f…63b421"
DEBUG[12-20|03:44:46.560] Inserted new block                       number=118 hash="0d67a5…2ccb20" uncles=0 txs=1 gas=110343  elapsed="667.712µs" root="2b3796…172b1b"
INFO [12-20|03:44:46.560] Imported new chain segment               blocks=1 txs=1 mgas=0.110 elapsed="772.235µs" mgasps=142.888  number=118 hash="0d67a5…2ccb20" dirty=1.19MiB
ERROR[12-20|03:44:46.560] error decoding block:                    err=EOF
DEBUG[12-20|03:44:46.560] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.560] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.560] Reinjecting stale transactions           count=0
DEBUG[12-20|03:44:46.560] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.560] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.560] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.560] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.560] Reinjecting stale transactions           count=0
DEBUG[12-20|03:44:46.560] AccountExtraData root after trie commit  root="56e81f…63b421"
DEBUG[12-20|03:44:46.560] Persisted trie from memory database      nodes=0 size=0.00B     time="1.535µs"  gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
DEBUG[12-20|03:44:46.560] AccountExtraData root after trie commit  root="56e81f…63b421"
DEBUG[12-20|03:44:46.560] Inserted new block                       number=119 hash="0f70dd…3d63bf" uncles=0 txs=1 gas=21000   elapsed="472.024µs" root="589ade…d3196c"
INFO [12-20|03:44:46.560] Imported new chain segment               blocks=1 txs=1 mgas=0.021 elapsed="550.293µs" mgasps=38.161   number=119 hash="0f70dd…3d63bf" dirty=1.19MiB
ERROR[12-20|03:44:46.560] error decoding block:                    err=EOF
INFO [12-20|03:44:46.560] startRaft                                raft ID=2
INFO [12-20|03:44:46.560] remounting an existing raft log; connecting to peers.
raft2021/12/20 03:44:46.561004 INFO: newRaft storagehardState{90 2 127 []} confState{[1] [] []}
raft2021/12/20 03:44:46.561012 INFO: newRaft config.peers[1] config.learners[]
raft2021/12/20 03:44:46.561021 INFO: 2 became follower at term 90
raft2021/12/20 03:44:46.561031 INFO: newRaft 2 learner: false [peers: [1], term: 90, commit: 127, applied: 1, lastindex: 129, lastterm: 13]
DEBUG[12-20|03:44:46.561] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.561] Account Extra Data root                  hash="000000…000000"
DEBUG[12-20|03:44:46.561] Reinjecting stale transactions           count=0
INFO [12-20|03:44:46.561] raft node started
WARN [12-20|03:44:46.561] -------------------------------------------------------------------
WARN [12-20|03:44:46.561] Referring to accounts by order in the keystore folder is dangerous!
WARN [12-20|03:44:46.561] This functionality is deprecated and will be removed in the future!
WARN [12-20|03:44:46.561] Please use explicit addresses! (can search via `geth account list`)
WARN [12-20|03:44:46.561] -------------------------------------------------------------------
INFO [12-20|03:44:46.563] confChange                               confState="{Nodes:[1 2] Learners:[] XXX_unrecognized:[]}"
INFO [12-20|03:44:46.563] ConfChangeAddNode                        raft id=2
INFO [12-20|03:44:46.563] ignoring expected ConfChangeAddNode for initial peer raft id=2
INFO [12-20|03:44:46.563] start snapshot                           applied index=1 last snapshot index=1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1153061]

goroutine 596 [running]:
github.com/ethereum/go-ethereum/raft.(*ProtocolManager).buildSnapshot(0xc0003a6480, 0x0)
        github.com/ethereum/go-ethereum/raft/snapshot.go:71 +0x2c1
github.com/ethereum/go-ethereum/raft.(*ProtocolManager).triggerSnapshot(0xc0003a6480, 0x2)
        github.com/ethereum/go-ethereum/raft/snapshot.go:100 +0x18d
github.com/ethereum/go-ethereum/raft.(*ProtocolManager).eventLoop(0xc0003a6480)
        github.com/ethereum/go-ethereum/raft/handler.go:964 +0x1014
created by github.com/ethereum/go-ethereum/raft.(*ProtocolManager).startRaft
        github.com/ethereum/go-ethereum/raft/handler.go:604 +0x95c

(If I don't remove the raft folders and the original docker network bridge, the new peer 8 on another machine cannot communicate with the raft network: the new node (peer 8) logs rafthttp: failed to find member 3 in cluster 1000, while the original nodes log rafthttp: failed to dial 8 on stream MsgApp v2 (peer 8 failed to find local node 1). But if I remove the raft folders, the blockchain data is lost 😢)

baptiste-b-pegasys commented 2 years ago

Thank you for raising this. Have you tried without docker? The 7nodes example can run without docker.

I made a simple test running the 7nodes example without docker: stop every node, remove all raft folders, start again, and I still have my contract and transaction.

I've done the same thing with docker-compose, and the data is gone as you mention (CTRL-C on docker-compose up, deleting only the raft folders on the volumes, docker-compose up).

Same with docker-compose stop and start.

baptiste-b-pegasys commented 2 years ago

I did docker-compose stop and docker-compose start without removing the raft folders, and I still lose the blockchain. My guess is that when the container starts, geth init is applied over the existing qdata.

baptiste-b-pegasys commented 2 years ago

Resolved by editing the docker-compose file: docker-compose.yml.txt. Here I removed all the rm, mkdir, and geth init commands for the quorum and tessera containers.

So the setup is done with the original docker-compose file, and then you should switch to this one.
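The two-phase workflow would then look something like this (the edited file name below is only an example):

    # first run: the original compose file creates the volumes and runs geth init
    docker-compose up -d
    docker-compose down            # volumes are kept by default

    # later runs: the edited file with the rm/mkdir/geth init lines stripped
    docker-compose -f docker-compose-no-init.yml up -d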

I will discuss whether we should have a better docker-compose file that inits once and reuses the volume. docker-compose down doesn't remove the volumes by default.
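One alternative would be to guard the init inside the entrypoint so a single compose file works for both the first start and restarts, e.g. (a sketch; checking for geth/chaindata is just one possible guard):

    # only initialize the data directory on the very first start
    if [ ! -d /qdata/dd/geth/chaindata ]; then
      mkdir -p /qdata/dd/keystore
      geth --datadir /qdata/dd init /path/to/genesis.json   # path illustrative
    fi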

antonydenyer commented 2 years ago

Hopefully this solved your issue; if not, please re-open.