Closed: alejandroalffer closed this issue 2 years ago.
@alejandroalffer I see references to go1.15.2, which makes me presume that you have been compiling from source. Could you please instead use either:
Then retry and let us know if the issue persists.
Thanks
Thanks for the feedback, @nmvalera!
In fact, the current geth version is compiled from source, in a Dockerized ubuntu:18.04:
root@e8cdc103b174:~# geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.15.2
Operating System: linux
GOPATH=
GOROOT=/usr/local/go
root@e8cdc103b174:~# ldd /usr/local/bin/geth
linux-vdso.so.1 (0x00007ffebaad8000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb5fc23b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb5fc033000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb5fbc95000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb5fb8a4000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb5fc45a000)
Downloading geth as proposed:
root@e8cdc103b174:~# md5sum /usr/local/bin/geth-from-sources /usr/local/bin/geth
40c6fe1443d4294824de5ff4f58ce855 /usr/local/bin/geth-from-sources
b68db91b96b1808daa24bb27b000aeb4 /usr/local/bin/geth
root@e8cdc103b174:~# /usr/local/bin/geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.13.15
Operating System: linux
GOPATH=
GOROOT=/opt/hostedtoolcache/go/1.13.15/x64
root@e8cdc103b174:~# ldd /usr/local/bin/geth
linux-vdso.so.1 (0x00007fff4aff8000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb2d68ec000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb2d66e4000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb2d6346000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb2d5f55000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb2d6b0b000)
It produces the same issue:
WARN [01-12|15:36:51.451|eth/downloader/downloader.go:336] Synchronisation failed, dropping peer peer=ae385305ccad4d03 err="retrieved hash chain is invalid"
ERROR[01-12|15:36:52.495|core/blockchain.go:2214]
########## BAD BLOCK #########
Chain config: {ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: <nil> TransactionSizeLimit: 64 MaxCodeSize: 24 Petersburg: <nil> Istanbul: <nil> PrivacyEnhancements: <nil> Engine: istanbul}
Number: 8597101
Hash: 0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5
0: cumulative: 48864 gas: 48864 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0x5136041eb879d49699e76bf64aed8207376cd0d1f42aa20d80613bad309bece4 logs: [0xc0003e13f0 0xc0003e14a0] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000000001000000000000000000000000400000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000002000000000000000200000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000400000000000000000000000000000 state:
1: cumulative: 97728 gas: 48864 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0xb0e8e529893614560fcd421310d68cd03794fe8a22e36d5140ba6cde5b4300af logs: [0xc0003e1550 0xc0003e1600] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000000001000000000000000000000000400000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000002000000000000000200000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000400000000000000000000000000000 state:
Error: invalid merkle root (remote: 0f6d6606b447b6fd26392f999e84be08fdf8b71f956b83116017dbb371ea1f1a local: 8a6cab008e2572a774a3c1eadc36269fa65662471c088652853db94e38ff8e59)
##############################
WARN [01-12|15:36:52.522|eth/downloader/downloader.go:336] Synchronisation failed, dropping peer peer=e01dc34eba4860ea err="retrieved hash chain is invalid"
^CINFO [01-12|15:36:53.444|cmd/utils/cmd.go:75] Got interrupt, shutting down...
INFO [01-12|15:36:53.444|node/node.go:443] http endpoint closed url=http://127.0.0.0:22000
INFO [01-12|15:36:53.445|node/node.go:373] IPC endpoint closed url=/root/alastria/data/geth.ipc
INFO [01-12|15:36:53.445|core/blockchain.go:888] Blockchain manager stopped
INFO [01-12|15:36:53.445|eth/handler.go:291] Stopping Ethereum protocol
INFO [01-12|15:36:53.446|eth/handler.go:314] Ethereum protocol stopped
INFO [01-12|15:36:53.446|core/tx_pool.go:408] Transaction pool stopped
INFO [01-12|15:36:53.446|ethstats/ethstats.go:131] Stats daemon stopped
I'll try:
@nmvalera, do you think we have the same problem with the chain as the one we showed you in issue https://github.com/ConsenSys/quorum/issues/1108?
Hi!
The problem persists while full syncing using the provided binary :-( . Using fast mode, everything finishes correctly.
export PRIVATE_CONFIG=ignore
geth --datadir /root/alastria/data --networkid 83584648538 --identity VAL_DigitelTS_T_2_8_01 --permissioned --cache 4096 --port 21000 --istanbul.requesttimeout 10000 --ethstats VAL_DigitelTS_T_2_8_01:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --verbosity 3 --emitcheckpoints --targetgaslimit 8000000 --syncmode full --vmodule consensus/istanbul/core/core.go=5 --debug --vmdebug --nodiscover --mine --minerthreads 2
This was a fresh database, after "geth removedb --datadir /root/alastria/data_DONOTCOPYPASTER" and "geth --datadir /root/alastria/data init /root/genesis.json", and restoring the original enode key.
root@e8cdc103b174:~# cat /root/genesis.json # the standard one in Alastria
{
"alloc": {
"0x58b8527743f89389b754c63489262fdfc9ba9db6": {
"balance": "1000000000000000000000000000"
}
},
"coinbase": "0x0000000000000000000000000000000000000000",
"config": {
"chainId": 83584648538,
"byzantiumBlock": 0,
"homesteadBlock": 0,
"eip150Block": 0,
"eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"eip155Block": 0,
"eip158Block": 0,
"istanbul": {
"epoch": 30000,
"policy": 0
},
"isQuorum": true
},
"extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f85ad594b87dc349944cc47474775dde627a8a171fc94532b8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
"gasLimit": "0x2FEFD800",
"difficulty": "0x1",
"mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
"nonce": "0x0",
"parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"timestamp": "0x00"
}
The binary file you provided:
root@e8cdc103b174:~# /usr/local/bin/geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.13.15
Operating System: linux
GOPATH=
GOROOT=/opt/hostedtoolcache/go/1.13.15/x64
root@e8cdc103b174:~# md5sum /usr/local/bin/geth
b68db91b96b1808daa24bb27b000aeb4 /usr/local/bin/geth
The binary I compiled myself:
root@e8cdc103b174:~# /usr/local/bin/geth-from-sources version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.15.2
Operating System: linux
GOPATH=
GOROOT=/usr/local/go
Could any of these files help in the solution?
[...]
-rw-r--r-- 1 root root 2146445 Jan 14 07:31 010075.ldb
-rw-r--r-- 1 root root 180033388 Jan 14 07:39 010077.ldb
-rw-r--r-- 1 root root 1663848 Jan 14 07:39 MANIFEST-000004
-rw-r--r-- 1 root root 136328749 Jan 14 08:41 010076.log
-rw-r--r-- 1 root root 555890 Jan 14 08:41 LOG
[...]
In order to run the same test using the Docker image provided by Quorum... could I have access to the original Dockerfile used for https://hub.docker.com/r/quorumengineering/quorum?
@nmvalera , do you think that we have the same problem with the chain as we have showed you on the issue: #1108?
I'm pretty sure the problem is different: this one is about syncing a new node in full mode, and @carlosho17's issue is related to the new storage model for the chain database.
Better debug output attached:
root@e8cdc103b174:~# geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.13.15
Operating System: linux
GOPATH=
GOROOT=/opt/hostedtoolcache/go/1.13.15/x64
Geth arguments:
geth --datadir /root/alastria/data --networkid 83584648538 --identity VAL_DigitelTS_T_2_8_01 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats VAL_DigitelTS_T_2_8_01:_DONOT_SHOW@_DONOT_SHOW:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password NO_CLEAN --metrics.influxdb.tags host=VAL_DigitelTS_T_2_8_01 --verbosity 5 --cache 10 --nousb --maxpeers 200 --nousb
Error log after a fresh chaindb install: err.full.gz
Looking at the logs I notice that you haven't cleared the freezer db:
INFO [01-16|08:48:01.963] Opened ancient database database=/root/alastria/data/geth/chaindata/ancient
DEBUG[01-16|08:48:01.964] Ancient blocks frozen already number=8597100 hash=e4d6ea…6ca9ca frozen=5434860
So you're getting the BAD BLOCK on the first block your node is trying to download during the sync (block 8597101). It may be worthwhile deleting the freezer in addition to removing the chaindb, so that you start with a completely clean node.
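For example, a minimal sketch of a full clean-up, assuming the datadir layout shown in the logs above (stop geth first; the freezer path is the one reported in the "Opened ancient database" line):
# remove the chain database
geth removedb --datadir /root/alastria/data
# on some versions the ancient (freezer) store can survive the removal above,
# so delete it explicitly as well
rm -rf /root/alastria/data/geth/chaindata/ancient
# re-initialise from the genesis file before resyncing
geth --datadir /root/alastria/data init /root/genesis.json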
Thank you for the answer, @SatpalSandhu61. The problem persists after a clean-up of the chaindb:
The log starts at the wrong block number because I restarted geth, just to keep the file smaller.
I believe you may not have fully understood my comment regarding clearing the freezer db. Please read the section on the freezer, which was introduced with the merge from v1.9.7 upstream geth: https://blog.ethereum.org/2019/07/10/geth-v1-9-0/
Hi
just to recap on this issue for the Alastria Quorum Network.
We all stumble upon a certain block when using geth 1.9.7 that yields this message:
DEBUG[01-15|11:28:01.360] Downloaded item processing failed number=8597101 hash=e4a2d7…49b2e5 err="invalid merkle root (remote: 0f6d6606b447b6fd26392f999e84be08fdf8b71f956b83116017dbb371ea1f1a local: 8a6cab008e2572a774a3c1eadc36269fa65662471c088652853db94e38ff8e59)"
We have spent the last few weeks trying all scenarios (fast and full sync, erasing the whole data directory and reinitializing with geth init while preserving the nodekey, fresh installations, different Ubuntu versions, the Quorum tgz package, in-place compilation with Go 1.15 and 1.13, etc.). These tests have been performed not only by us on the Core Team but also by regular members.
The result is always the same: block 8597101 is where newer Quorum finds a bad merkle root and stops syncing.
Our workaround is to install an older version, let it sync past block 8597101, and then switch to Quorum 20.10.x. A second workaround is to start fresh but with a copy of the chain that is already past the bad block.
What we would like to know is whether the bad merkle root found by Quorum 20.10.x is a feature or a bug.
Thank you
The problem still persists in the new version, 21.1.0: the sync process stops forever at block 8597100 when using full mode.
We are using a new database, starting the sync from scratch. The problem repeats in every case:
$ export PRIVATE_CONFIG=ignore
$ geth --datadir /root/alastria/data --networkid 83584648538 --identity BOT_DigitelTS_T_2_8_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats BOT_DigitelTS_T_2_8_00:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password ala0str1AX1 --metrics.influxdb.tags host=BOT_DigitelTS_T_2_8_00 --verbosity 5 --cache 8192 --nousb --maxpeers 256
instance: Geth/VAL_DigitelTS_T_2_8_01/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5
> eth.syncing
{
currentBlock: **8597100**,
highestBlock: 61148125,
knownStates: 0,
pulledStates: 0,
startingBlock: 0
}
I have created a new log file with the last lines: they are repeated forever.
In order to make progress on this problem, we could allow an enode address into the network for developers to do their own testing.
The Alastria ecosystem, with more than 120 nodes, is waiting on this issue to proceed with the version migration.
Last few lines from linux console: sync-fails.txt
Full trace from start of the synchronization:
geth --datadir /root/alastria/data --networkid 83584648538 --identity BOT_DigitelTS_T_2_8_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats BOT_DigitelTS_T_2_8_00:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password ala0str1AX1 --metrics.influxdb.tags host=BOT_DigitelTS_T_2_8_00 --verbosity 5 --cache 8192 --nousb --maxpeers 256 --vmdebug 2> /root/alastria/data/full_sync
https://drive.google.com/file/d/1rx7bzJdygwomRBMfRn3Bftczf6nwuAeJ/view?usp=sharing
Hi, you stated "We are using a new database, starting the sync from scratch."
However, as per my earlier response, please confirm that in addition to removing the chaindb you are also deleting the freezer db. The freezer db is not deleted when you perform a geth init.
I suggest you read the section "Freezer tricks" in the geth 1.9 release notes.
As mentioned earlier in the thread, the invalid merkle root error usually occurs when there is a db corruption or inconsistency. This is an issue in upstream geth; here are a few examples of issues raised for this:
Hi @SatpalSandhu61, thanks for the feedback,
I promise the directory was empty. However, I have repeated the process on a newly created path, and the problem repeats: full sync mode hangs at block 8597100.
I have looked at the linked issues, and it seems the problem is related to some versions of official geth and is solved as of version 1.9.23-stable. However, GoQuorum 21.1.0 is based on v1.9.7, a long way from the version that could fix the problem.
One last consideration: this is a permanent error, and always reproducible. Alastria has more than 100 active nodes, and the migration process to GoQuorum 20.xx / 21.xx is waiting on the results of these tests; any help will be appreciated.
{
admin: {
datadir: "/home/alastria/data-full",
nodeInfo: {
enode: "enode://beabec74344fc143c9585017c940a94f0b7915024de2d632222e0ef58a1e6c9b3520d2d3e1ada304ef5b1652ba679f2f9686190f83d89d5f81410d0a9680881e@46.27.166.130:21000?discport=0",
enr: "enr:-JC4QHN8R874S81ttpNdPBLM72SF4M0vgyBnSmyhfB9fBcKKXVH9EEfCYGD8-HFY1HTuy0QLzSNL2c7rzCq-a4PHKvgGg2V0aMfGhEXl0IiAgmlkgnY0gmlwhC4bpoKJc2VjcDI1NmsxoQK-q-x0NE_BQ8lYUBfJQKlPC3kVAk3i1jIiLg71ih5sm4N0Y3CCUgg",
id: "3713f5a6c14042c2483ede889f88e36ce70b870ada6087f45b41976527128e62",
ip: "46.X.Y.Z",
listenAddr: "[::]:21000",
name: "Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5",
plugins: {},
ports: {
discovery: 0,
listener: 21000
},
protocols: {
istanbul: {...}
}
},
peers: [],
[...]
eth: {
accounts: [],
blockNumber: 8597100,
coinbase: "0x9f88e36ce70b870ada6087f45b41976527128e62",
compile: {
lll: function(),
serpent: function(),
solidity: function()
},
defaultAccount: undefined,
defaultBlock: "latest",
gasPrice: 0,
hashrate: 0,
mining: false,
pendingTransactions: [],
protocolVersion: "0x63",
syncing: {
currentBlock: 8597100,
highestBlock: 61898986,
knownStates: 0,
pulledStates: 0,
startingBlock: 8597102
},
call: function(),
[...]
version: {
api: "0.20.1",
ethereum: "0x63",
network: "83584648538",
node: "Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5",
whisper: undefined,
getEthereum: function(callback),
getNetwork: function(callback),
getNode: function(callback),
getWhisper: function(callback)
},
Any way to get more information from the faulty node via the debug.traceBlock* commands?
@alejandroalffer
Would it be possible to share some history of the network migrations with
Could you test a full sync of a node from scratch using lower GoQuorum versions (using official binaries) and let us know the highest GoQuorum version that gets past block 8597100? I would recommend starting with GoQuorum v2.5.0 (which is the latest GoQuorum version based on Geth 1.8.18).
Thanks a lot.
Thanks @nmvalera, we will start sharing that with you next week.
Hi @nmvalera ,
thanks for the feedback.
Geth/v1.8.18-stable(quorum-v2.2.3-0.Alastria_EthNetstats_IBFT)/linux-amd64/go1.9.5
So far there have been no updates since that version; only a few nodes are still on version 1.8.2. In fact, we are working on a renewal of the network, and a fundamental part of it is taking advantage of the new versions of GoQuorum: bug fixes, monitoring, ...
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13"
> (finish ok)
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-20c95e5d(quorum-v2.4.0)/linux-amd64/go1.11.13"
> (finish ok)
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-685f59fb(quorum-v2.5.0)/linux-amd64/go1.11.13"
> (finish ok)
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-9339be03(quorum-v2.6.0)/linux-amd64/go1.13.10"
> (STOP SYNCING)
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL)
v20.10.0 ADDED
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-af752518(quorum-v20.10.0)/linux-amd64/go1.13.15"
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL)
v21.1.0 ADDED
> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5"
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL)
All the tests were made in this environment:
root@alastria-01:~# ldd /usr/local/bin/geth
linux-vdso.so.1 (0x00007ffeb65e7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb6c3f64000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb6c3f59000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb6c3e0a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb6c3c18000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb6c3f8f000)
root@alastria-01:~# uname -a
Linux alastria-01 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@alastria-01:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
This is a permanent error, always reproducible for every new node of the Alastria GoQuorum network.
Thanks to the entire GoQuorum team at ConsenSys for supporting the Alastria-T network.
Thanks, this helps a lot.
Can you do the following:
In the meantime, we will dig into the error.
Thanks @nmvalera!
Yes! We have already tested the proposed workaround: once the database is fully synchronized (with a version prior to v2.6.0), the binary can be upgraded without problems (just minor changes to the metrics arguments). It also works for a direct upgrade from v2.5.0 to v21.1.0.
Please keep up the effort in searching for a solution: we want new Alastria partners to be able to perform a direct full-mode synchronization of their nodes, using the latest versions of GoQuorum, to maintain the "trust" of the network.
Thanks, we are discussing this internally and we will keep you updated (we may require some more information from you at some point, we'll let you know).
@alejandroalffer please also review the migration docs for upgrading from earlier versions of Quorum to 2.6.0 and above. A bad block can sometimes be caused by not setting istanbulBlock and petersburgBlock in the genesis.json, so it will be good to eliminate that as a possibility.
(EDIT)
To summarise, please try a full sync with istanbulBlock and petersburgBlock set in genesis.json so we can eliminate the possibility that this is the cause of the bad block. For now you can set them to some arbitrary block very far in the future; the values can be updated later when you have an idea of when the network will be ready to move to these forks.
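For illustration, a minimal sketch of where these fields go (the block numbers below are placeholders, not a recommendation): they belong directly under config, alongside the other fork blocks, not inside the istanbul object:
{
  "config": {
    "chainId": 83584648538,
    "byzantiumBlock": 0,
    "petersburgBlock": 100000000,
    "istanbulBlock": 100000000,
    "istanbul": { "epoch": 30000, "policy": 0 },
    "isQuorum": true
  }
}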
@alejandroalffer Could you please confirm that the version you are looking to migrate from,
Geth/v1.8.18-stable(quorum-v2.2.3-0.Alastria_EthNetstats_IBFT)/linux-amd64/go1.9.5
is not an official GoQuorum version but, as I imagine, Alastria's own custom fork?
Thanks.
@alejandroalffer @cmoralesdiego
Any news on the 2 topics above?
Thanks a lot.
Hi @nmvalera, we are going to give you feedback early next week. Thanks in advance.
Hi @nmvalera , @cmoralesdiego
Sorry for the delay. I've tried restarting the synchronization in full mode using different values for the istanbulBlock parameter, but always with the same result: the process stops at the block from hell ;-)
root@alastria-01:/home/iadmin# diff /root/genesis.json-original /root/genesis.json
18c18,20
< "policy": 0
---
> "policy": 0,
> "petersburgBlock": 10000000,
> "istanbulBlock": 10000000
I've tried some values... from setting it to 0 up to the last one, beyond block 8597101, with the same result.
The logs show hundreds of messages like VM returned with error err="evm: execution reverted" prior to the failure.
root@alastria-01:~# md5sum /tmp/log.v21.1.0.txt.gz
e10f9eb8bfd584deaad2267f9c6da791 /tmp/log.v21.1.0.txt.gz
On the other hand, there was a fork for the Alastria network, with minor updates to improve reporting in EthNetStats, but later versions based on the same version of geth and official releases of GoQuorum finish the synchronization in full mode without problems:
Geth v1.8.18 · GoQuorum v2.2.3 - Alastria version, finish
Geth v1.8.18 · GoQuorum v2.4.0 - Official version, finish
Geth v1.8.18 · GoQuorum v2.5.0 - Official version, finish
Geth v1.9.7 · GoQuorum v2.6.0 - Official version, fails
Geth v1.9.7 · GoQuorum v20.10.0 - Official version, fails
Geth v1.9.7 · GoQuorum v21.1.0 - Official version, fails
IMHO, the problem appears in the upgrade from Geth 1.8.18 to Geth 1.9.7.
Best regards!
@alejandroalffer from your log I see
INFO [02-22|10:56:30.757] Initialised chain configuration config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: <nil> TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: <nil> Istanbul: <nil> PrivacyEnhancements: <nil> Engine: istanbul}"
Petersburg: <nil> Istanbul: <nil> suggests your updated genesis is not being used. Please make sure you run geth init /path/to/updated/genesis.json to apply the genesis updates before attempting a resync. As a comparison, I see the following in my logs when starting a node with these values set in my genesis.json:
INFO [02-23|10:49:35.554] Initialised chain configuration config="{ChainID: 720 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 100 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 100 Istanbul: 100 PrivacyEnhancements: <nil> Engine: istanbul}"
In addition to setting istanbulBlock and petersburgBlock, you may want to try also setting constantinopleBlock.
To be clear, the values you set for these fork blocks should be a future block, otherwise you will be processing old transactions with the new protocol features these settings enable.
(EDIT: 24 Feb) See the sample genesis in quorum-examples for an example of how to configure these fork blocks.
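As a quick sanity check (a sketch reusing the paths already mentioned in this thread), re-run init against the updated genesis and verify the fork blocks show up in the startup banner:
geth --datadir /root/alastria/data init /root/genesis.json
# on the next start, the "Initialised chain configuration" line should report
# Constantinople/Petersburg/Istanbul with your chosen block numbers instead of <nil>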
Thanks for the feedback, @chris-j-h:
You were right: I had been putting the istanbulBlock and petersburgBlock parameters inside the istanbul {} object. I've repeated the test with them in the config {} object, adding constantinopleBlock as suggested, with different values... the last one far beyond the end of the chain, with the same result: the sync process fails in full mode:
Some logs...
INFO [02-24|23:24:05.143] Initialised chain configuration config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 100000000 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 100000000 Istanbul: 100000000 PrivacyEnhancements: <nil> Engine: istanbul}"
The Alastria network is at block ~63.000.000, and I've used 100.000.000 as the value:
> eth.blockNumber
63133729
root@alastria-01:~# cat genesis.json
{
"alloc": {
"0x58b8527743f89389b754c63489262fdfc9ba9db6": {
"balance": "1000000000000000000000000000"
}
},
"coinbase": "0x0000000000000000000000000000000000000000",
"config": {
"chainId": 83584648538,
"byzantiumBlock": 0,
"homesteadBlock": 0,
"eip150Block": 0,
"eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"eip155Block": 0,
"eip158Block": 0,
"istanbulBlock": 100000000 ,
"petersburgBlock": 100000000,
"constantinopleBlock": 100000000,
"istanbul": {
"epoch": 30000,
"policy": 0,
"petersburgBlock": 0,
"istanbulBlock": 0
},
"isQuorum": true
},
"extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f85ad594b87dc349944cc47474775dde627a8a171fc94532b8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
"gasLimit": "0x2FEFD800",
"difficulty": "0x1",
"mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
"nonce": "0x0",
"parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"timestamp": "0x00"
}
The start script:
VER="v21.1.0"
export PRIVATE_CONFIG="ignore"
/usr/local/bin/geth --datadir /home/alastria/data-${VER} --networkid 83584648538 --identity REG_DigitelTS-labs_2_2_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --ethstats REG_DigitelTS-labs_2_2_00:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --verbosity 3 --vmdebug --emitcheckpoints --targetgaslimit 8000000 --syncmode full --gcmode full --vmodule consensus/istanbul/core/core.go=5 --nodiscover --cache 4096 2> /tmp/log.${VER}
And the result:
pi@deckard:~ $ md5sum log.v21.1.0.gz
8a5d2b1355b3e0c0690e2aafa263781f log.v21.1.0.gz
[log.v21.1.0.gz](https://github.com/ConsenSys/quorum/files/6040662/log.v21.1.0.gz)
There's another point, and maybe it's not relevant: using the value 0, the failure happens at an earlier block, 48704, with a different error:
[...]
INFO [02-25|05:33:42.297] Initialised chain configuration config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 0 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 0 Istanbul: 0 PrivacyEnhancements: <nil> Engine: istanbul}"
[...]
INFO [02-25|05:34:22.010] Imported new chain segment blocks=2048 txs=0 mgas=0.000 elapsed=1.485s mgasps=0.000 number=47680 hash=5c79ea…3aae91 age=2y1mo6d dirty=0.00B
ERROR[02-25|05:34:22.736]
########## BAD BLOCK #########
Chain config: {ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 0 TransactionSizeLimit: 64 MaxCodeSize: 24 Petersburg: 0 Istanbul: 0 PrivacyEnhancements: <nil> Engine: istanbul}
Number: 48704
Hash: 0x9f7f3734ad532365a2f2e10fe8f9c308d0d45ac1e018742a676fd20ce6a5f75b
0: cumulative: 22032 gas: 22032 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0xf02ea502f6c171789bfcb686e468ad2adde0a710e66ce41155c25af30c9ac633 logs: [] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 state:
1: cumulative: 44064 gas: 22032 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0x942736bd1648bfce11e578aeff59ee05bc7f2d220dedfda0e97da4d36d1c123e logs: [] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 state:
Error: invalid gas used (remote: 48432 local: 44064)
##############################
WARN [02-25|05:34:22.740] Synchronisation failed, dropping peer peer=a2054ebfafb0f0f5 err="retrieved hash chain is invalid"
ERROR[02-25|05:34:30.838]
Thanks again for not giving up!
Best regards!
Making the log from @alejandroalffer's above comment https://github.com/ConsenSys/quorum/issues/1107#issuecomment-785637265 clickable: log.v21.1.0.gz
@alejandroalffer as you noted earlier, your logs have a large number of VM returned with error "evm: execution reverted" messages when doing a full sync. There are also quite a few other VM returned with error messages (see the full list below).
Do you see any of these when doing a full sync from block 0 with your current Alastria version or pre-v2.6.0 Quorum?
count | VM returned with error |
---|---|
11940 | "evm: execution reverted" |
190 | "out of gas" |
126 | "stack underflow (0 <=> 1)" |
23 | "evm: max code size exceeded" |
21 | "contract creation code storage out of gas" |
6 | "stack underflow (0 <=> 13)" |
55 | "invalid opcode 0x1b" |
21 | "invalid opcode 0xfe" |
20 | "invalid opcode 0x4f" |
16 | "invalid opcode 0x1c" |
9 | "invalid opcode 0x27" |
7 | "invalid opcode 0xef" |
6 | "invalid opcode 0xa9" |
3 | "invalid opcode 0xd2" |
3 | "invalid opcode 0xda" |
Hi!
I've used GoQuorum v2.5.0: the last version in which full synchronization finishes correctly. As you know, it's based on Geth v1.8.18.
The "bad block" is reached and passed, and the VM returned with error log entries seem quite similar in format and in number:
root@alastria-01:/tmp# zcat log.v2.5.0.gz |grep "VM returned with error"|cut -f3- -d" "|sort|uniq -c
20 VM returned with error err="contract creation code storage out of gas"
22267 VM returned with error err="evm: execution reverted"
21 VM returned with error err="evm: max code size exceeded"
106 VM returned with error err="invalid opcode 0x1b"
27 VM returned with error err="invalid opcode 0x1c"
21 VM returned with error err="invalid opcode 0x23"
7 VM returned with error err="invalid opcode 0x27"
17 VM returned with error err="invalid opcode 0x4f"
4 VM returned with error err="invalid opcode 0xa9"
3 VM returned with error err="invalid opcode 0xd2"
2 VM returned with error err="invalid opcode 0xda"
7 VM returned with error err="invalid opcode 0xef"
17 VM returned with error err="invalid opcode 0xfe"
3823 VM returned with error err="out of gas"
130 VM returned with error err="stack underflow (0 <=> 1)"
6 VM returned with error err="stack underflow (0 <=> 13)"
2 VM returned with error err="stack underflow (0 <=> 3)"
The full log is here: log.v2.5.0.gz
root@alastria-01:/tmp# md5sum log.v2.5.0.gz
505f207b66846dc4e20170cd70bd7561 log.v2.5.0.gz
BTW... the process hangs near block 10.000.000, because of invalid gas used. I've used a genesis.json with istanbulBlock, petersburgBlock and constantinopleBlock set to this value, but let's keep the focus on the merkle tree error.
[...]
"istanbulBlock": 10000000,
"petersburgBlock": 10000000,
"constantinopleBlock": 10000000,
[...]
Thanks again!
@alejandroalffer said:
BTW... the process hangs near block 10.000.000, because of invalid gas used. I've used a genesis.json with istanbulBlock, petersburgBlock and constantinopleBlock set to this value, but let's keep the focus on the merkle tree error.
[...] "istanbulBlock": 10000000, "petersburgBlock": 10000000, "constantinopleBlock": 10000000, [...]
These values should be a future block that hasn't been seen yet. In an earlier comment you said you set the values to 100,000,000. That should fix your problem.
Hi @alejandroalffer
Block 8,597,101 contains 2 txs sent to 0x4F541bab8aD09638D28dAB3b25dafb64830cE96C, which both execute method 0xd30528f2 (from the tx input).
I was unable to get a list of all txs to this contract on your block explorer. Do you know if this is the first block where this method on the contract is executed?
Does your network only use public transactions?
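For reference, a quick console sketch (run on a node that already has this block, e.g. a non-upgraded one) to list the target and 4-byte selector of each transaction in block 8,597,101:
// 8597101 = 0x832e6d; print each tx's target contract and input selector
eth.getBlock(8597101, true).transactions.forEach(function (tx) {
  console.log(tx.to, tx.input.slice(0, 10));
});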
Let’s try and track down exactly where the state is deviating from what is expected:
On both the upgraded and the non-upgraded node, can you do the following:
debug.dumpBlock('0x832e6c') and compare outputs.
This is block 8,597,100 and will allow us to confirm that both are starting from the same point. Because this block was able to sync, I expect they will be the same.
If there is too much output you can try debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6c'), which will only return the state dump for the contract involved in the problem block.
debug.dumpBlock('0x832e6d') and compare outputs.
This is block 8,597,101. Again, you can try debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6d') if there is too much output.
If you get block not found on the upgraded node, that's fine.
On the upgraded node that is failing to sync can you do the following:
> debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')
[...]
If you see something like structLogs: [{...}, {...}, in the output, also do:
> debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs
[...]
> debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[1].result.structLogs
[...]
to get the full trace for both transactions in the block.
On the non-upgraded node can you do the following:
> debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')
Again, if the output has something like structLogs: [{...}, {...}, also do:
> debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs
[...]
> debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[1].result.structLogs
[...]
to get the full trace for both transactions in the block.
If you can share each of these outputs we can do some comparisons and see where the state is deviating.
If you run into any problems it may be easier to discuss on the Quorum Slack. Feel free to msg me if needed.
Hi @chris-j-h , and the rest of GoQuorum team...
Answering the questions, here is a summary:
1) I wasn't able to execute debug.dumpBlock('0x832e6c') as suggested. Access to the command is only allowed via RPC (because of a bug in the console). On both GoQuorum 1.8.18 and GoQuorum 2.5 the process hangs after 5-7 minutes of waiting, despite 24 GB dedicated servers.
2) The transaction traces for block 8597101 seem to be the same in both versions...
=== LOG GoQuorum 2.5
/geth.ipc --exec "admin.nodeInfo.name" > /tmp/log-2.5.0.txt
/geth.ipc --exec " debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs" >> /tmp/log-2.5.0.txt
/geth.ipc --exec " debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[1].result.structLogs" >> /tmp/log-2.5.0.txt
=== LOG GoQuorum 21.01
/geth.ipc --exec "admin.nodeInfo.name" > /tmp/log-21.01.txt
/geth.ipc --exec "debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs" >> /tmp/log-21.01.txt
/geth.ipc --exec "debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[1].result.structLogs" >> /tmp/log-21.01.txt
ladmin@DESKTOP-UK0SQ8D:~$ diff log-2*
1c1
< "Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-685f59fb(quorum-v2.5.0)/linux-amd64/go1.11.13"
---
> "Geth/REG_DigitelTS-dev_2_8_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5"
The full log:
fba2df51782905ff1516d85b2ac25ac4 /tmp/log-2.5.0.txt.gz log-2.5.0.txt.gz
37eb330b4468888c117bee742f180051 /tmp/log-21.01.txt.gz log-21.01.txt.gz
Keep in touch!
Thanks again!
Hi!
I'm updating the status of the issue with this week's news, to share it with the Alastria and ConsenSys teams.
We did some searching in the chaindb (thanks for the snippet, @chris-j-h) to find out whether the method used in the bad block, 0xd30528f2, had already been called in earlier blocks:
// scan the chain and report every tx whose input starts with the 0xd30528f2 selector
for (i = 1; i < 9999999; i++) {
hexBlockNumber = "0x" + i.toString(16)
txs = eth.getBlockByNumber(hexBlockNumber, true).transactions
for (j = 0; j < txs.length; j++) {
if (txs[j].input.slice(0, 10) == "0xd30528f2") {
console.log("tx calling method 0xd30528f2 found in block " + i)
}
}
}
The result is that this transaction appears in several earlier blocks, 657 in total:
tx calling method 0xd30528f2 found in block 7809408
[...]
tx calling method 0xd30528f2 found in block 9231310
It seems that this transaction is not related to the synchronization problem :-(
To summarise, from @chris-j-h:
[...] we’ve done a lot of investigations so far to identify the cause of the diverging state and unfortunately not had much luck. To summarise some of the results:
• dump full state using API to compare: state is too large, results in out-of-memory error and crashes node
• dump just state for the contract called by the transactions in block 8597101: bug in API for pre-2.6.0 quorum, doesn’t cover any contracts that might be called by the initial contract
• transactions to the same contract method have been called previously so there is nothing inherently broken with the contract due to the quorum upgrade [...]
We'll keep investigating the out-of-memory crashes in order to get the results from the RPC API:
debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6c') and debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6d')
And these references:
Any other suggestion will also be appreciated.
Thanks again @chris-j-h!
I'm assuming this has been fixed now, feel free to re-open if that is not the case
It is not the case; I'm facing the same issue in a different Quorum network.
Can you raise a fresh ticket with the genesis and param details, along with the exact version it stopped working at?
Root cause identified (at least one of the consensus issues that cause the invalid merkle [state] root error): Quorum version 2.1.0 marks an account as dirty every time the object is created: https://github.com/ConsenSys/quorum/blob/99a83767ccf0384a3b58d9caffafabb5b49bd73c/core/state/statedb.go#L407-L408
Quorum version 2.7+ (and at least up to version 21.10.2) marks an account as dirty only if it was NOT deleted in the same block:
It still has the comment "newobj.setNonce(0) // sets the object to dirty", but that function doesn't mark the object as dirty anymore; instead that happens in journal.append, and only on creation of an object, not on resetting:
https://github.com/ConsenSys/quorum/blob/cd11c38e7bc0345a70ef85a8b085e7755bb0ee78/core/state/statedb.go#L695-L701
In our case the bug manifested due to multiple ecrecover calls in the same block. After the first call, the 0x1 account is added to the state and then removed after the transaction as empty (same behavior on both nodes). After the second call, the 0x1 account is added to the state again; the old node marks it as dirty and removes it after the tx, while the newer node does not mark it as dirty and leaves it in the state, which results in different final states.
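To illustrate the difference, here is a self-contained toy model of the journal behaviour described above (not the real geth types, just the shape of the logic): only the "create" entry reports a dirtied address, so an account re-created later in the same block is never marked dirty again and therefore never swept as empty.
package main

import "fmt"

// journalEntry mirrors the idea that each state change reports which address it dirties.
type journalEntry interface{ dirtied() *string }

// createObjectChange is appended when an account is created for the first time in a block.
type createObjectChange struct{ account *string }

func (ch createObjectChange) dirtied() *string { return ch.account } // marks the account dirty

// resetObjectChange is appended when an account that already existed in this block is re-created.
type resetObjectChange struct{ prev *string }

func (ch resetObjectChange) dirtied() *string { return nil } // does NOT mark the account dirty

type journal struct{ dirties map[string]int }

func (j *journal) append(entry journalEntry) {
	if addr := entry.dirtied(); addr != nil {
		j.dirties[*addr]++
	}
}

func main() {
	j := &journal{dirties: map[string]int{}}
	ecrecover := "0x0000000000000000000000000000000000000001"

	// first ecrecover call in the block: account 0x1 did not exist yet -> create -> dirty
	j.append(createObjectChange{account: &ecrecover})
	// it is swept as empty at the end of the tx; a second ecrecover call then
	// re-creates it -> reset -> NOT dirty, so it stays in the state on newer nodes
	j.append(resetObjectChange{prev: &ecrecover})

	fmt.Println(j.dirties) // only the first creation was recorded as dirty
}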
And if someone stumbles here looking for a fix, here it is: https://github.com/ConsenSys/quorum/compare/master...Ambisafe:quorum:21.10.2-fix
@lastperson good spot - do you want to submit a pull request?
@antonydenyer I'm now checking whether the latest node will sync with the fixed version, or whether it has the same issue. If it has the same issue, then I guess this fix could only be introduced as a fork configuration.
I'm testing upgrading the Alastria Quorum (v1.8.18) to newer versions (v2.7.0 and v20.10.0).
But the chain synchronization fails in full mode. Fast mode finishes correctly.
We use the well-known genesis file for Alastria nodes:
https://github.com/alastria/alastria-node/blob/testnet2/data/genesis.json
And the command line looks like:
But it can't get past block 8597100. It happens in both version upgrades we are testing:
The log is almost the same in both versions:
This problem does not happen with the current stable Alastria version: the full synchronization finishes correctly:
It is necessary to be able to recreate the chain in full mode before upgrading the network clients.
Full log of the failed synchronization, up to the "BAD BLOCK" message:
FULL LOG log.err.txt.gz
Related links: