Consensys / quorum

A permissioned implementation of Ethereum supporting data privacy
https://www.goquorum.com/
GNU Lesser General Public License v3.0

Alastria Network · Full sync failed · GoQuorum v2.6.0 (and following) #1107

Closed · alejandroalffer closed this issue 2 years ago

alejandroalffer commented 3 years ago

I'm testing upgrading the Alastria Quorum nodes (v1.8.18) to newer versions (v2.7.0 and v20.10.0).

But the chain synchronization fails in full mode. Fast mode finishes correctly.

We use the well-known genesis file for the Alastria nodes:

https://github.com/alastria/alastria-node/blob/testnet2/data/genesis.json

And the command line looks like:

geth --datadir /root/alastria/data --networkid 83584648538 --identity VAL_DigitelTS_T_2_8_01 --permissioned --cache 10 --rpc --rpcaddr 127.0.0.0 --rpcapi admin,db,eth,debug,miner,net,shh,txpool,personal,web3,quorum,istanbul --rpcport 22000 --port 21000 --istanbul.requesttimeout 10000 --ethstats VAL_DigitelTS_T_2_8_01:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --verbosity 3 --emitcheckpoints --targetgaslimit 8000000 --syncmode full --gcmode full --vmodule consensus/istanbul/core/core.go=5 --debug --vmdebug --nodiscover --mine --minerthreads 2

But it can't get past block 8597100. This happens with both of the version upgrades we are testing:

> admin.nodeInfo.name
"Geth/VAL_DigitelTS_T_2_8_01/v1.9.7-stable-6005360c(quorum-v2.7.0)/linux-amd64/go1.15.2"
> admin.nodeInfo.name
"Geth/VAL_DigitelTS_T_2_8_01/v1.9.7-stable-af752518(quorum-v20.10.0)/linux-amd64/go1.15.2"

The log is almost the same in both versions:

Number: 8597101
Hash: 0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5
         0: cumulative: 48864 gas: 48864 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0x5136041eb879d49699e76bf64aed8207376cd0d1f42aa20d80613bad309bece4 logs: [0xc0004bc790 0xc0004bc840]
bloom: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000000100000000000000000000000040000000000000000000000000000000000000000000000000
0000000000100000000000000000000000000000000000000002000000000000000200000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000400000000000000000000000000000 state:
         1: cumulative: 97728 gas: 48864 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0xb0e8e529893614560fcd421310d68cd03794fe8a22e36d5140ba6cde5b4300af logs: [0xc0004bc8f0 0xc0004bc9a0]
bloom: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000000100000000000000000000000040000000000000000000000000000000000000000000000000
00000000001000000000000000000000000000000000000000020000000000000002000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000100000000000000000000000000000000400000000000000000000000000000 state:
Error: invalid merkle root (remote: 0f6d6606b447b6fd26392f999e84be08fdf8b71f956b83116017dbb371ea1f1a local: 8a6cab008e2572a774a3c1eadc36269fa65662471c088652853db94e38ff8e59)
##############################
WARN [11-09|15:18:19.392|eth/downloader/downloader.go:336] Synchronisation failed, dropping peer    peer=d94e041d29046c47 err="retrieved hash chain is invalid"
ERROR[11-09|15:18:21.002|core/blockchain.go:2214]
########## BAD BLOCK #########
Chain config: {ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: <nil> TransactionSizeLimit: 0 MaxCodeSize: 24 Petersburg:
 <nil> Istanbul: <nil> Engine: istanbul}

This problem does not happen with the current stable Alastria version, where the full synchronization finishes correctly:

Geth/VAL_DigitelTS_T_2_8_00/v1.8.18-stable(quorum-v2.2.2)/linux-amd64/go1.9.5

It is necessary to be able to recreate the chain in full mode before we can upgrade the network clients.

Full log of the failed synchronization, up to the "BAD BLOCK" message:

FULL LOG log.err.txt.gz

rwxrwxrwx 1 ladmin ladmin 44902 Dec 15 13:11 log.err.txt.gz


Related links:

nmvalera commented 3 years ago

@alejandroalffer I see references to go1.15.2, which makes me presume that you have been compiling from source. Could you please instead use either

Then retry and let us know if the issue persists.

Thanks

alejandroalffer commented 3 years ago

Thanks for the feedback, @nmvalera!

In fact, the current geth version is compiled from source, in a Dockerized ubuntu:18.04:

root@e8cdc103b174:~# geth  version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.15.2
Operating System: linux
GOPATH=
GOROOT=/usr/local/go
root@e8cdc103b174:~# ldd  /usr/local/bin/geth
        linux-vdso.so.1 (0x00007ffebaad8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb5fc23b000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb5fc033000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb5fbc95000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb5fb8a4000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb5fc45a000)

Downloading geth as proposed:

root@e8cdc103b174:~# md5sum /usr/local/bin/geth-from-sources /usr/local/bin/geth
40c6fe1443d4294824de5ff4f58ce855  /usr/local/bin/geth-from-sources
b68db91b96b1808daa24bb27b000aeb4  /usr/local/bin/geth

root@e8cdc103b174:~# /usr/local/bin/geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.13.15
Operating System: linux
GOPATH=
GOROOT=/opt/hostedtoolcache/go/1.13.15/x64
root@e8cdc103b174:~# ldd /usr/local/bin/geth
        linux-vdso.so.1 (0x00007fff4aff8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb2d68ec000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb2d66e4000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb2d6346000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb2d5f55000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb2d6b0b000)

It produces the same issue:

WARN [01-12|15:36:51.451|eth/downloader/downloader.go:336]       Synchronisation failed, dropping peer    peer=ae385305ccad4d03 err="retrieved hash chain is invalid"
ERROR[01-12|15:36:52.495|core/blockchain.go:2214]
########## BAD BLOCK #########
Chain config: {ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: <nil> TransactionSizeLimit: 64 MaxCodeSize: 24 Petersburg: <nil> Istanbul: <nil> PrivacyEnhancements: <nil> Engine: istanbul}

Number: 8597101
Hash: 0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5
         0: cumulative: 48864 gas: 48864 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0x5136041eb879d49699e76bf64aed8207376cd0d1f42aa20d80613bad309bece4 logs: [0xc0003e13f0 0xc0003e14a0] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000000001000000000000000000000000400000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000002000000000000000200000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000400000000000000000000000000000 state:
         1: cumulative: 97728 gas: 48864 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0xb0e8e529893614560fcd421310d68cd03794fe8a22e36d5140ba6cde5b4300af logs: [0xc0003e1550 0xc0003e1600] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000000001000000000000000000000000400000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000002000000000000000200000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000400000000000000000000000000000 state:

Error: invalid merkle root (remote: 0f6d6606b447b6fd26392f999e84be08fdf8b71f956b83116017dbb371ea1f1a local: 8a6cab008e2572a774a3c1eadc36269fa65662471c088652853db94e38ff8e59)
##############################

WARN [01-12|15:36:52.522|eth/downloader/downloader.go:336]       Synchronisation failed, dropping peer    peer=e01dc34eba4860ea err="retrieved hash chain is invalid"
^CINFO [01-12|15:36:53.444|cmd/utils/cmd.go:75]                    Got interrupt, shutting down...
INFO [01-12|15:36:53.444|node/node.go:443]                       http endpoint closed                     url=http://127.0.0.0:22000
INFO [01-12|15:36:53.445|node/node.go:373]                       IPC endpoint closed                      url=/root/alastria/data/geth.ipc
INFO [01-12|15:36:53.445|core/blockchain.go:888]                 Blockchain manager stopped
INFO [01-12|15:36:53.445|eth/handler.go:291]                     Stopping Ethereum protocol
INFO [01-12|15:36:53.446|eth/handler.go:314]                     Ethereum protocol stopped
INFO [01-12|15:36:53.446|core/tx_pool.go:408]                    Transaction pool stopped
INFO [01-12|15:36:53.446|ethstats/ethstats.go:131]               Stats daemon stopped

I'll try:

cmoralesdiego commented 3 years ago

@nmvalera, do you think that we have the same problem with the chain as we showed you in this issue: https://github.com/ConsenSys/quorum/issues/1108?

alejandroalffer commented 3 years ago

Hi!

The problem persists while full syncing using the provided binary :-( . Using fast mode, everything finishes correctly.

export PRIVATE_CONFIG=ignore

geth --datadir /root/alastria/data --networkid 83584648538 --identity VAL_DigitelTS_T_2_8_01 --permissioned --cache 4096 --port 21000 --istanbul.requesttimeout 10000 --ethstats VAL_DigitelTS_T_2_8_01:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --verbosity 3 --emitcheckpoints --targetgaslimit 8000000 --syncmode full --vmodule consensus/istanbul/core/core.go=5 --debug --vmdebug --nodiscover --mine --minerthreads 2

This was a fresh database, after "geth removedb --datadir /root/alastria/data_DONOTCOPYPASTER" and "geth --datadir /root/alastria/data init /root/genesis.json", and restoring the original enode key.

root@e8cdc103b174:~# cat /root/genesis.json #the standard one in Alastria
{
  "alloc": {
    "0x58b8527743f89389b754c63489262fdfc9ba9db6": {
      "balance": "1000000000000000000000000000"
    }
  },
  "coinbase": "0x0000000000000000000000000000000000000000",
  "config": {
    "chainId": 83584648538,
    "byzantiumBlock": 0,
    "homesteadBlock": 0,
    "eip150Block": 0,
    "eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
    "eip155Block": 0,
    "eip158Block": 0,
    "istanbul": {
      "epoch": 30000,
      "policy": 0
    },
    "isQuorum": true
  },
  "extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f85ad594b87dc349944cc47474775dde627a8a171fc94532b8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
  "gasLimit": "0x2FEFD800",
  "difficulty": "0x1",
  "mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
  "nonce": "0x0",
  "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
  "timestamp": "0x00"
}

The binary file you provided:

root@e8cdc103b174:~# /usr/local/bin/geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.13.15
Operating System: linux
GOPATH=
GOROOT=/opt/hostedtoolcache/go/1.13.15/x64
root@e8cdc103b174:~# md5sum /usr/local/bin/geth
b68db91b96b1808daa24bb27b000aeb4  /usr/local/bin/geth

The binary I compiled myself:

root@e8cdc103b174:~# /usr/local/bin/geth-from-sources version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.15.2
Operating System: linux
GOPATH=
GOROOT=/usr/local/go

Could any of these files help with finding a solution?

[...]
-rw-r--r-- 1 root root   2146445 Jan 14 07:31 010075.ldb
-rw-r--r-- 1 root root 180033388 Jan 14 07:39 010077.ldb
-rw-r--r-- 1 root root   1663848 Jan 14 07:39 MANIFEST-000004
-rw-r--r-- 1 root root 136328749 Jan 14 08:41 010076.log
-rw-r--r-- 1 root root    555890 Jan 14 08:41 LOG
[...]

In order to run the same test using the Docker image provided by Quorum... could I have access to the original Dockerfile used for https://hub.docker.com/r/quorumengineering/quorum?

alejandroalffer commented 3 years ago

@nmvalera , do you think that we have the same problem with the chain as we have showed you on the issue: #1108?

I'm pretty sure the problem is different: this one is about syncing a new node in full mode, and the @carlosho17 issue is related to the new storage model for the chain database.

alejandroalffer commented 3 years ago

Better debug output attached:

root@e8cdc103b174:~# geth version
Geth
Version: 1.9.7-stable
Git Commit: af7525189f2cee801ef6673d438b8577c8c5aa34
Quorum Version: 20.10.0
Architecture: amd64
Protocol Versions: [64 63]
Network Id: 1337
Go Version: go1.13.15
Operating System: linux
GOPATH=
GOROOT=/opt/hostedtoolcache/go/1.13.15/x64

Geth arguments:

geth --datadir /root/alastria/data --networkid 83584648538 --identity VAL_DigitelTS_T_2_8_01 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats VAL_DigitelTS_T_2_8_01:_DONOT_SHOW@_DONOT_SHOW:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password NO_CLEAN --metrics.influxdb.tags host=VAL_DigitelTS_T_2_8_01 --verbosity 5 --cache 10 --nousb --maxpeers 200 --nousb

Error log after a fresh chaindb install: err.full.gz

SatpalSandhu61 commented 3 years ago

Looking at the logs I notice that you haven't cleared the freezer db:

INFO [01-16|08:48:01.963] Opened ancient database                  database=/root/alastria/data/geth/chaindata/ancient
DEBUG[01-16|08:48:01.964] Ancient blocks frozen already            number=8597100 hash=e4d6ea…6ca9ca frozen=5434860

So you're getting the BAD BLOCK on the first block your node is trying to download during the sync (block 8597101). It may be worthwhile performing a freezer delete in addition to the chaindb removal, so that you start with a completely clean node.
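
A minimal sketch of what a completely clean restart could look like, using the datadir from the logs above (these exact commands are an illustration, not an official procedure):

# With the node stopped, remove the LevelDB chaindata together with the freezer,
# which lives inside it at .../chaindata/ancient (see the "Opened ancient database" log line).
rm -rf /root/alastria/data/geth/chaindata
# Re-initialise from the genesis file before restarting the full sync.
geth --datadir /root/alastria/data init /root/genesis.json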

alejandroalffer commented 3 years ago

Thank you for the answer, @SatpalSandhu61. The problem persists after a clean-up of the chaindb:

[image]

The log starts at the wrong block number because geth was restarted, just to keep the log smaller.

SatpalSandhu61 commented 3 years ago

I believe you may not have fully understood my comment regarding clearing the freezer db. Please read the section on the freezer, which was introduced with the merge from v1.9.7 upstream geth: https://blog.ethereum.org/2019/07/10/geth-v1-9-0/

carlosho17 commented 3 years ago

Hi

just to recap on this issue for the Alastria Quorum Network.

We all stumble upon a certain block when using geth 1.9.7 that yields this message

DEBUG[01-15|11:28:01.360] Downloaded item processing failed number=8597101 hash=e4a2d7…49b2e5 err="invalid merkle root (remote: 0f6d6606b447b6fd26392f999e84be08fdf8b71f956b83116017dbb371ea1f1a local: 8a6cab008e2572a774a3c1eadc36269fa65662471c088652853db94e38ff8e59)"

We have spent the last few weeks trying all scenarios (fast and full sync, erasing the whole data directory and reinitializing with geth init while preserving the nodekey, fresh new installations, different Ubuntu versions, the Quorum tgz package, in-place compilation with Go 1.15 and 1.13, etc.). These tests have been performed not only by us, the Core Team, but also by regular members.

The result is always the same: It is block 8597101 where newer quorum finds a bad merkle root and stops syncing.

Our workaround is: install an older version, let it sync past block 8597101, and then switch to Quorum 20.10.x. There is a second workaround, which is to start fresh but with a copy of the chain that is already past the bad block.

What we would like to know is whether the detection of a bad merkle root by Quorum 20.10.x is a feature or a bug.

Thank you

alejandroalffer commented 3 years ago

The problem still persists in the new version 21.1.0: the sync process stops forever at block 8597100 when using full mode.

We are using a new database, starting the sync from scratch. The problem repeats in every case:

$ export PRIVATE_CONFIG=ignore
$ geth --datadir /root/alastria/data --networkid 83584648538 --identity BOT_DigitelTS_T_2_8_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats BOT_DigitelTS_T_2_8_00:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password ala0str1AX1 --metrics.influxdb.tags host=BOT_DigitelTS_T_2_8_00 --verbosity 5 --cache 8192 --nousb --maxpeers 256
instance: Geth/VAL_DigitelTS_T_2_8_01/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5

> eth.syncing
{
  currentBlock: **8597100**,
  highestBlock: 61148125,
  knownStates: 0,
  pulledStates: 0,
  startingBlock: 0
}

I have created a new log file with the last lines: they are repeated forever.

In order to make progress on this problem, we could allow an enode address into the permissioned network for developers to do their own testing.

The Alastria ecosystem, with more than 120 nodes, is waiting on this issue to proceed with the version migration.

Last few lines from linux console: sync-fails.txt

alejandroalffer commented 3 years ago

Full trace from start of the synchronization:

geth --datadir /root/alastria/data --networkid 83584648538 --identity BOT_DigitelTS_T_2_8_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --port 21000 --ethstats BOT_DigitelTS_T_2_8_00:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --targetgaslimit 8000000 --syncmode full --nodiscover --metrics --metrics.expensive --pprof --pprofaddr 0.0.0.0 --pprofport 9545 --metrics.influxdb --metrics.influxdb.endpoint http://geth-metrics.planisys.net:8086 --metrics.influxdb.database alastria --metrics.influxdb.username alastriausr --metrics.influxdb.password ala0str1AX1 --metrics.influxdb.tags host=BOT_DigitelTS_T_2_8_00 --verbosity 5 --cache 8192 --nousb --maxpeers 256 --vmdebug 2> /root/alastria/data/full_sync

https://drive.google.com/file/d/1rx7bzJdygwomRBMfRn3Bftczf6nwuAeJ/view?usp=sharing

SatpalSandhu61 commented 3 years ago

Hi, you stated "We are using a new database, starting the sync from scratch.". However, as per my earlier response, please confirm that in addition to removing the chaindb you are also deleting the freezer db. The freezer db is not deleted when you perform a geth init. I suggest you read the section "Freezer tricks" in the geth 1.9 release notes.

As mentioned earlier in the thread, the invalid merkle root error usually occurs if there is a db corruption or inconsistency. This is an issue in upstream geth; here are a few examples of issues raised for it:

alejandroalffer commented 3 years ago

Hi @SatpalSandhu61, thanks for the feedback,

I promise the directory was empty. However, I have repeated the process on a newly created path, and the problem repeats: full sync mode hangs at block 8597100.

I have looked at the linked issues, and it seems this is related to a problem in some versions of official geth that was solved as of version 1.9.23-stable. However, GoQuorum 21.1.0 is based on v1.9.7, a long way from the version that could fix the problem.

One last consideration: this is a permanent error, and always reproducible. Alastria has more than 100 active nodes, and the migration process to GoQuorum 20.xx / 21.xx is pending the results of these tests: any help will be appreciated.

{
  admin: {
    datadir: "/home/alastria/data-full",
    nodeInfo: {
      enode: "enode://beabec74344fc143c9585017c940a94f0b7915024de2d632222e0ef58a1e6c9b3520d2d3e1ada304ef5b1652ba679f2f9686190f83d89d5f81410d0a9680881e@46.27.166.130:21000?discport=0",
      enr: "enr:-JC4QHN8R874S81ttpNdPBLM72SF4M0vgyBnSmyhfB9fBcKKXVH9EEfCYGD8-HFY1HTuy0QLzSNL2c7rzCq-a4PHKvgGg2V0aMfGhEXl0IiAgmlkgnY0gmlwhC4bpoKJc2VjcDI1NmsxoQK-q-x0NE_BQ8lYUBfJQKlPC3kVAk3i1jIiLg71ih5sm4N0Y3CCUgg",
      id: "3713f5a6c14042c2483ede889f88e36ce70b870ada6087f45b41976527128e62",
      ip: "46.X.Y.Z",
      listenAddr: "[::]:21000",
      name: "Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5",
      plugins: {},
      ports: {
        discovery: 0,
        listener: 21000
      },
      protocols: {
        istanbul: {...}
      }
    },
    peers: [],
[...]
  eth: {
    accounts: [],
    blockNumber: 8597100,
    coinbase: "0x9f88e36ce70b870ada6087f45b41976527128e62",
    compile: {
      lll: function(),
      serpent: function(),
      solidity: function()
    },
    defaultAccount: undefined,
    defaultBlock: "latest",
    gasPrice: 0,
    hashrate: 0,
    mining: false,
    pendingTransactions: [],
    protocolVersion: "0x63",
    syncing: {
      currentBlock: 8597100,
      highestBlock: 61898986,
      knownStates: 0,
      pulledStates: 0,
      startingBlock: 8597102
    },
    call: function(),
[...]
  version: {
    api: "0.20.1",
    ethereum: "0x63",
    network: "83584648538",
    node: "Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5",
    whisper: undefined,
    getEthereum: function(callback),
    getNetwork: function(callback),
    getNode: function(callback),
    getWhisper: function(callback)
  },

Is there any way to get more information on the faulty block via the debug.traceBlock* commands?

nmvalera commented 3 years ago

@alejandroalffer

  1. Would it be possible to share some history of the network migrations with

    • GoQuorum version the network was running (including commit hashes)
    • different upgrades operations the network went through
    • etc.
  2. Could you test to full-sync a node from scratch using lower GoQuorum versions (using official binaries) and let us know what is the highest GoQuorum version that passes block 8597100. I would recommend starting with GoQuorum v2.5.0 (which is the latest GoQuorum version based on Geth 1.8.18)

Thanks a lot.

cmoralesdiego commented 3 years ago

Thanks @nmvalera, we will start sharing what you asked for next week.

alejandroalffer commented 3 years ago

Hi @nmvalera ,

thanks for the feedback.

  1. Most of the nodes in Alastria Network are running:
Geth/v1.8.18-stable(quorum-v2.2.3-0.Alastria_EthNetstats_IBFT)/linux-amd64/go1.9.5

So far there have been no updates since that version; only a few nodes are still on version 1.8.2. In fact, we are working on a renewal of the network, a fundamental part of which is taking advantage of the new versions of GoQuorum: bug fixes, monitoring, ...

  2. I've tested the binary versions listed in https://github.com/ConsenSys/quorum/releases, and the "full sync" starts failing with version v2.6.0:

v2.3.0

> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13"
> (finish ok)

v2.4.0

> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-20c95e5d(quorum-v2.4.0)/linux-amd64/go1.11.13"
> (finish ok)

v2.5.0

> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-685f59fb(quorum-v2.5.0)/linux-amd64/go1.11.13"
> (finish ok)

v2.6.0

> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-9339be03(quorum-v2.6.0)/linux-amd64/go1.13.10"
> (STOP SYNCING)
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL)

v20.10.0 ADDED

> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-af752518(quorum-v20.10.0)/linux-amd64/go1.13.15"
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL)

v21.1.0 ADDED

> admin.nodeInfo.name
"Geth/REG_DigitelTS-labs_2_2_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5"
> (STOP SYNCING)
> eth.getBlock(eth.defaultBlock).number
8597100
> (FAIL)

All the tests are made under this environment:

root@alastria-01:~# ldd /usr/local/bin/geth
        linux-vdso.so.1 (0x00007ffeb65e7000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb6c3f64000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb6c3f59000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb6c3e0a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb6c3c18000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb6c3f8f000)

root@alastria-01:~# uname -a
Linux alastria-01 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@alastria-01:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"

This is a permanent error, always reproducible for every new node of the Alastria GoQuorum network.

Thanks to the entire GoQuorum team at ConsenSys for the help supporting the Alastria-T network.

nmvalera commented 3 years ago

Thanks, this helps a lot.

Can you do the following:

In the meantime, we will dig into the error.

alejandroalffer commented 3 years ago

Thanks @nmvalera!

Yes! We have already tested the proposed workaround: once the database is fully synchronized (with a version prior to v2.6.0), the binary can be upgraded without problems (just minor changes to the metrics arguments). It also works for a direct upgrade from v2.5.0 to v21.1.0.

Please keep up the effort in searching for a solution: we want new Alastria partners to be able to perform a direct synchronization of their nodes in full mode, using the latest versions of GoQuorum, to maintain the "trust" of the network.

nmvalera commented 3 years ago

Thanks, we are discussing this internally and we will keep you updated (we may require some more information from you at some point, we'll let you know).

chris-j-h commented 3 years ago

@alejandroalffer please also review the migration docs for upgrading from earlier versions of Quorum to 2.6.0 and above. A bad block can sometimes be caused by not setting istanbulBlock and petersburgBlock in the genesis.json, so it will be good to eliminate that as a possibility.

(EDIT) To summarise, please try a full sync with istanbulBlock and petersburgBlock in genesis.json so we can eliminate the possibility that this is the cause of the bad block. For now you can set them to some arbitrary block very far in the future. The values can be updated later when you have an idea of when the network will be ready to move to these forks.

nmvalera commented 3 years ago

@alejandroalffer Could you please confirm that the version that you are looking to migrate from

Geth/v1.8.18-stable(quorum-v2.2.3-0.Alastria_EthNetstats_IBFT)/linux-amd64/go1.9.5

is not an official GoQuorum version but, I imagine, Alastria's own custom fork?

Thanks.

nmvalera commented 3 years ago

@alejandroalffer @cmoralesdiego

Any news on the 2 topics above?

Thanks a lot.

cmoralesdiego commented 3 years ago

Hi @nmvalera, we are going to give you feedback early next week. Thanks in advance.

alejandroalffer commented 3 years ago

Hi @nmvalera , @cmoralesdiego

Sorry for the delay. I've tried restarting the synchronization in full mode using different values for the istanbulBlock parameter, but always with the same result: the process stops at the block from hell ;-)

root@alastria-01:/home/iadmin# diff /root/genesis.json-original /root/genesis.json
18c18,20
<       "policy": 0
---
>       "policy": 0,
>       "petersburgBlock": 10000000,
>       "istanbulBlock": 10000000

I've tried several values... from setting it to 0 up to values beyond block 8597101, with the same result.

The logs show hundreds of messages like VM returned with error err="evm: execution reverted" prior to the failure.

log.v21.1.0.txt.gz

root@alastria-01:~# md5sum /tmp/log.v21.1.0.txt.gz
e10f9eb8bfd584deaad2267f9c6da791  /tmp/log.v21.1.0.txt.gz

On the other hand, there was a fork for the Alastria network, with minor updates to improve reporting in EthNetStats, but later versions, based on the same version of geth and on newer releases of GoQuorum, finish the synchronization in full mode without problems:

Geth v1.8.18 · GoQuorum v2.2.3 - Alastria version, finish
Geth v1.8.18 · GoQuorum v2.4.0 - Official version, finish
Geth v1.8.18 · GoQuorum v2.5.0 - Official version, finish
Geth v1.9.7 · GoQuorum v2.6.0 - Official version, fails
Geth v1.9.7 · GoQuorum v20.10.0 - Official version, fails
Geth v1.9.7 · GoQuorum v21.1.0 - Official version, fails

IMHO, the problem appears in the upgrade from Geth 1.8.18 to Geth 1.9.7.

Best regards!

chris-j-h commented 3 years ago

@alejandroalffer from your log I see

INFO [02-22|10:56:30.757] Initialised chain configuration          config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: <nil> TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: <nil> Istanbul: <nil> PrivacyEnhancements: <nil> Engine: istanbul}"
  1. Petersburg: <nil> Istanbul: <nil> suggests your updated genesis is not being used. Please make sure you run geth init /path/to/updated/genesis.json to apply the genesis updates before attempting a resync.

As a comparison, I see the following in my logs when starting a node with these values set in my genesis.json:

INFO [02-23|10:49:35.554] Initialised chain configuration          config="{ChainID: 720 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 100 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 100 Istanbul: 100 PrivacyEnhancements: <nil> Engine: istanbul}"
  2. In addition to setting istanbulBlock and petersburgBlock, you may want to try also setting constantinopleBlock.

  3. To be clear, the values you set for these fork blocks should be a future block; otherwise you will be processing old transactions with the new protocol features these settings enable.

(EDIT: 24 Feb) See the sample genesis in quorum-examples for an example of how to configure these fork blocks.

alejandroalffer commented 3 years ago

Thanks for the feedback, @chris-j-h:

You were right: I had been putting the istanbulBlock and petersburgBlock parameters inside the istanbul {} object. I've repeated the test with them in the config {} object, adding constantinopleBlock as suggested, with different values... the last one far beyond the end of the chain, with the same result: the sync process fails in full mode:

Some logs...

INFO [02-24|23:24:05.143] Initialised chain configuration          config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 100000000 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 100000000 Istanbul: 100000000 PrivacyEnhancements: <nil> Engine: istanbul}"

The Alastria network is at block ~63,000,000, and I've used 100,000,000 as the argument:

> eth.blockNumber
63133729
root@alastria-01:~# cat genesis.json
{
  "alloc": {
    "0x58b8527743f89389b754c63489262fdfc9ba9db6": {
      "balance": "1000000000000000000000000000"
    }
  },
  "coinbase": "0x0000000000000000000000000000000000000000",
  "config": {
    "chainId": 83584648538,
    "byzantiumBlock": 0,
    "homesteadBlock": 0,
    "eip150Block": 0,
    "eip150Hash": "0x0000000000000000000000000000000000000000000000000000000000000000",
    "eip155Block": 0,
    "eip158Block": 0,
    "istanbulBlock":       100000000 ,
    "petersburgBlock":     100000000,
    "constantinopleBlock": 100000000,
    "istanbul": {
      "epoch": 30000,
      "policy": 0,
      "petersburgBlock": 0,
      "istanbulBlock": 0
    },
    "isQuorum": true
  },
  "extraData": "0x0000000000000000000000000000000000000000000000000000000000000000f85ad594b87dc349944cc47474775dde627a8a171fc94532b8410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000c0",
  "gasLimit": "0x2FEFD800",
  "difficulty": "0x1",
  "mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
  "nonce": "0x0",
  "parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
  "timestamp": "0x00"
}

The start script:

VER="v21.1.0"
export PRIVATE_CONFIG="ignore"
/usr/local/bin/geth --datadir /home/alastria/data-${VER} --networkid 83584648538 --identity REG_DigitelTS-labs_2_2_00 --permissioned --port 21000 --istanbul.requesttimeout 10000 --ethstats REG_DigitelTS-labs_2_2_00:bb98a0b6442386d0cdf8a31b267892c1@netstats.telsius.alastria.io:80 --verbosity 3 --vmdebug --emitcheckpoints --targetgaslimit 8000000 --syncmode full --gcmode full --vmodule consensus/istanbul/core/core.go=5 --nodiscover --cache 4096 2> /tmp/log.${VER}

And the result:

pi@deckard:~ $ md5sum log.v21.1.0.gz
8a5d2b1355b3e0c0690e2aafa263781f  log.v21.1.0.gz
[log.v21.1.0.gz](https://github.com/ConsenSys/quorum/files/6040662/log.v21.1.0.gz)

There's another point, and maybe it's not relevant: using values of 0, the failure happens at an earlier block, 48704, with a different error:

[...]
INFO [02-25|05:33:42.297] Initialised chain configuration          config="{ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 0 TransactionSizeLimit: 64 MaxCodeSize: 0 Petersburg: 0 Istanbul: 0 PrivacyEnhancements: <nil> Engine: istanbul}"
[...]
INFO [02-25|05:34:22.010] Imported new chain segment               blocks=2048 txs=0 mgas=0.000 elapsed=1.485s    mgasps=0.000 number=47680 hash=5c79ea…3aae91 age=2y1mo6d   dirty=0.00B
ERROR[02-25|05:34:22.736]
########## BAD BLOCK #########
Chain config: {ChainID: 83584648538 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: 0 TransactionSizeLimit: 64 MaxCodeSize: 24 Petersburg: 0 Istanbul: 0 PrivacyEnhancements: <nil> Engine: istanbul}

Number: 48704
Hash: 0x9f7f3734ad532365a2f2e10fe8f9c308d0d45ac1e018742a676fd20ce6a5f75b
         0: cumulative: 22032 gas: 22032 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0xf02ea502f6c171789bfcb686e468ad2adde0a710e66ce41155c25af30c9ac633 logs: [] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 state:
         1: cumulative: 44064 gas: 22032 contract: 0x0000000000000000000000000000000000000000 status: 1 tx: 0x942736bd1648bfce11e578aeff59ee05bc7f2d220dedfda0e97da4d36d1c123e logs: [] bloom: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 state:

Error: invalid gas used (remote: 48432 local: 44064)
##############################

WARN [02-25|05:34:22.740] Synchronisation failed, dropping peer    peer=a2054ebfafb0f0f5 err="retrieved hash chain is invalid"
ERROR[02-25|05:34:30.838]

Thanks again for not giving up!

Best regards!

chris-j-h commented 3 years ago

Making the log from @alejandroalffer's above comment https://github.com/ConsenSys/quorum/issues/1107#issuecomment-785637265 clickable: log.v21.1.0.gz

chris-j-h commented 3 years ago

@alejandroalffer as you noted earlier your logs have a large number of VM returned with error "evm: execution reverted" msgs when doing a full sync.

There are also quite a few other VM returned with error msgs (see full list below)

Do you see any of these when doing a full sync from block 0 with your current Alastria version or pre-v2.6.0 Quorum?

count VM returned with error
11940 "evm: execution reverted"
190 "out of gas"
126 "stack underflow (0 <=> 1)"
23 "evm: max code size exceeded"
21 "contract creation code storage out of gas"
6 "stack underflow (0 <=> 13)"
55 "invalid opcode 0x1b"
21 "invalid opcode 0xfe"
20 "invalid opcode 0x4f"
16 "invalid opcode 0x1c"
9 "invalid opcode 0x27"
7 "invalid opcode 0xef"
6 "invalid opcode 0xa9"
3 "invalid opcode 0xd2"
3 "invalid opcode 0xda"
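
If helpful, counts like these can be produced from a captured full-sync log with a one-liner along the following lines (the gzipped log file name is just a placeholder):

zcat log.full-sync.txt.gz | grep "VM returned with error" | cut -f3- -d" " | sort | uniq -c
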
alejandroalffer commented 3 years ago

Hi!

I've used GoQuorum v2.5.0: the last version in which full synchronization finishes correctly. As you know, it's based on Geth v1.8.18.

The "bad block" is reached and passed, and the VM returned with error log entries seem quite similar in format and in number:

root@alastria-01:/tmp# zcat log.v2.5.0.gz |grep "VM returned with error"|cut -f3- -d" "|sort|uniq -c
     20 VM returned with error                   err="contract creation code storage out of gas"
  22267 VM returned with error                   err="evm: execution reverted"
     21 VM returned with error                   err="evm: max code size exceeded"
    106 VM returned with error                   err="invalid opcode 0x1b"
     27 VM returned with error                   err="invalid opcode 0x1c"
     21 VM returned with error                   err="invalid opcode 0x23"
      7 VM returned with error                   err="invalid opcode 0x27"
     17 VM returned with error                   err="invalid opcode 0x4f"
      4 VM returned with error                   err="invalid opcode 0xa9"
      3 VM returned with error                   err="invalid opcode 0xd2"
      2 VM returned with error                   err="invalid opcode 0xda"
      7 VM returned with error                   err="invalid opcode 0xef"
     17 VM returned with error                   err="invalid opcode 0xfe"
   3823 VM returned with error                   err="out of gas"
    130 VM returned with error                   err="stack underflow (0 <=> 1)"
      6 VM returned with error                   err="stack underflow (0 <=> 13)"
      2 VM returned with error                   err="stack underflow (0 <=> 3)"

The full log is here: log.v2.5.0.gz

root@alastria-01:/tmp# md5sum log.v2.5.0.gz
505f207b66846dc4e20170cd70bd7561  log.v2.5.0.gz

BTW... the process hangs near block 10,000,000 because of invalid gas used. I had used a genesis.json with istanbulBlock, petersburgBlock and constantinopleBlock set to that value, but let's keep the focus on the merkle tree error.

[...]
    "istanbulBlock":       10000000,
    "petersburgBlock":     10000000,
    "constantinopleBlock": 10000000,
[...]

Thanks again!

chris-j-h commented 3 years ago

@alejandroalffer said:

BTW... the process hangs near block 10,000,000 because of invalid gas used. I had used a genesis.json with istanbulBlock, petersburgBlock and constantinopleBlock set to that value, but let's keep the focus on the merkle tree error.

[...]
    "istanbulBlock":       10000000,
    "petersburgBlock":     10000000,
    "constantinopleBlock": 10000000,
[...]

These values should be a future block that hasn't been seen yet. In an earlier comment you said you set the values to 100,000,000. That should fix your problem.

chris-j-h commented 3 years ago

Hi @alejandroalffer

  1. Block 8,597,101 contains 2 txs sent to 0x4F541bab8aD09638D28dAB3b25dafb64830cE96C which both execute method 0xd30528f2 (from the tx input).

    I was unable to get a list of all txs to this contract on your block explorer. Do you know if this is the first block where this method on the contract is executed?

  2. Does your network only use public transactions?

Let’s try and track down exactly where the state is deviating from what is expected:

If you can share each of these outputs we can do some comparisons and see where the state is deviating.
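
As a rough sketch of that kind of comparison (assuming IPC access to each node and the datadir used earlier; the block hash is the bad block reported above):

geth attach /root/alastria/data/geth.ipc --exec "debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs" > trace-old.txt   # on a node that accepted the block (pre-v2.6.0)
geth attach /root/alastria/data/geth.ipc --exec "debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs" > trace-new.txt   # on a node that rejected it (v2.6.0+)
diff trace-old.txt trace-new.txt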

If you run into any problems it may be easier to discuss on the Quorum Slack. Feel free to msg me if needed.

alejandroalffer commented 3 years ago

Hi @chris-j-h , and the rest of GoQuorum team...

Answering the questions, here is a summary:

1) I wasn't able to execute debug.dumpBlock('0x832e6c') as suggested. Access to that command is only allowed over RPC (because of a bug in the console). On both GoQuorum 1.8.18 and GoQuorum 2.5 the process hangs after 5-7 minutes of waiting, despite 24 GB dedicated servers.

2) The transaction input for block 8597101 seems the same in both versions...

=== LOG GoQuorum 2.5
/geth.ipc --exec "admin.nodeInfo.name" > /tmp/log-2.5.0.txt
/geth.ipc --exec " debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs" >> /tmp/log-2.5.0.txt
/geth.ipc --exec " debug.traceBlockByHash('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[1].result.structLogs" >> /tmp/log-2.5.0.txt
=== LOG GoQuorum 21.01
/geth.ipc --exec "admin.nodeInfo.name" > /tmp/log-21.01.txt
/geth.ipc --exec "debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[0].result.structLogs" >> /tmp/log-21.01.txt
/geth.ipc --exec "debug.traceBadBlock('0xe4a2d78d83c995c1f756a7813b07b93c77b975eb5ec0a7ea7d16b6636649b2e5')[1].result.structLogs" >> /tmp/log-21.01.txt
ladmin@DESKTOP-UK0SQ8D:~$ diff log-2*
1c1
< "Geth/REG_DigitelTS-labs_2_2_00/v1.8.18-stable-685f59fb(quorum-v2.5.0)/linux-amd64/go1.11.13"
---
> "Geth/REG_DigitelTS-dev_2_8_00/v1.9.7-stable-a21e1d44(quorum-v21.1.0)/linux-amd64/go1.15.5"

The full log:

fba2df51782905ff1516d85b2ac25ac4 /tmp/log-2.5.0.txt.gz log-2.5.0.txt.gz

37eb330b4468888c117bee742f180051 /tmp/log-21.01.txt.gz log-21.01.txt.gz

3) The two related transactions were obtained from: https://blkexplorer1.telsius.alastria.io/transaction/0x7cf2cb779e66bbcbed4d9bc1c4dc67654fc55fd6904f768be8d9bf6f8bbb81d0 (thanks, @cmoralesdiego). The transaction is executed both before and after the named block.

Keep in touch!

Thanks again!

alejandroalffer commented 3 years ago

Hi!

I'm updating the status of the issue with this week's news, to share it with the Alastria and ConsenSys teams.

We did some searching in the chaindb (thanks for the snippet, @chris-j-h), in order to find out whether the method used in the bad block, 0xd30528f2, is also called in other blocks:

for (i = 1; i < 9999999; i++) {
    hexBlockNumber = "0x" + i.toString(16)
    txs = eth.getBlockByNumber(hexBlockNumber, true).transactions
    for (j = 0; j < txs.length; j++) {
        if (txs[j].input.slice(0, 10) == "0xd30528f2") {
            console.log("tx calling method 0xd30528f2 found in block " + i)
        }
    }
}

The result is that this transaction method appears in a number of other blocks, 657 in total:

tx calling method 0xd30528f2 found in block 7809408
[...]
tx calling method 0xd30528f2 found in block 9231310

It seems that this transaction is not related to the synchronization problem :-(

To summarize, from @chris-j-h:

[...] we’ve done a lot of investigations so far to identify the cause of the diverging state and unfortunately not had much luck. to summarise some of the results:
• dump full state using API to compare: state is too large, results in out-of-memory error and crashes node
• dump just state for the contract called by the transactions in block 8597101: bug in API for pre-2.6.0 quorum, doesn’t cover any contracts that might be called by the initial contract
• transactions to the same contract method have been called previously so there is nothing inherently broken with the contract due to the quorum upgrade [...]

We'll keep investigating the out-of-memory crashes in order to get the results from the RPC API:

debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6c') and debug.dumpAddress('0x4F541bab8aD09638D28dAB3b25dafb64830cE96C', '0x832e6d')
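
Since the interactive console hangs, one option could be to call the same method over HTTP JSON-RPC instead (a sketch assuming the debug API is exposed on the RPC port used earlier in this thread):

# Request the contract's state dump at the block before the bad block ('0x832e6c' = 8597100).
curl -s -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"debug_dumpAddress","params":["0x4F541bab8aD09638D28dAB3b25dafb64830cE96C","0x832e6c"],"id":1}' \
  http://127.0.0.1:22000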

And these references:

Any other suggestions will also be appreciated.

Thanks again @chris-j-h!

antonydenyer commented 2 years ago

I'm assuming this has been fixed now; feel free to re-open if that is not the case.

lastperson commented 2 years ago

That is not the case; I'm facing the same issue in a different Quorum network.

antonydenyer commented 2 years ago

Can you raise a fresh ticket with genesis and param details, along with the exact version it stopped working at?

lastperson commented 2 years ago

Root cause identified (at least one of the consensus issues that cause the invalid merkle [state] root error): Quorum version 2.1.0 marks an account as dirty every time the object is created: https://github.com/ConsenSys/quorum/blob/99a83767ccf0384a3b58d9caffafabb5b49bd73c/core/state/statedb.go#L407-L408

Quorum versions 2.7+ (and at least up to 21.10.2) mark an account as dirty only if it was NOT deleted in the same block. The code still has the comment "newobj.setNonce(0) // sets the object to dirty", but that function doesn't mark the object as dirty anymore; instead that happens in journal.append, but only for the creation of an object, not for a reset: https://github.com/ConsenSys/quorum/blob/cd11c38e7bc0345a70ef85a8b085e7755bb0ee78/core/state/statedb.go#L695-L701

In our case the bug manifested due to multiple ecrecover calls in the same block. After the first call, the 0x1 account is added to the state and then removed after the transaction as empty (same behavior on both nodes). After the second call, the 0x1 account is added to the state again; the old node then marks it as dirty and removes it after the tx, while the newer node does not mark it as dirty and leaves it in the state, which results in different final states.

And if someone stumbles on this looking for a fix, here it is: https://github.com/ConsenSys/quorum/compare/master...Ambisafe:quorum:21.10.2-fix

antonydenyer commented 2 years ago

@lastperson good spot - do you want to submit a pull request?

lastperson commented 2 years ago

@antonydenyer I'm now checking whether the latest node will sync with the fixed version, or whether it has the same issue. If it has the same issue, then I guess this fix could only be introduced as a fork configuration.