ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0
47.32k stars 20.03k forks source link

Private PoA network sync fails when trying to sync from block zero. #27667

Closed jakub-freebit closed 11 months ago

jakub-freebit commented 1 year ago

Please note that I am using a geth fork with a modified Clique consensus, however the code around sync, downloader, etc. is unchanged. The base geth version v1.10.26 I am using is quite old. If required, I will try to rebase to the latest version if this might solve the issue. I will be very grateful for any help or hints on how to solve this issue.

System information

Geth version: Geth/v1.10.26-stable-f05c9b7d/darwin-amd64/go1.20.3 CL client & version: none (pre-merge private PoA network) OS & Version: Linux (Ubuntu 20.04.4 LTS) & macOs Monterey 12.6.4 (21G526)

Background

I am operating three private PoA networks running on a geth fork with a slightly modified Clique consensus.

  1. Local testnet on my Mac (3 nodes)
  2. Testnet (~200 nodes)
  3. Mainnet (~3000 nodes)

Recently I have noticed that new nodes on both Testnet (2) and Mainnet (3) started failing to sync (both full and snap sync) when syncing from block zero. The problem does not occur when trying to sync the blockchain data from the local testnet (1). Syncing up to the head starting from a recent block works OK on all networks. I exported the blockchain data from both Testnet (2) and Mainnet (3), imported them into a new local testnet (2 nodes) on my Mac and was able to reproduce the problem.

Expected behaviour

Fresh sync (both full and snap) completes successfully.

Actual behaviour

Geth fails at the beginning of a fresh sync (both full and snap) with the following errors.

Testnet blockchain data:

(...)
DEBUG[07-07|12:44:54.269] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=193   skip=0   reverse=false
TRACE[07-07|12:44:54.270] Skeleton filling not accepted            peer=ef23378316092d8f from=193
DEBUG[07-07|12:44:54.270] Failed to deliver retrieved headers      peer=ef233783         err="delivery not accepted"
(...)
DEBUG[07-07|12:44:54.373] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=6145  skip=0   reverse=false
TRACE[07-07|12:44:54.373] Skeleton filling not accepted            peer=ef23378316092d8f from=6145
DEBUG[07-07|12:44:54.373] Failed to deliver retrieved headers      peer=ef233783         err="delivery not accepted"
(...)
DEBUG[07-07|12:44:54.560] Skeleton fill failed                     err="no peers available or all tried for download"
DEBUG[07-07|12:44:54.560] Skeleton chain invalid                   peer=ef233783         err="no peers available or all tried for download"
DEBUG[07-07|12:44:54.560] Header download terminated               peer=ef233783
DEBUG[07-07|12:44:54.561] Block body download terminated           err="syncing canceled (requested)"
DEBUG[07-07|12:44:54.561] Receipt download terminated              err="syncing canceled (requested)"
DEBUG[07-07|12:44:54.636] Synchronisation terminated               elapsed=381.552ms
WARN [07-07|12:44:54.636] Synchronisation failed, dropping peer    peer=ef23378316092d8f85540b7b17a32d5fa791018ba748ea9e663bab26b2496822 err="retrieved hash chain is invalid: no peers available or all tried for download"
(...)

Mainnet blockchain data:

(...)
DEBUG[07-07|12:54:16.867] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=193     skip=0   reverse=false
TRACE[07-07|12:54:16.867] Skeleton filling not accepted            peer=ef23378316092d8f from=193
DEBUG[07-07|12:54:16.867] Failed to deliver retrieved headers      peer=ef233783         err="delivery not accepted"
(...)
DEBUG[07-07|12:54:17.123] Skeleton fill failed                     err="no peers available or all tried for download"
DEBUG[07-07|12:54:17.123] Skeleton chain invalid                   peer=ef233783         err="no peers available or all tried for download"
DEBUG[07-07|12:54:17.123] Header download terminated               peer=ef233783
DEBUG[07-07|12:54:17.123] Block body download terminated           err="syncing canceled (requested)"
DEBUG[07-07|12:54:17.123] Receipt download terminated              err="syncing canceled (requested)"
DEBUG[07-07|12:54:17.175] Synchronisation terminated               elapsed=321.873ms
WARN [07-07|12:54:17.175] Synchronisation failed, dropping peer    peer=ef23378316092d8f85540b7b17a32d5fa791018ba748ea9e663bab26b2496822 err="retrieved hash chain is invalid: no peers available or all tried for download"
(...)

Local testnet blockchain data syncs successfully:

(...)
DEBUG[07-07|14:24:18.635] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=193     skip=0   reverse=false
TRACE[07-07|14:24:18.637] Pre-scheduled new headers                peer=ef23378316092d8f count=192 from=193
TRACE[07-07|14:24:18.637] Delivered new batch of headers           peer=ef233783         count=192 accepted=192
(...)
DEBUG[07-07|14:24:18.685] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=6145    skip=0   reverse=false
TRACE[07-07|14:24:18.687] Pre-scheduled new headers                peer=ef23378316092d8f count=192 from=6145
TRACE[07-07|14:24:18.687] Delivered new batch of headers           peer=ef233783         count=192 accepted=192
(...)
DEBUG[07-07|14:28:22.399] Inserting downloaded chain               items=2    firstnum=470,749 firsthash=d9cd1c..128ee4 lastnum=470,750 lasthash=639be6..56b593
DEBUG[07-07|14:28:22.399] No more headers available                peer=ef233783
DEBUG[07-07|14:28:22.400] Header download terminated               peer=ef233783
DEBUG[07-07|14:28:22.400] Block body download terminated           err=nil
DEBUG[07-07|14:28:22.400] Receipt download terminated              err=nil
INFO [07-07|14:28:22.401] Imported new chain segment               blocks=2    txs=0       mgas=0.000    elapsed=1.587ms     mgasps=0.000   number=470,750 hash=639be6..56b593 age=1w3d1h     dirty=218.47KiB
DEBUG[07-07|14:28:22.401] Synchronisation terminated               elapsed=4m3.786s
TRACE[07-07|14:28:22.403] Announced block                          hash=639be6..56b593 recipients=1 duration=2562047h47m16.854s
TRACE[07-07|14:28:22.403] Announced block                          id=ef23378316092d8f conn=trusted-staticdial number=470,750 hash=639be6..56b593
(...)

Steps to reproduce the behaviour

Below are the steps I follow to reproduce the problem with a two node network on my Mac.

  1. Init the first node (the node that has the full blockchain data) geth --datadir ./node1 init genesis.json

  2. Init the second node (the node that tries to full sync from block zero) geth --datadir ./node2 init genesis.json

  3. Import the blockchain data into the first node from a file geth --datadir ./node1 import tnet-blockchain-99498.backup The import completes successfully so I assume this is not a problem of corrupted blockchain data (bad blocks, etc.)

  4. Run the first node geth --datadir ./node1 --networkid 116111110106 --syncmode "full" --port 30314 --nat "extip:127.0.0.1" --authrpc.port "8602" --nodiscover --verbosity 3 --vmodule=eth=5,p2p=5,downloader=5

  5. Run the second node geth --datadir ./node2 --networkid 116111110106 --syncmode "full" --port 30316 --nat "extip:127.0.0.1" --authrpc.port "8603" --nodiscover --verbosity 3 --vmodule=eth=5,p2p=5,downloader=5 (I am using static-nodes.json/trusted-nodes.json for node discovery)

If in step 3 I import the blockchain data from the original local testnet that was running on my Mac, the full sync completes successfully. If instead I import the blockchain data from either the Testnet or Mainnet, the full sync fails at block 193 (and at block 6145 for the Testnet data).

I found out that I can make the node sync successfully by bootstraping the sync process with some initial blockchain data. If I import the first 6228 Testnet blocks from a file, or the first 280 blocks for Mainnet, into node2 the full sync will complete successfully.

Backtrace

  1. Init the first node (the node that has the full blockchain data) step1_init_node1_log.txt

  2. Init the second node (the node that tries to full sync from block zero) step2_init_node2_log.txt

  3. Import the blockchain data into the first node from a file step3_import_node1_log.txt

  4. Run the first node step4_run_node1_log.txt

  5. Run the second node: full sync fails step5_sync_node2_log.txt

holiman commented 1 year ago

We can't really offer much help with so old code, and particularly not when it's also modified. It is a huge timesink to investigate problems that might already be solved on master.

I'm a bit surprised that you don't see more reasons why the deliveries were not accepted. Perhaps expand --vmodule=eth=5,p2p=5,downloader=5 to also include core at high verbosity, in case there's an error from core/vm that you are not seeing?

jakub-freebit commented 1 year ago

Thank you for your reply. I have added core=5 to --vmodule, and tried --verbosity 5. Unfortunately no new errors showed up. I will start rebasing to a newer version of geth and see if I can reproduce this issue.

(...)
INFO [07-07|17:55:19.094] Block synchronisation started 
DEBUG[07-07|17:55:19.095] Synchronising with the network           peer=ef23378316092d8f85540b7b17a32d5fa791018ba748ea9e663bab26b2496822 eth=67 head=1d3db9..8448a7 td=22,503,861 mode=full
DEBUG[07-07|17:55:19.095] Retrieving remote chain head             peer=ef233783
DEBUG[07-07|17:55:19.095] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromhash=1d3db9..8448a7 skip=63 reverse=true
DEBUG[07-07|17:55:19.097] Remote head identified, no pivot         peer=ef233783 number=99498 hash=1d3db9..8448a7
DEBUG[07-07|17:55:19.097] Looking for common ancestor              peer=ef233783 local=0 remote=99498
TRACE[07-07|17:55:19.097] Span searching for common ancestor       peer=ef233783 count=12  from=99321 skip=15
DEBUG[07-07|17:55:19.097] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=12  fromnum=99321 skip=15 reverse=false
TRACE[07-07|17:55:19.098] Binary searching for common ancestor     peer=ef233783 start=0 end=99498
DEBUG[07-07|17:55:19.098] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=49749 skip=0  reverse=false
DEBUG[07-07|17:55:19.099] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=24874 skip=0  reverse=false
DEBUG[07-07|17:55:19.099] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=12437 skip=0  reverse=false
DEBUG[07-07|17:55:19.099] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=6218  skip=0  reverse=false
DEBUG[07-07|17:55:19.100] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=3109  skip=0  reverse=false
DEBUG[07-07|17:55:19.100] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=1554  skip=0  reverse=false
DEBUG[07-07|17:55:19.100] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=777   skip=0  reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=388   skip=0  reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=194   skip=0  reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=97    skip=0  reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=48    skip=0  reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=24    skip=0  reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=12    skip=0  reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=6     skip=0  reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=3     skip=0  reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=1   fromnum=1     skip=0  reverse=false
DEBUG[07-07|17:55:19.102] Found common ancestor                    peer=ef233783 number=0     hash=000000..000000
DEBUG[07-07|17:55:19.102] Downloading receipts                     origin=1
DEBUG[07-07|17:55:19.102] Directing header downloads               peer=ef233783 origin=1
DEBUG[07-07|17:55:19.102] Downloading block bodies                 origin=1
TRACE[07-07|17:55:19.103] Fetching skeleton headers                peer=ef233783 count=192 from=1
DEBUG[07-07|17:55:19.103] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=128 fromnum=192   skip=191 reverse=false
DEBUG[07-07|17:55:19.106] Filling up skeleton                      from=1
TRACE[07-07|17:55:19.107] Requesting new batch of headers          peer=ef233783 from=1
DEBUG[07-07|17:55:19.107] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=1     skip=0   reverse=false
TRACE[07-07|17:55:19.109] Pre-scheduled new headers                peer=ef23378316092d8f count=192 from=1
TRACE[07-07|17:55:19.109] Delivered new batch of headers           peer=ef233783         count=192 accepted=192
TRACE[07-07|17:55:19.110] Requesting new batch of headers          peer=ef233783         from=193
DEBUG[07-07|17:55:19.110] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=193   skip=0   reverse=false
TRACE[07-07|17:55:19.110] Requesting new batch of bodies           peer=ef233783         count=1   from=8
DEBUG[07-07|17:55:19.110] Fetching batch of block bodies           id=ef23378316092d8f conn=trusted-staticdial count=1
INFO [07-07|17:55:19.110] Downloader queue stats                   receiptTasks=0 blockTasks=184 itemSize=321.89B throttle=8192
DEBUG[07-07|17:55:19.110] Inserting downloaded chain               items=7 firstnum=1 firsthash=8d4864..eb9087 lastnum=7 lasthash=11a1a1..78f44e
TRACE[07-07|17:55:19.110] Skeleton filling not accepted            peer=ef23378316092d8f from=193
DEBUG[07-07|17:55:19.110] Failed to deliver retrieved headers      peer=ef233783         err="delivery not accepted"
TRACE[07-07|17:55:19.110] Requesting new batch of headers          peer=ef233783         from=385
DEBUG[07-07|17:55:19.110] Fetching batch of headers                id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=385   skip=0   reverse=false
TRACE[07-07|17:55:19.110] Delivered new batch of bodies            peer=ef233783         count=1   accepted=1
DEBUG[07-07|17:55:19.111] Inserted new block                       number=1     hash=8d4864..eb9087 uncles=0 txs=0 gas=0 elapsed="607.175µs" root=809a84..ccdfa6
DEBUG[07-07|17:55:19.111] Inserted new block                       number=2     hash=4c89b6..c08b02 uncles=0 txs=0 gas=0 elapsed="287.911µs" root=7aad3f..f94031
TRACE[07-07|17:55:19.111] Requesting new batch of bodies           peer=ef233783         count=128 from=9
DEBUG[07-07|17:55:19.111] Inserted new block                       number=3     hash=051ca6..75c5ca uncles=0 txs=0 gas=0 elapsed="261.909µs" root=f0f8c2..1c8f69
DEBUG[07-07|17:55:19.112] Fetching batch of block bodies           id=ef23378316092d8f conn=trusted-staticdial count=128
DEBUG[07-07|17:55:19.112] Inserted new block                       number=4     hash=d4d00b..dec850 uncles=0 txs=0 gas=0 elapsed="277.223µs" root=257125..85b264
TRACE[07-07|17:55:19.112] Delivered new batch of headers           peer=ef233783         count=192 accepted=192
TRACE[07-07|17:55:19.112] Requesting new batch of headers          peer=ef233783         from=577
(...)
jakub-freebit commented 11 months ago

A quick update on the progress:

I added some logging of my own and found out that the cause for the deliveries not being accepted is that, instead of a batch of 192 headers, only 1 header is downloaded in the batch starting from=193 (both Mainnet and Testnet) and the batch starting from=6145 (Testnet only).

After changing the donwloader's MaxHeaderFetch param from 192 to 1 (and setting MaxCount = 1 in downloader.go:760, and commenting out the stalling peer check in downloader.go:1097), I was able to sync successfully on both networks.

Obviously this is not a solution to the problem, merely a clue. I will try rebasing to v1.11.6 next and continue the investigation.

jakub-freebit commented 11 months ago

The reason for sync failure was the following:

On the side that provides the sync data, after the old blocks where moved to freezer, the AncientRange function failed to read some of the 192 header batches, because my modified version of Clique puts additional data into headers, which makes the batch size exceed the hard coded read size limit max := count * 700 in accessors_chain.go:337

Had geth logged a warning or error around AncientRange, it would have been way faster to find the cause for this issue. I added a PR #28379 that addresses this.

AsadullahSoomro commented 5 months ago

I am using Clique PoA geth When I tried to install a NEW node it's not working. it does not sync with the network. My current node is still working fine. On the synchronization of the new nodes

I am using geth version v1.10.23-unstable

Initially, I encountered these errors during new node synchronization.

Warning: "Downloader queue stats" Warning: "Sync Failed, Dropping Peer" Error: "Retrieved hash chain is invalid: no peers available or all tried for download" Error: "Peer is unknown or unhealthy" Error: "Timeout" Error: "Pending request exists for joining peer" Error: "Expired request does not exist"

image