Closed jakub-freebit closed 11 months ago
We can't really offer much help with so old code, and particularly not when it's also modified. It is a huge timesink to investigate problems that might already be solved on master.
I'm a bit surprised that you don't see more reasons why the deliveries were not accepted. Perhaps expand --vmodule=eth=5,p2p=5,downloader=5
to also include core
at high verbosity, in case there's an error from core/vm that you are not seeing?
Thank you for your reply.
I have added core=5
to --vmodule
, and tried --verbosity 5
. Unfortunately no new errors showed up.
I will start rebasing to a newer version of geth and see if I can reproduce this issue.
(...)
INFO [07-07|17:55:19.094] Block synchronisation started
DEBUG[07-07|17:55:19.095] Synchronising with the network peer=ef23378316092d8f85540b7b17a32d5fa791018ba748ea9e663bab26b2496822 eth=67 head=1d3db9..8448a7 td=22,503,861 mode=full
DEBUG[07-07|17:55:19.095] Retrieving remote chain head peer=ef233783
DEBUG[07-07|17:55:19.095] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromhash=1d3db9..8448a7 skip=63 reverse=true
DEBUG[07-07|17:55:19.097] Remote head identified, no pivot peer=ef233783 number=99498 hash=1d3db9..8448a7
DEBUG[07-07|17:55:19.097] Looking for common ancestor peer=ef233783 local=0 remote=99498
TRACE[07-07|17:55:19.097] Span searching for common ancestor peer=ef233783 count=12 from=99321 skip=15
DEBUG[07-07|17:55:19.097] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=12 fromnum=99321 skip=15 reverse=false
TRACE[07-07|17:55:19.098] Binary searching for common ancestor peer=ef233783 start=0 end=99498
DEBUG[07-07|17:55:19.098] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=49749 skip=0 reverse=false
DEBUG[07-07|17:55:19.099] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=24874 skip=0 reverse=false
DEBUG[07-07|17:55:19.099] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=12437 skip=0 reverse=false
DEBUG[07-07|17:55:19.099] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=6218 skip=0 reverse=false
DEBUG[07-07|17:55:19.100] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=3109 skip=0 reverse=false
DEBUG[07-07|17:55:19.100] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=1554 skip=0 reverse=false
DEBUG[07-07|17:55:19.100] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=777 skip=0 reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=388 skip=0 reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=194 skip=0 reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=97 skip=0 reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=48 skip=0 reverse=false
DEBUG[07-07|17:55:19.101] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=24 skip=0 reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=12 skip=0 reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=6 skip=0 reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=3 skip=0 reverse=false
DEBUG[07-07|17:55:19.102] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=1 fromnum=1 skip=0 reverse=false
DEBUG[07-07|17:55:19.102] Found common ancestor peer=ef233783 number=0 hash=000000..000000
DEBUG[07-07|17:55:19.102] Downloading receipts origin=1
DEBUG[07-07|17:55:19.102] Directing header downloads peer=ef233783 origin=1
DEBUG[07-07|17:55:19.102] Downloading block bodies origin=1
TRACE[07-07|17:55:19.103] Fetching skeleton headers peer=ef233783 count=192 from=1
DEBUG[07-07|17:55:19.103] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=128 fromnum=192 skip=191 reverse=false
DEBUG[07-07|17:55:19.106] Filling up skeleton from=1
TRACE[07-07|17:55:19.107] Requesting new batch of headers peer=ef233783 from=1
DEBUG[07-07|17:55:19.107] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=1 skip=0 reverse=false
TRACE[07-07|17:55:19.109] Pre-scheduled new headers peer=ef23378316092d8f count=192 from=1
TRACE[07-07|17:55:19.109] Delivered new batch of headers peer=ef233783 count=192 accepted=192
TRACE[07-07|17:55:19.110] Requesting new batch of headers peer=ef233783 from=193
DEBUG[07-07|17:55:19.110] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=193 skip=0 reverse=false
TRACE[07-07|17:55:19.110] Requesting new batch of bodies peer=ef233783 count=1 from=8
DEBUG[07-07|17:55:19.110] Fetching batch of block bodies id=ef23378316092d8f conn=trusted-staticdial count=1
INFO [07-07|17:55:19.110] Downloader queue stats receiptTasks=0 blockTasks=184 itemSize=321.89B throttle=8192
DEBUG[07-07|17:55:19.110] Inserting downloaded chain items=7 firstnum=1 firsthash=8d4864..eb9087 lastnum=7 lasthash=11a1a1..78f44e
TRACE[07-07|17:55:19.110] Skeleton filling not accepted peer=ef23378316092d8f from=193
DEBUG[07-07|17:55:19.110] Failed to deliver retrieved headers peer=ef233783 err="delivery not accepted"
TRACE[07-07|17:55:19.110] Requesting new batch of headers peer=ef233783 from=385
DEBUG[07-07|17:55:19.110] Fetching batch of headers id=ef23378316092d8f conn=trusted-staticdial count=192 fromnum=385 skip=0 reverse=false
TRACE[07-07|17:55:19.110] Delivered new batch of bodies peer=ef233783 count=1 accepted=1
DEBUG[07-07|17:55:19.111] Inserted new block number=1 hash=8d4864..eb9087 uncles=0 txs=0 gas=0 elapsed="607.175µs" root=809a84..ccdfa6
DEBUG[07-07|17:55:19.111] Inserted new block number=2 hash=4c89b6..c08b02 uncles=0 txs=0 gas=0 elapsed="287.911µs" root=7aad3f..f94031
TRACE[07-07|17:55:19.111] Requesting new batch of bodies peer=ef233783 count=128 from=9
DEBUG[07-07|17:55:19.111] Inserted new block number=3 hash=051ca6..75c5ca uncles=0 txs=0 gas=0 elapsed="261.909µs" root=f0f8c2..1c8f69
DEBUG[07-07|17:55:19.112] Fetching batch of block bodies id=ef23378316092d8f conn=trusted-staticdial count=128
DEBUG[07-07|17:55:19.112] Inserted new block number=4 hash=d4d00b..dec850 uncles=0 txs=0 gas=0 elapsed="277.223µs" root=257125..85b264
TRACE[07-07|17:55:19.112] Delivered new batch of headers peer=ef233783 count=192 accepted=192
TRACE[07-07|17:55:19.112] Requesting new batch of headers peer=ef233783 from=577
(...)
A quick update on the progress:
I added some logging of my own and found out that the cause for the deliveries not being accepted is that, instead of a batch of 192 headers, only 1 header is downloaded in the batch starting from=193
(both Mainnet and Testnet) and the batch starting from=6145
(Testnet only).
After changing the donwloader's MaxHeaderFetch
param from 192 to 1 (and setting MaxCount = 1
in downloader.go:760, and commenting out the stalling peer check in downloader.go:1097), I was able to sync successfully on both networks.
Obviously this is not a solution to the problem, merely a clue. I will try rebasing to v1.11.6 next and continue the investigation.
The reason for sync failure was the following:
On the side that provides the sync data, after the old blocks where moved to freezer, the AncientRange
function failed to read some of the 192 header batches, because my modified version of Clique puts additional data into headers, which makes the batch size exceed the hard coded read size limit max := count * 700
in accessors_chain.go:337
Had geth logged a warning or error around AncientRange
, it would have been way faster to find the cause for this issue. I added a PR #28379 that addresses this.
I am using Clique PoA geth When I tried to install a NEW node it's not working. it does not sync with the network. My current node is still working fine. On the synchronization of the new nodes
I am using geth version v1.10.23-unstable
Initially, I encountered these errors during new node synchronization.
Warning: "Downloader queue stats" Warning: "Sync Failed, Dropping Peer" Error: "Retrieved hash chain is invalid: no peers available or all tried for download" Error: "Peer is unknown or unhealthy" Error: "Timeout" Error: "Pending request exists for joining peer" Error: "Expired request does not exist"
Please note that I am using a geth fork with a modified Clique consensus, however the code around sync, downloader, etc. is unchanged. The base geth version v1.10.26 I am using is quite old. If required, I will try to rebase to the latest version if this might solve the issue. I will be very grateful for any help or hints on how to solve this issue.
System information
Geth version: Geth/v1.10.26-stable-f05c9b7d/darwin-amd64/go1.20.3 CL client & version: none (pre-merge private PoA network) OS & Version: Linux (Ubuntu 20.04.4 LTS) & macOs Monterey 12.6.4 (21G526)
Background
I am operating three private PoA networks running on a geth fork with a slightly modified Clique consensus.
Recently I have noticed that new nodes on both Testnet (2) and Mainnet (3) started failing to sync (both full and snap sync) when syncing from block zero. The problem does not occur when trying to sync the blockchain data from the local testnet (1). Syncing up to the head starting from a recent block works OK on all networks. I exported the blockchain data from both Testnet (2) and Mainnet (3), imported them into a new local testnet (2 nodes) on my Mac and was able to reproduce the problem.
Expected behaviour
Fresh sync (both full and snap) completes successfully.
Actual behaviour
Geth fails at the beginning of a fresh sync (both full and snap) with the following errors.
Testnet blockchain data:
Mainnet blockchain data:
Local testnet blockchain data syncs successfully:
Steps to reproduce the behaviour
Below are the steps I follow to reproduce the problem with a two node network on my Mac.
Init the first node (the node that has the full blockchain data)
geth --datadir ./node1 init genesis.json
Init the second node (the node that tries to full sync from block zero)
geth --datadir ./node2 init genesis.json
Import the blockchain data into the first node from a file
geth --datadir ./node1 import tnet-blockchain-99498.backup
The import completes successfully so I assume this is not a problem of corrupted blockchain data (bad blocks, etc.)Run the first node
geth --datadir ./node1 --networkid 116111110106 --syncmode "full" --port 30314 --nat "extip:127.0.0.1" --authrpc.port "8602" --nodiscover --verbosity 3 --vmodule=eth=5,p2p=5,downloader=5
Run the second node
geth --datadir ./node2 --networkid 116111110106 --syncmode "full" --port 30316 --nat "extip:127.0.0.1" --authrpc.port "8603" --nodiscover --verbosity 3 --vmodule=eth=5,p2p=5,downloader=5
(I am using static-nodes.json/trusted-nodes.json for node discovery)If in step 3 I import the blockchain data from the original local testnet that was running on my Mac, the full sync completes successfully. If instead I import the blockchain data from either the Testnet or Mainnet, the full sync fails at block 193 (and at block 6145 for the Testnet data).
I found out that I can make the node sync successfully by bootstraping the sync process with some initial blockchain data. If I import the first 6228 Testnet blocks from a file, or the first 280 blocks for Mainnet, into node2 the full sync will complete successfully.
Backtrace
Init the first node (the node that has the full blockchain data) step1_init_node1_log.txt
Init the second node (the node that tries to full sync from block zero) step2_init_node2_log.txt
Import the blockchain data into the first node from a file step3_import_node1_log.txt
Run the first node step4_run_node1_log.txt
Run the second node: full sync fails step5_sync_node2_log.txt