ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

Geth Node Sync: best metrics to track progress of state trie download? #22384

Open ernopp opened 3 years ago

ernopp commented 3 years ago

Hey everyone,

This is my first time setting up a full node with Geth, and I was hoping someone could help me better track progress.

It has been running for about 8 days, and I believe geth is currently doing the "state sync" phase.

1) Based on the details and performance below, do you think the sync will ever complete? Or do I need to change some settings / upgrade the machine specs?

2) Do you have any recommendations on metrics to track the progress of the sync? I also have the InfluxDB+Grafana setup described here.

Many thanks


Geth Settings

/usr/bin/sudo /usr/bin/geth --rpc --rpcaddr localhost --rpcport 8545 --datadir /ssd/ethereum --nousb --cache=8184 --syncmode fast --maxpeers 2000 --metrics --metrics.influxdb --metrics.influxdb.endpoint "http://0.0.0.0:8086" --metrics.influxdb.username "$U" --metrics.influxdb.password "$PW"

Machine Specs

GCP e2-standard-2 (2 vCPUs, 8 GB memory)

500GB SSD drive

Machine Performance

(machine performance screenshot)

eth.syncing output

{
  currentBlock: 11932371,
  highestBlock: 11932458,
  knownStates: 733106883,
  pulledStates: 733044518,
  startingBlock: 11928734
}
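
For what it's worth, the same numbers can be pulled over the HTTP RPC endpoint already enabled above (localhost:8545); this is only a rough lower bound on progress, since knownStates keeps growing as the trie is explored. A minimal sketch, assuming jq is installed:

# poll eth_syncing over JSON-RPC and print a rough pulled/known ratio
SYNC=$(curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545)
PULLED=$(( $(echo "$SYNC" | jq -r '.result.pulledStates') ))   # values are hex quantities, bash arithmetic converts them
KNOWN=$(( $(echo "$SYNC" | jq -r '.result.knownStates') ))
echo "state entries: $PULLED / $KNOWN (~$(( PULLED * 100 / KNOWN ))% of known so far)"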

Data Stored

> sudo du -H /ssd
28      /ssd/beacon/beaconchaindata
40      /ssd/beacon
179157676       /ssd/ethereum/geth/chaindata/ancient
340470580       /ssd/ethereum/geth/chaindata
201612  /ssd/ethereum/geth/ethash
4664    /ssd/ethereum/geth/nodes
709976  /ssd/ethereum/geth/triecache
341386840       /ssd/ethereum/geth
8       /ssd/ethereum/keystore
341386856       /ssd/ethereum
341386900       /ssd

Latest logs sample

Feb 26 11:06:49 eth-node sudo[18855]: INFO [02-26|11:06:49.022] Initializing fast sync bloom             items=291095016 errorrate=0.000 elapsed=25m47.374s
Feb 26 11:06:57 eth-node sudo[18855]: INFO [02-26|11:06:57.022] Initializing fast sync bloom             items=291116651 errorrate=0.000 elapsed=25m55.375s
Feb 26 11:06:59 eth-node sudo[18855]: WARN [02-26|11:06:59.659] Dropping unsynced node during fast sync  id=9f9bb99ec1018c88 conn=inbound addr=222.190.117.130:52684 type=Geth/v1.9.25-stable-...
Feb 26 11:07:05 eth-node sudo[18855]: INFO [02-26|11:07:05.023] Initializing fast sync bloom             items=291138086 errorrate=0.000 elapsed=26m3.375s
Feb 26 11:07:06 eth-node sudo[18855]: INFO [02-26|11:07:06.240] Imported new block headers               count=1    elapsed=12.311ms    number=11932451 hash="a37c27…bf08e3" age=1m37s
Feb 26 11:07:07 eth-node sudo[18855]: INFO [02-26|11:07:07.108] Downloader queue stats                   receiptTasks=0    blockTasks=0    itemSize=201.23KiB throttle=326
Feb 26 11:07:13 eth-node sudo[18855]: INFO [02-26|11:07:13.023] Initializing fast sync bloom             items=291157306 errorrate=0.000 elapsed=26m11.375s
Feb 26 11:07:19 eth-node sudo[18855]: WARN [02-26|11:07:19.407] Checkpoint challenge timed out, dropping id=e4c869f857022b98 conn=inbound addr=47.94.10.24:42520     type=Geth/v1.9.19-stable-...
Feb 26 11:07:21 eth-node sudo[18855]: INFO [02-26|11:07:21.023] Initializing fast sync bloom             items=291177218 errorrate=0.000 elapsed=26m19.375s
Feb 26 11:07:26 eth-node sudo[18855]: INFO [02-26|11:07:26.674] Imported new state entries               count=384  elapsed="6.344µs"   processed=733043119 pending=39982 trieretry=0   coderetry=0 duplicate=0 unexpected=41
Feb 26 11:07:29 eth-node sudo[18855]: INFO [02-26|11:07:29.035] Initializing fast sync bloom             items=291194474 errorrate=0.000 elapsed=26m27.387s
Feb 26 11:07:30 eth-node sudo[18855]: WARN [02-26|11:07:30.009] Dropping unsynced node during fast sync  id=97e32697f36f3c0c conn=dyndial addr=135.181.113.114:30303 type=Geth/v1.9.6-unstable...
Feb 26 11:07:37 eth-node sudo[18855]: INFO [02-26|11:07:37.047] Initializing fast sync bloom             items=291213828 errorrate=0.000 elapsed=26m35.400s
Feb 26 11:07:45 eth-node sudo[18855]: INFO [02-26|11:07:45.048] Initializing fast sync bloom             items=291233936 errorrate=0.000 elapsed=26m43.400s
Feb 26 11:07:52 eth-node sudo[18855]: INFO [02-26|11:07:52.194] Imported new state entries               count=247  elapsed="4.277µs"   processed=733043366 pending=43934 trieretry=0   coderetry=0 duplicate=0 unexpected=41
Feb 26 11:07:53 eth-node sudo[18855]: INFO [02-26|11:07:53.049] Initializing fast sync bloom             items=291254043 errorrate=0.000 elapsed=26m51.401s
Feb 26 11:08:01 eth-node sudo[18855]: INFO [02-26|11:08:01.049] Initializing fast sync bloom             items=291277819 errorrate=0.000 elapsed=26m59.402s
Feb 26 11:08:09 eth-node sudo[18855]: INFO [02-26|11:08:09.050] Initializing fast sync bloom             items=291299359 errorrate=0.000 elapsed=27m7.402s
Feb 26 11:08:11 eth-node sudo[18855]: INFO [02-26|11:08:11.864] Imported new block headers               count=1    elapsed=15.773ms    number=11932452 hash="3aabcb…9a783b" age=2m11s
Feb 26 11:08:12 eth-node sudo[18855]: INFO [02-26|11:08:12.277] Downloader queue stats                   receiptTasks=0    blockTasks=0    itemSize=201.26KiB throttle=326
Feb 26 11:08:16 eth-node sudo[18855]: WARN [02-26|11:08:16.116] Dropping unsynced node during fast sync  id=3049cd81df1f6e5d conn=inbound addr=176.9.143.74:40158    type=Geth/v1.9.25-stable-...
Feb 26 11:08:17 eth-node sudo[18855]: INFO [02-26|11:08:17.051] Initializing fast sync bloom             items=291320239 errorrate=0.000 elapsed=27m15.403s
Feb 26 11:08:19 eth-node sudo[18855]: WARN [02-26|11:08:19.653] Dropping unsynced node during fast sync  id=309e62d9c2fa98ba conn=inbound addr=159.117.150.121:40290 type=Geth/v1.9.9-stable-0...
Feb 26 11:08:21 eth-node sudo[18855]: INFO [02-26|11:08:21.436] Imported new block headers               count=1    elapsed=37.423ms    number=11932453 hash="45e5ee…36485a" age=2m11s
Feb 26 11:08:21 eth-node sudo[18855]: WARN [02-26|11:08:21.799] Checkpoint challenge timed out, dropping id=95069edb97397bee conn=inbound addr=18.200.194.248:45006  type=Geth/v1.9.25-stable-...
Feb 26 11:08:24 eth-node sudo[18855]: WARN [02-26|11:08:24.636] Dropping unsynced node during fast sync  id=8948548c76d25eb4 conn=dyndial addr=71.178.43.135:30302   type=Geth/v1.9.25-stable-...
Feb 26 11:08:25 eth-node sudo[18855]: INFO [02-26|11:08:25.075] Initializing fast sync bloom             items=291338275 errorrate=0.000 elapsed=27m23.427s
Feb 26 11:01:42 eth-node sudo[18855]: INFO [02-26|11:01:42.244] Imported new block headers               count=1    elapsed=12.403ms    number=11932431 hash="1ef3ce…09d21c"
Feb 26 11:01:43 eth-node sudo[18855]: INFO [02-26|11:01:43.144] Downloader queue stats                   receiptTasks=0    blockTasks=0    itemSize=215.80KiB t
karalabe commented 3 years ago

Unfortunately there's no way to say in advance how many trie nodes there are; it's an exploratory procedure that keeps expanding until it is done. Geth v1.10.0 will ship snap sync, which does have support for progress monitoring (for the most part).

One thing that stands out in your post is the 2000 peer limit. Having that many peers will be quite detrimental to sync, as your node will keep requesting data from all of them, and if a large number of them are "junk" peers, they will just waste resources without giving you useful data. I don't think there's a meaningful reason to raise the peer count above the default.
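
For example, keeping everything else from the invocation above and just pinning the peer count to the default (50):

/usr/bin/geth --rpc --rpcaddr localhost --rpcport 8545 --datadir /ssd/ethereum --nousb --cache=8184 --syncmode fast --maxpeers 50 --metrics --metrics.influxdb ...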

AFAIK GCP e2-standard-2 doesn't have a local SSD attached. Using an HDD will definitely bleed out. Attaching a remote SSD (if possible?) will probably have very high IO latency and an IOPS limitation, so you might end up with fairly low performance.

OctaveOpp commented 3 years ago

Thanks @karalabe for the great suggestions.

  1. Are the monitoring metrics below what you would expect to see while syncing? We are now 2+ weeks in, we don't appear to be CPU/disk/RAM bound, and we can't find the bottleneck in our sync.

  2. Is it expected for disk writes to be so low? We're also concerned that the state count is incrementing so slowly (730m -> 760m in the last week); see the quick rate check below.
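
A quick way to put a number on that second point, assuming the HTTP RPC endpoint from the parameters below and jq, is to sample pulledStates twice and compute a rate (a rough sketch only):

# sample pulledStates 60s apart and print the state-entry download rate
rpc() { curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545 | jq -r '.result.pulledStates'; }
A=$(( $(rpc) )); sleep 60; B=$(( $(rpc) ))
echo "$(( (B - A) / 60 )) state entries imported per second"

For reference, 730m -> 760m over a week works out to roughly 50 entries/sec on average, which is consistent with the concern that the sync may never catch up.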

Machine Specs

(admittedly this is a little skinny)

GCP e2-standard-2 (2 vCPUs, 8 GB memory) 500GB SSD drive attached (15,000 IOPS / 240MB/s read/write)

Geth Parameters

/usr/bin/sudo /usr/bin/geth --rpc --rpcaddr localhost --rpcport 8545 --datadir /ssd/ethereum --nousb --cache=12278 --syncmode fast --metrics --metrics.influxdb --metrics.influxdb.endpoint "http://0.0.0.0:8086" --metrics.influxdb.username "geth" --metrics.influxdb.password "XXX"

Vps Monitoring

(VPS monitoring screenshot)

Syncing status

> web3.eth.syncing
{
  currentBlock: 11971299,
  highestBlock: 11971404,
  knownStates: 760303367,
  pulledStates: 760174176,
  startingBlock: 11959099
}
> web3.net.peerCount
50

SSD Read/Latency/IOPS tests

Read

>sudo hdparm -Ttv /dev/sdb1
/dev/sdb1:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 Timing cached reads:   13376 MB in  1.99 seconds = 6726.19 MB/sec
 Timing buffered disk reads: 726 MB in  3.00 seconds = 241.67 MB/sec

Latency

sudo ioping /dev/sdb1
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=1 time=316 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=2 time=214 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=3 time=1.02 ms
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=4 time=528 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=5 time=215 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=6 time=575 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=7 time=611 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=8 time=761 us
4 KiB from /dev/sdb1 (block device 500.0 GiB): request=9 time=145 us
^C
--- /dev/sdb1 (block device 500.0 GiB) ioping statistics ---
9 requests completed in 8.96 s, 2.05 k iops, 8.02 MiB/s
min/avg/max/mdev = 145 us / 486 us / 1.02 ms / 273 us

IOPS

sudo fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=posixaio, iodepth=1
fio-2.2.10
Starting 1 process
random-write: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [F(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=4773: Thu Mar  4 11:47:15 2021
  write: io=928692KB, bw=9784.5KB/s, iops=2446, runt= 94919msec

Log Output

Mar 04 12:07:27 eth-node sudo[5509]: INFO [03-04|12:07:27.000] Imported new state entries               count=384  elapsed="4.42µs"    processed=760466124 pending=174556 trieretry=0   coderetry=0 duplicate=0 unexpected=20952
Mar 04 12:07:32 eth-node sudo[5509]: INFO [03-04|12:07:32.398] Imported new state entries               count=384  elapsed="248.761µs" processed=760466508 pending=175844 trieretry=0   coderetry=0 duplicate=0 unexpected=20952
Mar 04 12:07:34 eth-node sudo[5509]: INFO [03-04|12:07:34.541] Imported new block headers               count=1    elapsed=13.565ms    number=11971679 hash="83ab70…9b1d76"
Mar 04 12:07:37 eth-node sudo[5509]: INFO [03-04|12:07:37.288] Imported new state entries               count=384  elapsed="535.156µs" processed=760466892 pending=176850 trieretry=0   coderetry=0 duplicate=0 unexpected=20952
Mar 04 12:07:41 eth-node sudo[5509]: INFO [03-04|12:07:41.833] Imported new state entries               count=384  elapsed="282.606µs" processed=760467276 pending=178562 trieretry=0   coderetry=0 duplicate=0 unexpected=20952
Mar 04 12:07:43 eth-node sudo[5509]: INFO [03-04|12:07:43.709] Imported new block headers               count=2    elapsed=10.507ms    number=11971681 hash="25fd63…23e1ee"
Mar 04 12:07:47 eth-node sudo[5509]: INFO [03-04|12:07:47.739] Imported new state entries               count=384  elapsed="256.671µs" processed=760467660 pending=180108 trieretry=0   coderetry=0 duplicate=0 unexpected=20952
Mar 04 12:07:47 eth-node sudo[5509]: INFO [03-04|12:07:47.742] Imported new state entries               count=0    elapsed="15.281µs"  processed=760467660 pending=180108 trieretry=6   coderetry=0 duplicate=0 unexpected=20954
Mar 04 12:07:51 eth-node sudo[5509]: INFO [03-04|12:07:51.617] Imported new state entries               count=384  elapsed="366.64µs"  processed=760468044 pending=182114 trieretry=2   coderetry=0 duplicate=0 unexpected=20954
Mar 04 12:07:56 eth-node sudo[5509]: INFO [03-04|12:07:56.210] Imported new state entries               count=384  elapsed="88.725µs"  processed=760468428 pending=183708 trieretry=4   coderetry=0 duplicate=0 unexpected=20954
Mar 04 12:08:01 eth-node sudo[5509]: INFO [03-04|12:08:01.004] Imported new state entries               count=384  elapsed="405.227µs" processed=760468812 pending=185378 trieretry=0   coderetry=0 duplicate=0 unexpected=20954
Mar 04 12:08:04 eth-node sudo[5509]: INFO [03-04|12:08:04.631] Imported new state entries               count=384  elapsed="195.561µs" processed=760469196 pending=187062 trieretry=2   coderetry=0 duplicate=0 unexpected=20954
Mar 04 12:08:09 eth-node sudo[5509]: INFO [03-04|12:08:09.812] Imported new state entries               count=384  elapsed="121.964µs" processed=760469580 pending=188486 trieretry=0   coderetry=0 duplicate=0 unexpected=20954
xanoni commented 3 years ago

Asking here as well, because there's been no answer in the other threads yet. I've read the FAQ and I understand that the node doesn't know by itself how far it has advanced in the sync process (except for the initial state sync phase, where it shows a percentage completion indicator).

However, can one of the metrics from the log files be used to estimate overall progress if one also has the log files of a different node that recently completed syncing? Is there some metric that could be compared across the logs?

Would this differ between Snap Sync and Fast Sync?

This would help people understand if Snap Sync is broken (which it seems to be for many here), or if it's just super slow for whatever reason.
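
For fast sync specifically, the closest thing to such a metric is probably the processed= counter in the "Imported new state entries" log lines. A crude comparison against the logs of a node that already finished might look like this (assuming the node runs as a systemd unit called geth; adjust to wherever your logs actually live):

# latest state-entry counter from the local logs (fast sync only)
journalctl -u geth --no-pager | grep 'Imported new state entries' \
  | tail -n 1 | grep -o 'processed=[0-9]*'

The caveat is that the final processed value differs from run to run (it depends on when the pivot block moves and how much the state has grown in the meantime), so it only gives an order-of-magnitude comparison; for snap sync the heal phase makes even that unreliable.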

willianpaixao commented 2 years ago

Hey @karalabe,

now that Geth v1.10.x is released and snap sync is available, could you point us to the metric that answers this issue? I went to the metrics page, but it's many hundreds of lines long and I couldn't find it.

jakubgs commented 2 years ago

I know of at least two exporters that call the RPC API in order to provide syncing metrics.

Why the Geth metrics do not provide a metric for currentBlock and highestBlock from eth_syncing is beyond me.

It seems like such an obvious thing to add to the available metrics.
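
In the meantime, a workaround in the spirit of those exporters is a small cron job that turns eth_syncing into Prometheus metrics via node_exporter's textfile collector. A sketch only; the metric names and output path below are made up for the example, and it assumes jq plus the HTTP RPC on localhost:8545:

# write currentBlock/highestBlock from eth_syncing as node_exporter textfile metrics
OUT=/var/lib/node_exporter/textfile_collector/geth_sync.prom   # hypothetical path
SYNC=$(curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545 | jq -c '.result')
if [ "$SYNC" = "false" ]; then
  echo 'geth_syncing 0' > "$OUT"   # eth_syncing returns false once the node is in sync
else
  printf 'geth_syncing 1\ngeth_sync_current_block %d\ngeth_sync_highest_block %d\n' \
    "$(( $(echo "$SYNC" | jq -r '.currentBlock') ))" \
    "$(( $(echo "$SYNC" | jq -r '.highestBlock') ))" > "$OUT"
fi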

willianpaixao commented 2 years ago

I personally use this one.

This issue still persists; together with #25043, fixing it would be a great improvement to the monitoring of Geth nodes.

I work at a node-as-a-service provider, and monitoring of Geth processes is easily the biggest part of our in-house tooling. Better out-of-the-box monitoring is long overdue.