Closed — elsampsa closed this issue 2 years ago
It looks like you were not fully snap synced when you shut down your node the last time. You were most likely still in the state heal phase; otherwise geth would have imported blocks one by one afterwards. The problem now is that you have a really old state that is not fully healed, so a lot of accounts, balances etc. are different from the current state. In the state healing phase these states are fetched one by one from the network. Since there are so many differences between you and the network, the state heal phase will likely never finish. The best way to fix this is to stop geth and remove the old state that you have (via `geth removedb`, answering `y` for the state database and `n` for the ancient database). This will force the node to start a new snap sync, which should finish in a couple of hours.
Regarding your PS: nodes can recover very quickly from a couple of days of downtime once they are fully synced (and healed). Since the gas limit in Ethereum was voted up and is pretty high atm, the nodes have to do a lot of computation, especially if they want to sync up to the network.
Thanks for the reply. But I am 100% sure that it synced.. checked it from the geth console, etc.
I guess there's nothing else left to do than yet another snap sync; I'll keep this ticket open a few more days just in case..
To confirm: once the snap sync is over, geth will behave like a normal "full" node.. there's no condition/bug? under which it would start a snap sync again. And it's a pity no one ever explains that "Unexpected bytecode heal packet"..!
Yes, a node will never start a sync again after it's been synced once. The "unexpected bytecode heal packet" is printed either when a peer sends us data that we haven't requested, or when a reply arrives after the request has already timed out (so you request something, the peer takes too long to respond, you stop tracking the request, the peer finally responds).
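The mechanism described above can be sketched as a small request tracker. This is purely an illustration of the timeout/late-reply race, not geth's actual code; the class and names are hypothetical:

```python
# Illustrative sketch (NOT geth's actual implementation) of why an
# "unexpected ... heal packet" message appears: requests are tracked by id,
# and a reply arriving after the tracker dropped the request (timeout)
# can no longer be matched, so it counts as "unexpected".

class RequestTracker:
    def __init__(self):
        self.pending = {}  # request id -> what we asked the peer for

    def send(self, req_id, item):
        self.pending[req_id] = item

    def timeout(self, req_id):
        # We gave up waiting; stop tracking the request.
        self.pending.pop(req_id, None)

    def on_reply(self, req_id, data):
        if req_id not in self.pending:
            return "unexpected heal packet"  # late or unsolicited reply
        del self.pending[req_id]
        return "delivered"

tracker = RequestTracker()
tracker.send(1, "bytecode 0xabc")
tracker.timeout(1)                    # peer took too long, request dropped
print(tracker.on_reply(1, b"..."))    # -> unexpected heal packet
```

The same message also covers the other case holiman mentions: a reply with an id that was never requested at all.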
I'm running
geth removedb
Should I remove the ancient database as well..?
I kept the ancient database & now it seems to work like a full node..?
syys 12 08:23:08 sampsa-asus geth[2567]: INFO [09-12|08:23:08.248] Imported new block receipts count=18 elapsed=135.821ms number=14,627,793 hash=769804..0be43a age=4mo3w2d
....
syys 12 08:23:03 sampsa-asus geth[2567]: WARN [09-12|08:23:03.989] Pivot seemingly stale, moving old=15,519,069 new=15,519,133
State heal logs missing at the moment
I've spent some time looking around and trying to understand the differences between geth's sync modes but am still left with some open questions. For example: does a geth node that initially snap or fast syncs end up with the same database state as a "full node" that has executed all of the transactions since the genesis block? Or does snap sync involve only acquiring a "digest" of the storage state of all addresses as of the sync point, such that future requests made to this node for storage as of some older block would then involve the node making new requests to peers to acquire that older state? Please forgive me if any of this is malformed. I'm asking these questions having made some possibly incorrect assumptions about "everything."
Having just written these questions, I thought of a better google query, and found this: https://github.com/ethereum/go-ethereum/issues/24413
Probably a good idea to incorporate some/all of the wisdom there, into the docs.
After doing `geth removedb` (but keeping the ancient database/freezer) it's now in the state heal process again. It downloaded all missing blocks and after that the "state heal" logs started to appear.
Now, as always, logs are saturated with:
Unexpected trienode heal packet
State healing is painstakingly slow:
syys 13 16:51:30 sampsa-asus geth[1325]: INFO [09-13|16:51:30.912] State heal in progress accounts=524,748@26.14MiB slots=1,009,090@76.24MiB codes=877@6.82MiB nodes=5,091,192@1.26GiB pending=29246
Maybe 500 000 accounts per 24h ..?
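For what it's worth, the counters in such a log line can be pulled out programmatically. This is a small sketch assuming the field layout shown in the sample above (the format may differ between geth versions), and — as the maintainers point out later in the thread — these counts cannot be turned into a reliable ETA:

```python
import re

# Sample "State heal in progress" line, copied from the log above.
LINE = ("INFO [09-13|16:51:30.912] State heal in progress "
        "accounts=524,748@26.14MiB slots=1,009,090@76.24MiB "
        "codes=877@6.82MiB nodes=5,091,192@1.26GiB pending=29246")

def heal_counters(line):
    """Parse key=count@size pairs; counts use thousands separators."""
    pairs = re.findall(r"(\w+)=([\d,]+)@", line)
    counters = {k: int(v.replace(",", "")) for k, v in pairs}
    # "pending" has no @size suffix, so it is handled separately.
    counters["pending"] = int(re.search(r"pending=(\d+)", line).group(1))
    return counters

print(heal_counters(LINE))
```

Useful for graphing progress over time, but note there is no known total to compare the counters against.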
So `geth removedb` didn't help. Now what? Should I also remove the "ancient" database and try to do a snap sync?
- download recent global state "SC0" using snap
- then that global state needs to be "healed" using the downloaded blocks: "SC0" --> blocks --> "SC1" (current global state)
Nobody knows why.. just (clueless) nerds (like myself) trying to scrape information from the internet. Do you have any "testing" lab where you take clients on & offline & try to disturb them in a private geth ethereum network? Or does "ci/cd testing" equal "merge" (I hope not).
EDIT: maybe the ancient database is corrupt..? Does the ancient database include a state trie that is loaded in the beginning & that geth then tries to heal..
..or is the latest state fetched with snap sync if I have done `geth removedb` (but not removed the ancient database)?
On a related note, I ended up with the impression that it has become quite difficult to stand up a full node instead of using centralized providers such as Alchemy/Moralis/Infura etc. because of these types of troubles (and perhaps my last attempt was when snap sync was too new). That slowness and/or trouble in this area ends up pushing dApp providers to rely on centralized resources such as Infura/Moralis/Alchemy (vs. it being super easy to stand up their own nodes), which may be an issue to think about (pros and cons to each).
It would be really nice to be able to stand up a full node that can answer historical requests (gcmode archive) quickly by acquiring that database and/or being able to forward such queries to/from peers, and then verifying it on a best-efforts basis over a longer timeframe. I'm not fully conversant on the security implications and/or implementation difficulties and/or bandwidth/resource management issues of doing such a thing -- but I sense it would help reduce centralization onto a handful of dApp service providers.
Similar scenario on my nodes as well. I have 2 geth archival nodes connected to 2 beacon nodes and both seem to have completed the merge. But the nodes seem to be syncing very slowly. Please find the status from both nodes below.
GETH
node 0 - INFO [09-21|06:25:06.231] Imported new potential chain segment blocks=1 txs=295 mgas=22.221 elapsed=9.974s mgasps=2.228 number=15,559,291 hash=0961a7..1d9571 age=2d21h47m dirty=0.00B
node 1 - INFO [09-21|06:25:07.196] Imported new potential chain segment blocks=1 txs=371 mgas=27.281 elapsed=9.188s mgasps=2.969 number=15,557,806 hash=9b6067..887a31 age=3d2h46m dirty=0.00B
BEACON
node 0 - time="2022-09-21 06:26:27" level=info msg="Processing block batch of size 64 starting from 0xb47a08d1... 4722176/4743130 - estimated time remaining 15m39s" blocksPerSecond=22.3 peers=49 prefix=initial-sync
node 1 - time="2022-09-21 06:25:20" level=info msg="Processing block batch of size 64 starting from 0x2ff82104... 4720690/4743124 - estimated time remaining 29m19s" blocksPerSecond=12.8 peers=45 prefix=initial-sync
The slots (3 days behind the current slot) in the beacon node seem to be in sync with the block shown in the geth node. Can anyone confirm whether this is healthy and our sync is well and good, just a bit slow?
Now, as always, logs are saturated with:
Unexpected trienode heal packet
This has actually been fixed, but not pushed to stable (a.k.a. v1.10.xx). If you use latest (a.k.a. 1.11.xx -unstable) you should not see these messages (as much). Several issues related to snap sync were fixed, and many cases that were previously unable to sync due to IO performance are now able to.
So, I recommend switching geth version.
So `geth removedb` didn't help. Now what? Should I also remove the "ancient" database and try to do a snap sync?
I recommend `geth removedb` but keep ancients intact. @Sreeram1993
node 0 - INFO [09-21|06:25:06.231] Imported new potential chain segment blocks=1 txs=295 mgas=22.221 elapsed=9.974s mgasps=2.228 number=15,559,291 hash=0961a7..1d9571 age=2d21h47m dirty=0.00B
node 1 - INFO [09-21|06:25:07.196] Imported new potential chain segment blocks=1 txs=371 mgas=27.281 elapsed=9.188s mgasps=2.969 number=15,557,806 hash=9b6067..887a31 age=3d2h46m dirty=0.00B
These numbers aren't great -- as you can see, importing one block takes almost 10s. That is very close to the block time. Obviously, if import is slower than the block time, then you will not be able to keep up. Other than that, yeah, looks ok to me.
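To make the arithmetic behind that assessment explicit: `mgasps` in the import log is just `mgas` divided by `elapsed`, and the elapsed time is what must stay under the ~12s post-merge slot time. A quick check using the "node 0" values from the log above:

```python
# Values copied from the "node 0" import line above.
mgas = 22.221       # million gas executed in the block
elapsed_s = 9.974   # seconds taken to import it

# mgasps is simply throughput in million gas per second.
mgasps = mgas / elapsed_s
print(f"{mgasps:.3f} Mgas/s")  # matches the logged mgasps=2.228

# A node keeps up only if it imports a block faster than one arrives.
BLOCK_TIME_S = 12.0  # post-merge slot time
print("keeping up" if elapsed_s < BLOCK_TIME_S else "falling behind")
```

At 9.974s per block against a 12s slot, the node stays ahead, but with almost no margin for bigger blocks or other work.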
Oh, btw -- is there really a point in keeping this bug open? Seems like the discussion has trailed off into new areas?
Thanks for the update @holiman. Is there any way to improve the import speed? I am already using a 16TB SSD and the nodes are in AKS as well. Any solution to increase peers or network bandwidth so that the archival node stays up to date?
Increasing peers will most likely have a negative impact. I don't know what AKS means.
If you are interested in trying something new, we have a new database backend which we think will be an improvement: https://github.com/ethereum/go-ethereum/pull/24615 . Unfortunately, we don't have any conversion-tool, so the only thing one can do with it is to sync from scratch -- which is a PITA with archive node, I know, so I don't actually expect you to try it out.
Other than that, I don't have a lot of ideas.
@holiman Got it fixed and I started running. But I get this INFO multiple times for the same block.
INFO [09-22|13:01:36.034] Imported new potential chain segment blocks=1 txs=210 mgas=25.110 elapsed=1m19.317s mgasps=0.317 number=15,588,279 hash=461b03..8964d1 age=2h34m37s dirty=0.00B
INFO [09-22|13:01:36.176] Imported new potential chain segment blocks=1 txs=210 mgas=25.110 elapsed=142.118ms mgasps=176.686 number=15,588,279 hash=461b03..8964d1 age=2h34m37s dirty=0.00B
INFO [09-22|13:01:36.318] Imported new potential chain segment blocks=1 txs=210 mgas=25.110 elapsed=141.422ms mgasps=177.556 number=15,588,279 hash=461b03..8964d1 age=2h34m37s dirty=0.00B
Certain blocks show the import just once while some show multiple info messages like the above. Is it corrupting the node?
I have the same issue after upgrading the geth syncmode from full to snap. 12 hours later it is still syncing with no clear indication of when or if the process will end. There are some posts saying that the "accounts" parameter should reach the total number of Ethereum accounts (201M today) for the snap sync process to end. Which, in this case, means it will never end. Or shall I simply be more patient?
Sep 23 12:34:04 eth2-validator-0 bash[24081]: INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889
In an attempt to follow @holiman's recommendations, I failed to locate the latest version of geth (a.k.a. 1.11.xx -unstable). The last version published on github is 1.10.25. A google search on "go-ethereum v1.11.1" is not helpful either. Any help is welcome.
All it took was enough patience. Snap sync finalized after 20 hours on a GCloud 16vCPU (e2-standard-16) + Balanced Persistent Disks. It will be downgraded to e2-standard-4 now that it is fully synced. Contrary to what is reported in other posts, the accounts parameter in the log line below does not need to reach the total number of Ethereum accounts for the snap sync process to finish.
Sep 23 12:34:04 eth2-validator-0 bash[24081]: INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889
@Yvelo The issue is that mine is an archival sync node :D So I think I need a lot more patience. Happily it has reached 6 hours behind.
I failed to locate the latest version of geth (a.k.a. 1.11.xx -unstable). The last version published on github is 1.10.25. A google search on "go-ethereum v1.11.1" is not helpful either. Any help is welcome.
Didn't see this until now. On any of our "Releases" pages, e.g. https://github.com/ethereum/go-ethereum/releases/tag/v1.10.25 , we link to
Pre-built binaries for all platforms on our downloads page.
On that download-page, we have 'stable' and 'develop' builds.
I am doing sync from scratch with geth version: 1.10.25-stable
With this command line:
geth --cache=12048 --http --datadir /path/to/geth/snap-data --authrpc.jwtsecret /path/to/jwt.hex
The datadir is now about 605G, the "State heal in progress" phase started about 2 days ago, and the logs are now showing:
INFO [10-18|00:41:59.699] State heal in progress accounts=1,800,572@96.80MiB slots=4,221,205@308.02MiB codes=2596@19.81MiB nodes=21,541,853@5.58GiB pending=26107
So it is progressing at about 1M accounts per day? At this rate, I'd need to wait 200 days for it to fully sync 200M accounts?
@Yvelo can you share the number of accounts shown in your logs when it's fully synced? Is it around 169K accounts?
What indicator in the log can be used to estimate the time for the "State heal" to finish?
If I run this command:
geth attach http://localhost:8545
and type:
> eth.syncing
{
currentBlock: 15773502,
healedBytecodeBytes: 20841216,
healedBytecodes: 2605,
healedTrienodeBytes: 6011477028,
healedTrienodes: 21616064,
healingBytecode: 0,
healingTrienodes: 1153,
highestBlock: 15773610,
startingBlock: 0,
syncedAccountBytes: 45757800173,
syncedAccounts: 192540876,
syncedBytecodeBytes: 4627746078,
syncedBytecodes: 700408,
syncedStorage: 814610039,
syncedStorageBytes: 174772608597
}
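Of the fields in that output, only the block numbers lend themselves to a direct reading; the healed*/synced* counters have no known total to compare against. A small sketch using the values shown above:

```python
# A subset of the eth.syncing output shown above (values copied verbatim).
syncing = {
    "currentBlock": 15_773_502,
    "highestBlock": 15_773_610,
    "healedTrienodes": 21_616_064,
    "syncedAccounts": 192_540_876,
}

# Head lag is the one directly interpretable figure.
block_lag = syncing["highestBlock"] - syncing["currentBlock"]
print(f"head lag: {block_lag} blocks")  # -> head lag: 108 blocks

# healedTrienodes has no fixed target, so it cannot be turned into a
# percentage or an ETA for the state-heal phase.
```

This matches the maintainers' point later in the thread: node/account counts are not a usable progress metric.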
There is no saying what the "highest" number of trie nodes is?
So it is progressing at about 1M accounts per day? At this rate, I'd need to wait 200 days for it to fully sync 200M accounts?
No, that's not how it works. But I'm curious, why are you using 1.10.25? As I've recommended above, please use latest (a.k.a. 1.11.xx), which has a lot of snap-sync related fixes.
There is no saying what is the "highest" number of trie nodes are there?
There's no point trying to infer the state of the sync based on how many nodes or accounts you have synced. I know it's tempting (users compare those numbers all the time), but it's not a good metric.
It's like two people walking up an escalator that is going down: comparing how many steps they have taken tells you little, because they walk at different speeds, and one started near the bottom while the other started close to the top.
Thanks for the prompt reply, @holiman!
I used 1.10.25 because I started using geth around 5 days ago without knowing about this bug. I only just read about it now. Hearing that I should delete everything and start from scratch again feels discouraging.
If I were to use 1.11.xx, what is the expected time to sync from scratch to full? Should I use the same command line as above? or is there a faster way to do it? perhaps like a checkpoint sync?
I am wondering: so there is no good metric at all to tell the ETA for the state heal? I wouldn't need to compare with other people if the log showed any indicator of the ETA ...
Hearing that I should delete everything and start from scratch again feels discouraging.
Just switch then, you don't have to restart from scratch.
If I were to use 1.11.xx, what is the expected time to sync from scratch to full? Should I use the same command line as above? or is there a faster way to do it? perhaps like a checkpoint sync?
A couple of days. Yes. No. No.
I am wondering, so there is no good metric at all to tell what is the ETA for the state heal? I don't need to compare with other people if the log shows any indicator about the ETA ...
No. If there were one, you can rest assured we would have added it and displayed it promptly to the users. It's something that users have requested consistently since 2016.
I cannot install 1.11.xx from brew on MacOS using this doc:
https://geth.ethereum.org/docs/install-and-build/installing-geth
Running:
brew install ethereum --devel
I got:
Error: invalid option: --devel
If I go to the downloads page:
https://geth.ethereum.org/downloads/
I can't seem to find the 1.11 on the "Develop builds" section?
it looks empty.
When will the 1.11 be released into stable?
I can't seem to find the 1.11 on the "Develop builds" section?
Hm, yes, that's some bug in our pipeline. We'll look into it.
Fixed, macos binaries available again
It is quite a frustrating experience to use geth to sync from scratch without any indicator of when it will complete. The logs did print the number of accounts: mine shows about 1M accounts, but AFAIK there are 200M accounts, and I've been syncing for 4+ days nonstop. At that rate, it would take 1+ years to fully sync. I am not comfortable using the unstable 1.11 version. I don't know what other issues it might have, and I don't want to risk it for now.
So I gave besu a try, syncing from scratch using checkpoint, and it finished in a bit less than 2 days using this command:
besu \
--sync-mode=X_CHECKPOINT \
--data-storage-format=BONSAI \
--data-path=besu-data-checkpoint \
--rpc-http-enabled \
--engine-jwt-enabled=true \
--engine-jwt-secret=/path/to/jwt.hex \
--engine-host-allowlist="*" \
--metrics-enabled \
--Xplugin-rocksdb-high-spec-enabled
The logs are also very clear; they show:
2022-10-20 07:25:35.524-07:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 15342826 world state nodes
The current number of nodes is 15790393, so that is how I know it's near the finish.
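Because the besu log pairs healed nodes with a running total, a rough progress percentage can be computed from it (using the two figures quoted above):

```python
# Figures from the besu log quoted above.
healed_nodes = 15_342_826       # "Healed ... world state nodes"
estimated_total = 15_790_393    # current node-count estimate

# A rough progress figure -- possible here only because besu reports
# a total alongside the healed count.
progress = healed_nodes / estimated_total
print(f"{progress:.1%} healed")  # -> 97.2% healed
```

Note the total is itself an estimate and keeps moving as the trie is discovered, so this is indicative rather than exact.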
Compare that with the log from geth:
INFO [10-18|00:41:59.699] State heal in progress accounts=1,800,572@96.80MiB slots=4,221,205@308.02MiB codes=2596@19.81MiB nodes=21,541,853@5.58GiB pending=26107
The number of nodes is 21,541,853? Perhaps it's referring to another data structure? I don't know how to interpret that to estimate the completion time.
I hope my feedback can make geth logs better.
The logs did print the number of accounts: mine showing about 1M accounts, however AFAIK there are 200M accounts, and I've been syncing for 4+ days nonstop. At this rate, it will take 1+ years to fully sync.
As I have already said: there's no point trying to infer the state of the sync based on how many nodes or accounts you have synced. I know it's tempting (users compare those numbers all the time), but it's not a good metric. You cannot extrapolate the remaining state-heal time from those numbers.
So I gave besu a try, syncing from scratch using checkpoint, and it finished in a bit less than 2 days
Good to hear!
I don't know how to interpret that to estimate the completion time.
No, as I've said, that's not the point. If it could be done, we would be doing it already, and show you the estimate.
I have a similar problem. When do you plan to add this fix to the ethereum/client-go stable docker tag?
Weird behaviour from geth when putting a client online after it's been offline for a few months. I would really appreciate it if someone could explain what's going on / put me on the right track.
So far happened:
I am using the "--syncmode=snap" option. With "--syncmode=snap" enabled, I have to admit I am not sure in which of the two ways, (A) or (B), geth is syncing:
A. The full-node way:
B. Snapshot download:
Is there some criterion for which of the two ways geth ends up using? Say, if the state is old enough, do (B) instead of (A)?
Obviously, I would prefer (A), since I'm not syncing from scratch. Should I then remove "--syncmode=snap"?
Next, let's take a look at the logs:
Let's continue:
The client was running ~4 days until I got fed up and took it offline. It got this far:
I interpret this as: the downloaded and verified blocks are ok (it seems to reach the latest block occasionally), but the client is unable to reconstruct the up-to-date world state "SC".
I don't understand why. Once a block is downloaded and verified you get the state change, so what is interfering with the state update/reconstruction here?
Is there some competition/contradiction going on between (A) and (B)? (see above)
Some details on my system: 8 x Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz, 2TB SSD drive ("Crucial BX500"), 12 GB RAM (& all the swap you could wish for) (as mentioned, this was able to snap sync from scratch before summer).
P.S. It would be important to be able to contribute to the network without being a 24/7 on-call eth-client-massage service provider.. to be offline if needed, etc. Imagine if I'd been staking at this point..!