Closed — elsampsa closed this issue 2 years ago
It looks like you were not fully snap synced when you shut down your node the last time. You were most likely still in the state heal phase; otherwise geth would have imported blocks one by one afterwards. The problem now is that you have a really old state that is not fully healed, so a lot of accounts, balances etc. are different from the current state. In the state healing phase these states are fetched one by one from the network. Since there are so many differences between you and the network, the state heal phase will likely never finish. The best way to fix this is to stop geth and remove the old state that you have (via `geth removedb`, answering `y` for the state database and `n` for the ancient database). This will force the node to start a new snap sync, which should finish in a couple of hours.
Regarding your PS: nodes can recover very quickly from a couple of days of downtime once they are fully synced (and healed). Since the gas limit in Ethereum was voted up and is pretty high atm, the nodes have to do a lot of computation, especially if they want to sync up to the network.
Thanks for the reply. But I am 100% sure that it synced.. checked it from the geth console, etc.
I guess there's nothing else left to do than yet another snap sync; I'll keep this ticket open a few more days just in case..
To confirm: once the snap sync is over, geth will behave like a normal "full" node.. there's no condition/bug? under which it would start a snap sync again. And it's a pity no one ever explains that "Unexpected bytecode heal packet"..!
Yes, a node will never start a sync again after it's been synced once. The "unexpected bytecode heal packet" is printed either when a peer sends us data that we haven't requested, or when a reply arrives after the request has already timed out (so you request something, the peer takes too long to respond, you stop tracking the request, the peer finally responds).
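The mechanism described above can be sketched as a small request tracker. This is purely an illustration of the timeout/late-reply race, not geth's actual code; the class and names are hypothetical:

```python
# Illustrative sketch (NOT geth's actual implementation) of why an
# "unexpected ... heal packet" message appears: requests are tracked by id,
# and a reply arriving after the tracker dropped the request (timeout)
# can no longer be matched, so it counts as "unexpected".

class RequestTracker:
    def __init__(self):
        self.pending = {}  # request id -> what we asked the peer for

    def send(self, req_id, item):
        self.pending[req_id] = item

    def timeout(self, req_id):
        # We gave up waiting; stop tracking the request.
        self.pending.pop(req_id, None)

    def on_reply(self, req_id, data):
        if req_id not in self.pending:
            return "unexpected heal packet"  # late or unsolicited reply
        del self.pending[req_id]
        return "delivered"

tracker = RequestTracker()
tracker.send(1, "bytecode 0xabc")
tracker.timeout(1)                    # peer took too long, request dropped
print(tracker.on_reply(1, b"..."))    # -> unexpected heal packet
```

The same message also covers the other case holiman mentions: a reply with an id that was never requested at all.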
I'm running
geth removedb
Should I remove the ancient database as well..?
I kept the ancient database & now it seems to work like a full node..?
syys 12 08:23:08 sampsa-asus geth[2567]: INFO [09-12|08:23:08.248] Imported new block receipts count=18 elapsed=135.821ms number=14,627,793 hash=769804..0be43a age=4mo3w2d
....
syys 12 08:23:03 sampsa-asus geth[2567]: WARN [09-12|08:23:03.989] Pivot seemingly stale, moving old=15,519,069 new=15,519,133
State heal logs missing at the moment
I've spent some time looking around and trying to understand the differences between geth's sync modes but am still left with some open questions. For example: does a geth node that initially snap or fast syncs end up with the same database state as a "full node" that has executed all of the transactions since the genesis block? Or does snap sync involve only acquiring a "digest" of the storage state of all addresses as of the sync point, such that future requests made to this node for storage as of some older block would then involve the node making new requests to peers to acquire that older state? Please forgive me if any of this is malformed. I'm asking these questions having made some possibly incorrect assumptions about "everything."
Having just written these questions, I thought of a better google query, and found this: https://github.com/ethereum/go-ethereum/issues/24413
Probably a good idea to incorporate some/all of the wisdom there, into the docs.
After doing `geth removedb` (but keeping the ancient database/freezer) it's now in the state heal process again. It downloaded all missing blocks and after that the "state heal" logs started to appear.
Now, as always, logs are saturated with:
Unexpected trienode heal packet
State healing is painstakingly slow:
syys 13 16:51:30 sampsa-asus geth[1325]: INFO [09-13|16:51:30.912] State heal in progress accounts=524,748@26.14MiB slots=1,009,090@76.24MiB codes=877@6.82MiB nodes=5,091,192@1.26GiB pending=29246
Maybe 500 000 accounts per 24h ..?
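For what it's worth, the counters in such a log line can be pulled out programmatically. This is a small sketch assuming the field layout shown in the sample above (the format may differ between geth versions), and — as the maintainers point out later in the thread — these counts cannot be turned into a reliable ETA:

```python
import re

# Sample "State heal in progress" line, copied from the log above.
LINE = ("INFO [09-13|16:51:30.912] State heal in progress "
        "accounts=524,748@26.14MiB slots=1,009,090@76.24MiB "
        "codes=877@6.82MiB nodes=5,091,192@1.26GiB pending=29246")

def heal_counters(line):
    """Parse key=count@size pairs; counts use thousands separators."""
    pairs = re.findall(r"(\w+)=([\d,]+)@", line)
    counters = {k: int(v.replace(",", "")) for k, v in pairs}
    # "pending" has no @size suffix, so it is handled separately.
    counters["pending"] = int(re.search(r"pending=(\d+)", line).group(1))
    return counters

print(heal_counters(LINE))
```

Useful for graphing progress over time, but note there is no known total to compare the counters against.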
So `geth removedb` didn't help. Now what? Should I also remove the "ancient" database and try to do a snap sync?
- download recent global state "SC0" using snap
- then that global state needs to be "healed" using the downloaded blocks: "SC0" --> blocks --> "SC1" (current global state)
Nobody knows why.. just (clueless) nerds (like myself) trying to scrape information from the internet. Do you have any "testing" lab where you take clients on & offline & try to disturb them in a private geth ethereum network? Or does "ci/cd testing" equal "merge" (I hope not).
EDIT: maybe the ancient database is corrupt..? Does the ancient database include a state trie that is loaded in the beginning & that geth then tries to heal..
..or is the latest state fetched with snap sync if I have done `geth removedb` (but not removed the ancient database)?
On a related note, I ended up with the impression that it has become quite difficult to stand up a full node instead of using centralized providers such as Alchemy/Moralis/Infura etc. because of these types of troubles (and perhaps my last attempt was when snap sync was too new). That slowness and/or trouble in this area ends up pushing dApp providers to rely on centralized resources such as Infura/Moralis/Alchemy (vs. it being super easy to stand up their own nodes), which may be an issue to think about (pros and cons to each).
It would be really nice to be able to stand up a full node that can answer historical requests (gcmode archive) quickly by acquiring that database and/or being able to forward such queries to/from peers, and then verifying it on a best-efforts basis over a longer timeframe. I'm not fully conversant on the security implications and/or implementation difficulties and/or bandwidth/resource management issues of doing such a thing -- but I sense it would help reduce centralization onto a handful of dApp service providers.
Similar scenario on my nodes as well. I have 2 geth archival nodes connected to 2 beacon nodes and both seem to have completed the merge. But the nodes seem to be syncing very slowly. Please find the status from both nodes below.
GETH
node 0 - INFO [09-21|06:25:06.231] Imported new potential chain segment blocks=1 txs=295 mgas=22.221 elapsed=9.974s mgasps=2.228 number=15,559,291 hash=0961a7..1d9571 age=2d21h47m dirty=0.00B
node 1 - INFO [09-21|06:25:07.196] Imported new potential chain segment blocks=1 txs=371 mgas=27.281 elapsed=9.188s mgasps=2.969 number=15,557,806 hash=9b6067..887a31 age=3d2h46m dirty=0.00B
BEACON
node 0 - time="2022-09-21 06:26:27" level=info msg="Processing block batch of size 64 starting from 0xb47a08d1... 4722176/4743130 - estimated time remaining 15m39s" blocksPerSecond=22.3 peers=49 prefix=initial-sync
node 1 - time="2022-09-21 06:25:20" level=info msg="Processing block batch of size 64 starting from 0x2ff82104... 4720690/4743124 - estimated time remaining 29m19s" blocksPerSecond=12.8 peers=45 prefix=initial-sync
The slots (3 days behind the current slot) in the beacon node seem to be in sync with the block shown in the geth node. Can anyone confirm whether this is healthy and our sync is well and good, just a bit slow?
Now, as always, logs are saturated with:
Unexpected trienode heal packet
This has actually been fixed, but not pushed to stable (a.k.a. v1.10.xx). If you use latest (a.k.a. 1.11.xx -unstable) you should not see these messages (as much). Several issues related to snap sync were fixed, and many cases that were previously unable to sync due to IO performance are now able to.
So, I recommend switching geth version.
So `geth removedb` didn't help. Now what? Should I also remove the "ancient" database and try to do a snap sync?
I recommend `geth removedb` but keep ancients intact. @Sreeram1993
node 0 - INFO [09-21|06:25:06.231] Imported new potential chain segment blocks=1 txs=295 mgas=22.221 elapsed=9.974s mgasps=2.228 number=15,559,291 hash=0961a7..1d9571 age=2d21h47m dirty=0.00B
node 1 - INFO [09-21|06:25:07.196] Imported new potential chain segment blocks=1 txs=371 mgas=27.281 elapsed=9.188s mgasps=2.969 number=15,557,806 hash=9b6067..887a31 age=3d2h46m dirty=0.00B
These numbers aren't great -- as you can see, importing one block takes almost 10s. That is very close to the block time. Obviously, if import is slower than the block time, then you will not be able to keep up. Other than that, yeah, looks ok to me.
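To make the arithmetic behind that assessment explicit: `mgasps` in the import log is just `mgas` divided by `elapsed`, and the elapsed time is what must stay under the ~12s post-merge slot time. A quick check using the "node 0" values from the log above:

```python
# Values copied from the "node 0" import line above.
mgas = 22.221       # million gas executed in the block
elapsed_s = 9.974   # seconds taken to import it

# mgasps is simply throughput in million gas per second.
mgasps = mgas / elapsed_s
print(f"{mgasps:.3f} Mgas/s")  # matches the logged mgasps=2.228

# A node keeps up only if it imports a block faster than one arrives.
BLOCK_TIME_S = 12.0  # post-merge slot time
print("keeping up" if elapsed_s < BLOCK_TIME_S else "falling behind")
```

At 9.974s per block against a 12s slot, the node stays ahead, but with almost no margin for bigger blocks or other work.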
Oh, btw -- is there really a point in keeping this bug open? Seems like the discussion has trailed off into new areas?
Thanks for the update @holiman. Is there any way to improve the import speed? I am already using a 16TB SSD and the nodes are in AKS as well. Any solution to increase peers or network bandwidth so that the archival node stays up to date?
Increasing peers will most likely have a negative impact. I don't know what AKS means.
If you are interested in trying something new, we have a new database backend which we think will be an improvement: https://github.com/ethereum/go-ethereum/pull/24615 . Unfortunately, we don't have any conversion-tool, so the only thing one can do with it is to sync from scratch -- which is a PITA with archive node, I know, so I don't actually expect you to try it out.
Other than that, I don't have a lot of ideas.
@holiman Got it fixed and I started running. But I get this INFO multiple times for the same block.
INFO [09-22|13:01:36.034] Imported new potential chain segment blocks=1 txs=210 mgas=25.110 elapsed=1m19.317s mgasps=0.317 number=15,588,279 hash=461b03..8964d1 age=2h34m37s dirty=0.00B
INFO [09-22|13:01:36.176] Imported new potential chain segment blocks=1 txs=210 mgas=25.110 elapsed=142.118ms mgasps=176.686 number=15,588,279 hash=461b03..8964d1 age=2h34m37s dirty=0.00B
INFO [09-22|13:01:36.318] Imported new potential chain segment blocks=1 txs=210 mgas=25.110 elapsed=141.422ms mgasps=177.556 number=15,588,279 hash=461b03..8964d1 age=2h34m37s dirty=0.00B
Certain blocks show the import just once while some show multiple info messages like the above. Is it corrupting the node?
I have the same issue after upgrading the geth syncmode from full to snap. 12 hours later it is still syncing with no clear indication of when or if the process will end. There are some posts saying that the "accounts" parameter should reach the total number of Ethereum accounts (201M today) for the snap sync process to end. Which, in this case, means it will never end. Or shall I simply be more patient?
Sep 23 12:34:04 eth2-validator-0 bash[24081]: INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889
In an attempt to follow @holiman's recommendations, I failed to locate the latest version of geth (a.k.a. 1.11.xx -unstable). The last version published on github is 1.10.25. A google search on "go-ethereum v1.11.1" is not helpful either. Any help is welcome.
All it took was enough patience. Snap sync finalized after 20 hours on a GCloud 16vCPU (e2-standard-16) + Balanced Persistent Disks. It will be downgraded to e2-standard-4 now that it is fully synced. Contrary to what is reported in other posts, the accounts parameter in the log line below does not need to reach the total number of Ethereum accounts for the snap sync process to finish.
Sep 23 12:34:04 eth2-validator-0 bash[24081]: INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889
@Yvelo The issue is that mine is an archival sync node :D So I think I need a lot more patience. Happily it has reached 6 hours behind.
I failed to locate the latest version of geth (a.k.a. 1.11.xx -unstable). The last version published on github is 1.10.25. A google search on "go-ethereum v1.11.1" is not helpful either. Any help is welcome.
Didn't see this until now. On any of our "Releases" pages, e.g. https://github.com/ethereum/go-ethereum/releases/tag/v1.10.25 , we link to
Pre-built binaries for all platforms on our downloads page.
On that download-page, we have 'stable' and 'develop' builds.
I am doing sync from scratch with geth version: 1.10.25-stable
With this command line:
geth --cache=12048 --http --datadir /path/to/geth/snap-data --authrpc.jwtsecret /path/to/jwt.hex
The datadir is now about 605G, the "State heal in progress" phase started about 2 days ago, and the logs are now showing:
INFO [10-18|00:41:59.699] State heal in progress accounts=1,800,572@96.80MiB slots=4,221,205@308.02MiB codes=2596@19.81MiB nodes=21,541,853@5.58GiB pending=26107
So it is progressing at about 1M accounts per day? At this rate, I'd need to wait 200 days for it to fully sync 200M accounts?
@Yvelo can you share the number of accounts shown in your logs when it's fully synced? Is it around 169K accounts?
What indicator in the log can be used to estimate the time for the "State heal" to finish?
If I run this command:
geth attach http://localhost:8545
and type:
> eth.syncing
{
currentBlock: 15773502,
healedBytecodeBytes: 20841216,
healedBytecodes: 2605,
healedTrienodeBytes: 6011477028,
healedTrienodes: 21616064,
healingBytecode: 0,
healingTrienodes: 1153,
highestBlock: 15773610,
startingBlock: 0,
syncedAccountBytes: 45757800173,
syncedAccounts: 192540876,
syncedBytecodeBytes: 4627746078,
syncedBytecodes: 700408,
syncedStorage: 814610039,
syncedStorageBytes: 174772608597
}
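Of the fields in that output, only the block numbers lend themselves to a direct reading; the healed*/synced* counters have no known total to compare against. A small sketch using the values shown above:

```python
# A subset of the eth.syncing output shown above (values copied verbatim).
syncing = {
    "currentBlock": 15_773_502,
    "highestBlock": 15_773_610,
    "healedTrienodes": 21_616_064,
    "syncedAccounts": 192_540_876,
}

# Head lag is the one directly interpretable figure.
block_lag = syncing["highestBlock"] - syncing["currentBlock"]
print(f"head lag: {block_lag} blocks")  # -> head lag: 108 blocks

# healedTrienodes has no fixed target, so it cannot be turned into a
# percentage or an ETA for the state-heal phase.
```

This matches the maintainers' point later in the thread: node/account counts are not a usable progress metric.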
There is no saying what the "highest" number of trie nodes is?
So it is progressing at about 1M accounts per day? At this rate, I'd need to wait 200 days for it to fully sync 200M accounts?
No, that's not how it works. But I'm curious, why are you using 1.10.25? As I've recommended above, please use latest (a.k.a. 1.11.xx), which has a lot of snap-sync related fixes.
There is no saying what is the "highest" number of trie nodes are there?
There's no point trying to infer the state of the sync based on how many nodes or accounts you have synced. I know it's tempting (users compare those numbers all the time), but it's not a good metric.
It's like two people walking up an escalator that is going down: comparing how many steps they have taken tells you little, because they walk at different speeds, and one started near the bottom while the other started close to the top.
Thanks for the prompt reply, @holiman!
I used 1.10.25 because I started using geth around 5 days ago without knowing about this bug. I only just read about it now. Hearing that I should delete everything and start from scratch again feels discouraging.
If I were to use 1.11.xx, what is the expected time to sync from scratch to full? Should I use the same command line as above? or is there a faster way to do it? perhaps like a checkpoint sync?
I am wondering: so there is no good metric at all to tell the ETA for the state heal? I wouldn't need to compare with other people if the log showed any indicator of the ETA ...
Hearing that I should delete everything and start from scratch again feels discouraging.
Just switch then, you don't have to restart from scratch.
If I were to use 1.11.xx, what is the expected time to sync from scratch to full? Should I use the same command line as above? or is there a faster way to do it? perhaps like a checkpoint sync?
A couple of days. Yes. No. No.
I am wondering, so there is no good metric at all to tell what is the ETA for the state heal? I don't need to compare with other people if the log shows any indicator about the ETA ...
No. If there were one, you can rest assured we would have added it and displayed it promptly to the users. It's something that users have requested consistently since 2016.
I cannot install 1.11.xx from brew on MacOS using this doc:
https://geth.ethereum.org/docs/install-and-build/installing-geth
Running:
brew install ethereum --devel
I got:
Error: invalid option: --devel
If I go to the downloads page:
https://geth.ethereum.org/downloads/
I can't seem to find the 1.11 on the "Develop builds" section?
it looks empty.
When will the 1.11 be released into stable?
I can't seem to find the 1.11 on the "Develop builds" section?
Hm, yes, that's some bug in our pipeline. We'll look into it.
Fixed, macos binaries available again
It is quite a frustrating experience to use geth to sync from scratch without any indicator of when it will complete. The logs did print the number of accounts: mine shows about 1M accounts, but AFAIK there are 200M accounts, and I've been syncing for 4+ days nonstop. At that rate, it would take 1+ years to fully sync. I am not comfortable using the unstable 1.11 version. I don't know what other issues it might have, and I don't want to risk it for now.
So I gave besu a try, syncing from scratch using checkpoint, and it finished in a bit less than 2 days using this command:
besu \
--sync-mode=X_CHECKPOINT \
--data-storage-format=BONSAI \
--data-path=besu-data-checkpoint \
--rpc-http-enabled \
--engine-jwt-enabled=true \
--engine-jwt-secret=/path/to/jwt.hex \
--engine-host-allowlist="*" \
--metrics-enabled \
--Xplugin-rocksdb-high-spec-enabled
The logs are also very clear; they show:
2022-10-20 07:25:35.524-07:00 | EthScheduler-Services-26 (batchPersistTrieNodeData) | INFO | SnapsyncMetricsManager | Healed 15342826 world state nodes
The current number of nodes is 15790393, so that is how I know it's near the finish.
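Because the besu log pairs healed nodes with a running total, a rough progress percentage can be computed from it (using the two figures quoted above):

```python
# Figures from the besu log quoted above.
healed_nodes = 15_342_826       # "Healed ... world state nodes"
estimated_total = 15_790_393    # current node-count estimate

# A rough progress figure -- possible here only because besu reports
# a total alongside the healed count.
progress = healed_nodes / estimated_total
print(f"{progress:.1%} healed")  # -> 97.2% healed
```

Note the total is itself an estimate and keeps moving as the trie is discovered, so this is indicative rather than exact.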
Compare that with the log from geth:
INFO [10-18|00:41:59.699] State heal in progress accounts=1,800,572@96.80MiB slots=4,221,205@308.02MiB codes=2596@19.81MiB nodes=21,541,853@5.58GiB pending=26107
The number of nodes is 21,541,853? Perhaps it's referring to another data structure? I don't know how to interpret that to estimate the completion time.
I hope my feedback can make geth logs better.
The logs did print the number of accounts: mine showing about 1M accounts, however AFAIK there are 200M accounts, and I've been syncing for 4+ days nonstop. At this rate, it will take 1+ years to fully sync.
As I have already said: there's no point trying to infer the state of the sync based on how many nodes or accounts you have synced. I know it's tempting (users compare those numbers all the time), but it's not a good metric. You cannot extrapolate the remaining state-heal time from those numbers.
So I gave besu a try, syncing from scratch using checkpoint, and it finished in a bit less than 2 days
Good to hear!
I don't know how to interpret that to estimate the completion time.
No, as I've said, that's not the point. If it could be done, we would be doing it already, and show you the estimate.
I have a similar problem. When do you plan to add this fix to the ethereum/client-go stable docker tag?
Weird behaviour from geth when putting a client online after it's been offline for a few months. I would really appreciate it if someone could explain what's going on / put me on the right track.
So far happened:
I am using the "--syncmode=snap" option. With "--syncmode=snap" enabled, I have to admit I am not sure in which of the two ways, (A) or (B), geth is syncing:
A. The full-node way:
B. Snapshot download:
Is there some criterion for which of the two ways geth ends up using? Say, if the state is old enough, do (B) instead of (A)?
Obviously, I would prefer (A), since I'm not syncing from scratch. Should I then remove "--syncmode=snap"?
Next, let's take a look at the logs:
Let's continue:
The client was running ~4 days until I got fed up and took it offline. It got this far:
I interpret this as: the downloaded and verified blocks are ok (it seems to reach the latest block occasionally), but the client is unable to reconstruct the up-to-date world state "SC".
I don't understand why. Once a block is downloaded and verified you get the state change, so what is interfering with the state update/reconstruction here?
Is there some competition/contradiction going on between (A) and (B)? (see above)
Some details on my system: 8 x Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz, 2TB SSD drive ("Crucial BX500"), 12 GB RAM (& all the swap you could wish for) (as mentioned, this was able to snap sync from scratch before summer).
P.S. It would be important to be able to contribute to the network without being a 24/7 on-call eth-client-massage service provider.. to be offline if needed, etc. Imagine if I'd been staking at this point..!