Unable to sync with geth 1.8.1

Fargusson commented 6 years ago

Hi there,

System information

Geth version: 1.8.1 OS & Version: Windows 10 Hardware: Personal laptop with 256GB SSD used for storing the chaindata folder, 16GB RAM and a Pentium i7.

Expected behaviour

Did a clean install of Mist right after the Iceberg update. Resynced from scratch hoping to be all setup after a few hours (it took 3-6 hours with geth 1.7.x, which I did several times because my SSD is also my system disk so I've got only around 100GB left which means I had to do a fresh "fast" sync every two weeks).

Actual behaviour

Geth seems to be unable to switch to --syncmode "full". Stop downloading blocks somewhere around 100 blocks before the last block, stop for an hour, start downloading the chain structure, download a dozen new blocks in "fast" mode, stop for an hour, Loop.

See the attached log file from one of my last runs which lasted 2 days. Hope it will help to find where the problem is.

It seems that I get in sync with the network with a some bugs when launching geth from a console forcing --syncmode "full" (see attached print screen) but when I reopen Mist I come back into the syncing loop I described above and all the full blocks I downloaded with geth only seem to be gone (see the other attached log below).

Closing/reopening Mist or geth multiple times was also unsuccessful.

Steps to reproduce the behaviour

I got no idea but as it seems not everyone is having the same matter, I'll bet it's pretty random.

Backtrace

Geth log from a two days try

geth_prtscr

Geth log from my latest try

nikashitsa commented 6 years ago

I have same issue

pedronfigueiredo commented 6 years ago

I'm having the same issue. Removed db several times and tried --syncmode="fast" and --syncmode="light" without success

cezekwe commented 6 years ago

I had a similar issue and found that I needed to change my firewall to allow inbound/outbound traffic on 30303. Looking at your log is seems like you don't connect to anymore than 2-3 peers at a time.

Once I changed my firewall settings, my node started syncing MUCH faster. Although now I am still 65 blocks behind the mainnet.

In the article for Iceberg's release (https://blog.ethereum.org/2018/02/14/geth-1-8-iceberg%C2%B9/), the author said that they were able to complete a fast sync in ~3 hours using an m4.2xlarge instance. Could you guys (devs) tell me what command line options you used for that test? I am running on similar hardware and my geth node has been 65 blocks behind the mainnet for the past 2 days.

Fargusson commented 6 years ago

I also gave full permission for geth.exe on my firewall and I opened the 30303 UDP and TCP ports manually in my router. So it doesn't seem to be the issue here unfortunately.

EDIT: I'll link your post to mine if you don't mind.

cezekwe commented 6 years ago

Can you attach to your running geth node and verify that you have 20+ peers (in the console type admin.peers.length)? I swear your logs looked exactly like mine until I realized my firewall was getting in the way.

remyroy commented 6 years ago

I can confirm this.

I'm having a similar issue as well. I've been trying a few times now to get a full sync with 1.8.1 but I'm always behind. Deleted my whole database many times and restarted my node many times and I still cannot get a full sync. Peer count seems quite low (7 and 5 the last two time I checked).

I will do some more testing with my firewalls/router and see if I can get more peers.

mikerowehl commented 6 years ago

I'm having a very similar issue with 1.8.1-stable on Ubuntu running on AWS. I've tried both restarting geth and letting it pick back up, and blowing away the whole chain and trying fresh - and I've tried both a few times each. My logs always end up eventually looking like the excerpt in this gist. I'm not sure I'm reading that right, but it looks like the number of pending state entries gets down to 0 and then pops back up to a few tens of thousands. There's dropped peer notices in there, but looking at the code it seems like those would just happen whenever a node goes offline. Not something to worry about?

When I'm trying to fast sync I have plenty of peers, but eth.syncing always shows a sync status and the eth.blockNumber is always 0:

> net.peerCount
25
> eth.blockNumber
0
> eth.syncing
{
  currentBlock: 5175777,
  highestBlock: 5175842,
  knownStates: 22952,
  pulledStates: 11345,
  startingBlock: 5175723
}

That console output is from right after I had restarted geth, which is why the knownStates and pulledStates values are so low. It has been running for about 7 hours with the numbers up around 92 million, but not finishing. So I tried restarting the process again.

This is on a pretty decent AWS machine (t2.xlarge) with a gp2 SSD and 30303 TCP and UDP open to the world. I've seen posts where people say they've used a lot less horsepower to sync, so I assume that's not the problem. But how can I tell?

I'm able to do a light sync on machines with the same setup and have no issues. But syncmode fast I haven't been able to get to complete yet.

Fargusson commented 6 years ago

Ok as @cue0083 advised I runned geth console 2> geth.log

As expected, I got between 21 and 23 peers, it downloaded quickly a few blocks in "fast" mode (and then stopped 65 blocks from the highest) while syncing up the chainstates from scratch (again). I'll let that run for a while and see what it gives me after a few hours downloading the chainstates.

EDIT: As expected after some times I started dropping some peers (going to 3-4 peers) and eth.syncing returned false. See the attached log here

remyroy commented 6 years ago

I configured my router to port forward 30303 and I got more peers with 1.8.1-stable on Windows but I still couldn't fully sync to the latest head. I was always a few blocks behind it seems with eth.blockNumber staying at 0 similar to @mikerowehl . I feel like this might have to do with the parameters I was using to run geth. Since 1.8 release I've been starting geth with a similar command:

geth --ipcdisable --verbosity 2 console

I tried just starting geth with all default parameters

geth

And I got a synced database with eth.blockNumber == 5177567 now. I restarted geth with my original parameters and it seems like the issue is solved. I can see that eth.blockNumber is slowly growing and matching public blockchain explorers.

A temporary workaround for this issue might be to start geth with all the default parameters until you get a fully synced database and restart it with your own parameters once that is done.

cezekwe commented 6 years ago

@remyroy I will give that a try! When you started geth with default parameters, did you start syncing with an empty DB?

remyroy commented 6 years ago

@cue0083 No, I started after I tried syncing and failed. So the state of my database wasn't empty or clean, but my guess is that it should also work with an empty or clean starting state.

cezekwe commented 6 years ago

@remyroy How long did it take for you to sync after running with the default parameters?

remyroy commented 6 years ago

@cue0083 I don't know exactly how long it took. I left my node running overnight. I came back and it was synced. So around 7 hours or less from the point where I already started the syncing process and failed it.

mikerowehl commented 6 years ago

Hi folks, thanks for the extra info. I'm pretty sure running without parameters is the same as running with syncmode fast, looking at the code it seems to be the default: https://github.com/ethereum/go-ethereum/blob/55599ee95d4151a2502465e0afc7c47bd1acba77/eth/config.go#L38

I'm giving it a try however, thanks for the suggestion @remyroy.

> eth.syncing
{
  currentBlock: 5177635,
  highestBlock: 5177700,
  knownStates: 15606,
  pulledStates: 6265,
  startingBlock: 5177622
}

Always 65 blocks behind, like I've seen others mention. I'll leave it running for a while .

karalabe commented 6 years ago

So, fast sync works by downloading all the blocks and only verifying the proof-of-work, and then it downloads the state trie separately. The state trie grew enormous in the last year, so downloading that takes a lot longer than downloading the blocks.

The reason all of you are seeing that geth is stuck, is because it already has all the blocks, but it still busy downloading the state. Unfortunately Ethereum's trie design doesn't allow knowing in advance how much state data there is, we only operate on cryptographic hashes and finish when no more references are missing. This is why we can't show a progress bar.

karalabe commented 6 years ago

@mikerowehl Downloading the state trie and storing it locally can take half a day on a consumer laptop + ssd.

mikerowehl commented 6 years ago

Thanks @karalabe. I saw the trie count (I'm assuming that's the "Imported new state entries" lines) get up over 90 million and then seemed to grow less quickly. This is on a t2.xlarge AWS machine with SSD running only geth, and it was running for over 24 hours.

Is there any way to estimate that trie data count? I saw logs from folks that seemed to have synced during Feb and had 80 million states, so when I got up over 90 I thought I had to be at or close to the end.

cezekwe commented 6 years ago

@karalabe Is it possible for us to see the total amount of state data that a geth node has downloaded? It seems like the count gets restarted each time we restart geth.

karalabe commented 6 years ago

Unfortunately we don't have a meaningful way to estimate it. I'm pondering about a new sync model that would not download the trie by individual nodes, rather by accounts (those we could estimate) and then fill some gaps via nodes. That's for Geth 1.9 to soonest though as it needs some data modifications and protocol updates.

karalabe commented 6 years ago

@cue0083 We don't save the progress counter currently, that's why it seems to restart from 0, but it doesn't even download duplicates. I should really save the counter somehwere to make in nicer.

mikerowehl commented 6 years ago

I see, thanks for the info. I've started up my fast sync again with the cache cranked up and I'll leave it running for a few days. Appreciate you taking the time to respond.

nikashitsa commented 6 years ago

I made ports open and got sync in around 11 hours. Huh... Finally...

cezekwe commented 6 years ago

@karalabe If someone like @nikashitsa shared the size of their chaindata folder would that give those of us still syncing an idea of how much more data we have to pull?

Fargusson commented 6 years ago

Thanks @karalabe for your answers. I do understand that it takes some more times to download the state trie than to download the blocks in "fast" mode. But I was able to download it within around 6 hours with Geth 1.7 and now I'm not even capable doing it in 2 days with Geth 1.8 and the exact same config.

Any idea why ?

nikashitsa commented 6 years ago

@cue0083 ~ 67GB

iwooden commented 6 years ago

@karalabe I don't think this has to do with state trie size. We have many customers that are deploying private networks using geth 1.8.1 and are running into this same exact issue - syncing to a network with a few hundred blocks shouldn't take a day. It's transient and doesn't happen on all deployments, but I'd say ~60% of deployments fail to sync with logs that are exactly like the OP.

If the solution mentioned above of making sure your firewall settings are set up so that you can connect to 20+ peers is the only way to sync properly, then clearly this won't work for private network models where we very commonly have 3 to 5 nodes running.

I believe that something was changed in the sync code in 1.8.0 or 1.8.1 that is causing this issue. Our current workaround is locking our deployments to 1.7.2 - this seems to work fine, but obviously we'd like to upgrade when syncing works.

karalabe commented 6 years ago

@iwooden Could you share your setup? Or some means for us to reproduce the issue?

iwooden commented 6 years ago

Sure, let's say we have 3 VMs running with geth 1.8.1 installed, where each node has the other two nodes' enode URL in ~/.geth/static-nodes.json, started with the following options:

geth --datadir /home/eth-node/.geth --identity <node hostname> --rpc --rpcport 8545 --rpcaddr 0.0.0.0 --port 30303 --ethstats <ethstats URL> --nodiscover --networkid 12345

Turn on mining on two of the nodes, and let it mine a few hundred blocks or so. Next, spin up a new VM, place the non-mining node's enode URL into ~/.geth/static-nodes.json, and run it with the same options as above. This is a bit contrived because the 4th node will only have one peer, but it should repro. As I said above, the issue is transient, so if it doesn't happen on the first deployment, delete the chaindata folders on the 4th node and try again.

mikerowehl commented 6 years ago

I have 67GB of data pulled down:

ubuntu@ip-172-31-44-245:~$ du -sh .ethereum/
67G .ethereum/

And my node now says it's not syncing, but also doesn't return a block number and is returning 0 for balance on my address:

> net.peerCount
23
> eth.blockNumber
0
> eth.syncing
false
> eth.getBalance("0x9c9234420501c820f73cfa44e822d27e1baab4d5")
0

I'll leave this one running to see if it ever gets to a sane state. But in the meantime going to try out one of the 1.7 releases to see if I can get a full sync working with that.

@karalabe My node that isn't syncing is a test machine running on AWS. If this isn't something you've been able to reproduce let me know and send me an SSH public. Happy to add you to the machine so you can get on and investigate.

karalabe commented 6 years ago

@mikerowehl what are your latest logs?

mikerowehl commented 6 years ago

@karalabe Here's the most recent chunk: https://gist.github.com/mikerowehl/bcff1d20bedf5e2e6a3af0bd15555fa9

karalabe commented 6 years ago

Yeah, so it's downloading the state trie. That can take up to even half a day, north of 85-90M entries.

mikerowehl commented 6 years ago

Okay, I'll let it keep rolling then.

iwooden commented 6 years ago

To clarify, the log entries that I'm seeing are not like mikerowehl's. We're seeing the "state node failed for all peers" issue referenced in the OP, repeating for hours.

pedronfigueiredo commented 6 years ago

Machine: MacBook Air (13-inch, Mid 2012) with 128Gb SSD OS: macOS 10.13.3 (17D102) Version: geth 1.8.1-stable

> geth --rinkeby --syncmode="light"

Syncing gets stuck in

INFO [03-02|13:42:47] Maximum peer count                       ETH=0 LES=100 total=25
INFO [03-02|13:42:47] Starting peer-to-peer node               instance=Geth/v1.8.1-stable/darwin-amd64/go1.10
INFO [03-02|13:42:47] Allocated cache and file handles         database=/Users/pedro/Library/Ethereum/rinkeby/geth/lightchaindata cache=768 handles=128
INFO [03-02|13:42:47] Persisted trie from memory database      nodes=355 size=65.27kB time=696.614µs gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [03-02|13:42:47] Initialised chain configuration          config="{ChainID: 4 Homestead: 1 DAO: <nil> DAOSupport: true EIP150: 2 EIP155: 3 EIP158: 3 Byzantium: 1035301 Engine: clique}"
INFO [03-02|13:42:47] Loaded most recent local header          number=0 hash=6341fd…67e177 td=1
INFO [03-02|13:42:47] Starting P2P networking 
INFO [03-02|13:42:50] UDP listener up                          net=enode://f9c1c399c9ddc8d4971d4a24c69438f0f04ccbbe478b4408e8b74aad15d1aa2fe89334cf2170846481d15dd473db69557e2c0477eb5b292513ef688e89ee5117@[::]:30303
WARN [03-02|13:42:50] Light client mode is an experimental feature 
INFO [03-02|13:42:50] RLPx listener up                         self="enode://f9c1c399c9ddc8d4971d4a24c69438f0f04ccbbe478b4408e8b74aad15d1aa2fe89334cf2170846481d15dd473db69557e2c0477eb5b292513ef688e89ee5117@[::]:30303?discport=0"
INFO [03-02|13:42:50] IPC endpoint opened                      url=/Users/pedro/Library/Ethereum/rinkeby/geth.ipc

mikerowehl commented 6 years ago

It took a long time, almost 2 days, but this morning my 1.8.1-stable node finished fast sync.

pedronfigueiredo commented 6 years ago

No success here yet, it's not moving past IPC endpoint opened... 🤔

Fargusson commented 6 years ago

Hi everyone.

I finally managed to get synced today after four days which is, I say it again, quite odd given that it used to be between 6-12 hours with Geth 1.7.x.

You'll find all the logs of my recent try from block 0 to block 5,226,862. I had to restart a few times, among which one for Geth 1.8.2 upgrade: that's why you'll find 4 logs.

For some reason, the "err="state node ff049a…5485bd failed with all peers (X tries, X peers)" WARN" appears to be slowing down the state trie downloading process.

I hope this will help.

Log file from March 5th to March 6th Geth 1.8.2 upgrade

Log file from March 6th am to March 6th pm

Log file from March 6th pm to March 7th pm

Log file from March 7th pm to March 9th pm

Cryptome2 commented 6 years ago

same issue here with 2 new nodes (versions W2010-geth182,W2016server-geth181) : can't get them synced. Already resp 4 and 3 days syncing. . One has reached 110M new state entries. Lot's of failure warnings. Peer dropping. Peercount always between 0 and 7 , mostly between 1 and 3.

Cryptome2 commented 6 years ago

All ok now. enabled Upnp on the modem => 20+ peers

axcrest commented 6 years ago

I am running geth 1.8.2 and tried twice to sync so far. Both times the current block was always 65 behind the highest block. The first time I let it run for 5 hours before giving up. The second time I let it run for 12 hours. It is suspicious that 65 is a magic number right? I am running on a Mac OS 10.12.6 with an SSD drive.

> eth.syncing
{
  currentBlock: 5254610,
  highestBlock: 5254675,
  knownStates: 17559255,
  pulledStates: 17555242,
  startingBlock: 5254452
}

My peer counts after 12 hours:

> net.peerCount
4

> admin.peers.length
4

This is the command I am running (I run from the source code):

build/bin/geth --rpcapi db,eth,net,web3 --rpc console 2>> ../mainNetEth.log
Welcome to the Geth JavaScript console!

instance: Geth/v1.8.2-stable-b8b9f7f4/darwin-amd64/go1.8.3
 modules: admin:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0

bmlyon commented 6 years ago

Thanks Cryptome2, that fixed it for me as well.

Fargusson commented 6 years ago

Hi everyone.

I did a fresh Mist 0.10 update once it was released a few days ago and it's doing much better now. Still had to wait for a day and a half to download the state trie from scratch but compared to the 4+ days it took with Mist 0.9.x it's great.

With Mist 0.9.3 and Geth 1.8.2, I had only 5-7 peers once I synced with the network. With Mist 0.10.0 and Geth 1.8.2, I have 12-15 peers which is perhaps the reason why I'm able to sync faster. So I guess I experience some kind of software issue... But I'm still unable to have 25 peers like back in the ol' days though.

For the record, UPnP is enabled on my router and I manually opened the 30303 TCP/UDP ports. So @Cryptome2 solution wasn't the one for me.

Hope it will be even better with Geth 1.8.3 !

P.s: the light sync mode is unbelievably fast. Kudos to the devs!

kmajor commented 6 years ago

I'd like to add that I thought I was having the same problem as OP, but it turns out I just needed to be patient and it sometimes takes 5-10 minutes before you see some movement. When I sync with 'fast' results start populating right away, so I expected 'light' to exhibit the same behavior.

zhizouxiao commented 6 years ago

I have the exact same problem. Update to go1.10 version, recreate my chain, problem solved.

islishude commented 6 years ago

Same as Geth 1.8.12. full sync mode would be stop syncing sometime.

woei66 commented 6 years ago

Same as Geth 1.8.14-stable-316fc7ec on Mac OS and it can't finish the sync. It always delay 66 blocks. No idea to fix it.

jfraiese commented 6 years ago

Try upgrading to v1.8.14-stable (from 1.8.13). Resizing server to 8cores. Limiting cache to 2048. Stopstarting geth with the sync mode parameter.

karalabe commented 6 years ago

Syncing Ethereum is a pain point for many people, so I'll try to detail what's happening behind the scenes so there might be a bit less confusion.

The current default mode of sync for Geth is called fast sync. Instead of starting from the genesis block and reprocessing all the transactions that ever occurred (which could take weeks), fast sync downloads the blocks, and only verifies the associated proof-of-works. Downloading all the blocks is a straightforward and fast procedure and will relatively quickly reassemble the entire chain.

Many people falsely assume that because they have the blocks, they are in sync. Unfortunately this is not the case, since no transaction was executed, so we do not have any account state available (ie. balances, nonces, smart contract code and data). These need to be downloaded separately and cross checked with the latest blocks. This phase is called the state trie download and it actually runs concurrently with the block downloads; alas it take a lot longer nowadays than downloading the blocks.

So, what's the state trie? In the Ethereum mainnet, there are a ton of accounts already, which track the balance, nonce, etc of each user/contract. The accounts themselves are however insufficient to run a node, they need to be cryptographically linked to each block so that nodes can actually verify that the account's are not tampered with. This cryptographic linking is done by creating a tree data structure above the accounts, each level aggregating the layer below it into an ever smaller layer, until you reach the single root. This gigantic data structure containing all the accounts and the intermediate cryptographic proofs is called the state trie.

Ok, so why does this pose a problem? This trie data structure is an intricate interlink of hundreds of millions of tiny cryptographic proofs (trie nodes). To truly have a synchronized node, you need to download all the account data, as well as all the tiny cryptographic proofs to verify that noone in the network is trying to cheat you. This itself is already a crazy number of data items. The part where it gets even messier is that this data is constantly morphing: at every block (15s), about 1000 nodes are deleted from this trie and about 2000 new ones are added. This means your node needs to synchronize a dataset that is changing 200 times per second. The worst part is that while you are synchronizing, the network is moving forward, and state that you begun to download might disappear while you're downloading, so your node needs to constantly follow the network while trying to gather all the recent data. But until you actually do gather all the data, your local node is not usable since it cannot cryptographically prove anything about any accounts.

If you see that you are 64 blocks behind mainnet, you aren't yet synchronized, not even close. You are just done with the block download phase and still running the state downloads. You can see this yourself via the seemingly endless Imported state entries [...] stream of logs. You'll need to wait that out too before your node comes truly online.

Q: The node just hangs on importing state enties?!

A: The node doesn't hang, it just doesn't know how large the state trie is in advance so it keeps on going and going and going until it discovers and downloads the entire thing.

The reason is that a block in Ethereum only contains the state root, a single hash of the root node. When the node begins synchronizing, it knows about exactly 1 node and tries to download it. That node, can refer up to 16 new nodes, so in the next step, we'll know about 16 new nodes and try to download those. As we go along the download, most of the nodes will reference new ones that we didn't know about until then. This is why you might be tempted to think it's stuck on the same numbers. It is not, rather it's discovering and downloading the trie as it goes along.

Q: I'm stuck at 64 blocks behind mainnet?!

A: As explained above, you are not stuck, just finished with the block download phase, waiting for the state download phase to complete too. This latter phase nowadays take a lot longer than just getting the blocks.

Q: Why does downloading the state take so long, I have good bandwidth?

A: State sync is mostly limited by disk IO, not bandwidth.

The state trie in Ethereum contains hundreds of millions of nodes, most of which take the form of a single hash referencing up to 16 other hashes. This is a horrible way to store data on a disk, because there's almost no structure in it, just random numbers referencing even more random numbers. This makes any underlying database weep, as it cannot optimize storing and looking up the data in any meaningful way.

Not only is storing the data very suboptimal, but due to the 200 modification / second and pruning of past data, we cannot even download it is a properly pre-processed way to make it import faster without the underlying database shuffling it around too much. The end result is that even a fast sync nowadays incurs a huge disk IO cost, which is too much for a mechanical hard drive.

Q: Wait, so I can't run a full node on an HDD?

A: Unfortunately not. Doing a fast sync on an HDD will take more time than you're willing to wait with the current data schema. Even if you do wait it out, an HDD will not be able to keep up with the read/write requirements of transaction processing on mainnet.

You however should be able to run a light client on an HDD with minimal impact on system resources. If you wish to run a full node however, an SSD is your only option.

ethereum / go-ethereum