Geth forgetting peers list

msbachler commented 8 years ago

My colleague and I are running a two-node private network for testing and development. We are using instance: "Geth/v1.3.3/linux/go1.5.1". We each set up a static-nodes.json file pointing to the other node. When we start geth everything works fine and they mine away happily, taking turns. But if we leave it running for a day or two, it seems that geth forgets its peer information and so stops syncing. I have tested this by typing admin.peers at the geth console, and I see the peer information listed. Later, when I can see the nodes are completely out of sync, I type admin.peers again and it returns an empty array. Geth has been running the whole time, but something has caused it to empty the peers array. This has happened many times now. Each time we have to reboot the server, restart geth and let the nodes sync again, and then a day or two later it happens again. It's quite annoying. I assume it is a bug somewhere? Any help would be greatly appreciated.

karalabe commented 8 years ago

We've heard a few reports like this, always on private networks and not the main/test nets. I have an idea of what might cause this. We have a feature in the network protocol that refuses remote connections within 30 seconds of a connect/drop.

My hunch is that with a 2 node network, if the connection drops for some reason, the dropped peer will try to reconnect but fail because the remote side doesn't accept it yet (it just dropped it), and when the remote side wants to connect, the first node doesn't accept it since the 30 sec cooldown hasn't passed yet.
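To make the hunch above concrete, here is a toy model in plain JavaScript. It is not geth's p2p code. It assumes (this is not stated in the thread) that every dial attempt, even a rejected one, counts as a connect/drop event on both ends and restarts the 30 second window. The names makeNode, accepts, dial and simulate are made up for illustration.

```javascript
// Toy model only: NOT geth's actual p2p implementation.
var COOLDOWN_MS = 30000; // the 30-second window mentioned above

function makeNode(name) {
  // lastEvent = time of the last connect/drop seen by this node
  return { name: name, lastEvent: 0 };
}

// A node accepts an inbound dial only once its cooldown has elapsed.
function accepts(node, now) {
  return now - node.lastEvent >= COOLDOWN_MS;
}

// ASSUMPTION: any attempt, successful or not, resets the timer on both ends.
function dial(from, to, now) {
  var ok = accepts(to, now);
  from.lastEvent = now;
  to.lastEvent = now;
  return ok;
}

// The link drops at t=0; the two nodes then take turns redialling every
// retryMs. Returns the time of the first accepted dial, or null if none
// succeeds within durationMs.
function simulate(retryMs, durationMs) {
  var a = makeNode("A");
  var b = makeNode("B");
  var turn = 0;
  for (var now = retryMs; now <= durationMs; now += retryMs) {
    var from = turn % 2 === 0 ? a : b;
    var to = turn % 2 === 0 ? b : a;
    if (dial(from, to, now)) {
      return now;
    }
    turn++;
  }
  return null;
}
```

Under that assumption, simulate(10000, 600000) returns null (retrying every 10 seconds never gets through in 10 minutes), while simulate(30000, 600000) returns 30000. Whether geth actually behaves this way is exactly what the logs would need to confirm.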

If this is the error, it should be fixable easily enough. I'll take a look. Thanks for the detailed report!

karalabe commented 8 years ago

Hmm, we can't really seem to reproduce the problem. If this happens consistently for you, could you try running develop for a few days and when it appears, enable logging on both servers via debug.verbosity(6) and send us the logs after a few minutes?

msbachler commented 8 years ago

Hi!

Thanks for looking at this.

We are currently running the stable version, because when we ran development we had issues (can’t recall what) and were not sure if they were due to bugs in development, so we reverted to stable and started our blockchain again. So I am slightly wary of updating to development again and introducing more issues. But I guess if our current issue is solved for certain in the development version, that might be a reason to update. Can we change the verbosity logging to 6 on our stable versions and send you the logs of those? I currently have verbosity set to 4 and I think Kevin does too on his server. I currently have a massive log file from when we started our new blockchain. Not sure if verbosity 4 is enough to be worth sending you our current logs?

Thanks

Michelle

From: Péter Szilágyi [mailto:notifications@github.com] Sent: 09 March 2016 11:38 To: ethereum/go-ethereum go-ethereum@noreply.github.com Cc: Michelle.Bachler michelle.bachler@open.ac.uk Subject: Re: [go-ethereum] Geth forgetting peers list (#2250)

-- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302). The Open University is authorised and regulated by the Financial Conduct Authority.

msbachler commented 8 years ago

Hi!

Just talked to Kevin and he reminded me what the issue was with running the development version. It basically got out of sync much more often and we were always having to reboot the server. There was also other odd behaviour, such as the same block being mined on both machines. Not sure if it was all due to the same issue with dropping peers, but it was basically a lot worse. So on stable the forgetting of peers happens less often and mostly it is OK. On development it was a nightmare, so we really don’t want to run the development version unless it is more stable than the last time we tried, as we are in the middle of Dapp development and don’t want to have our nodes continually lose sync and go screwy.

Could we up our verbosity to 6 on both nodes and start new log files and next time we go out of sync send you the logs?

Michelle

karalabe commented 8 years ago

Most of the juicy stuff happens at log level 6, so probably that would be needed. The reason I suggested develop was mostly because you can change the log level from within the console, so you could raise it only when the nodes go out of sync instead of having them stay raised all the time (producing an unimaginably huge log file :D).

So all in all running with --verbosity=6 against master is also fine, if it doesn't bother you that much having to deal with the huge logs + the performance hit caused by it.

msbachler commented 8 years ago

OK. I had not considered log size/performance hit. I will talk to Kevin. We could try it for a bit and see how bad it gets re file size and performance hit.

Thanks

Michelle

From: Péter Szilágyi [mailto:notifications@github.com] Sent: 09 March 2016 12:06 To: ethereum/go-ethereum go-ethereum@noreply.github.com Cc: Michelle.Bachler michelle.bachler@open.ac.uk Subject: Re: [go-ethereum] Geth forgetting peers list (#2250)

msbachler commented 8 years ago

Hi!

I recently had to create a second Ethereum blockchain. I had a lot of trouble keeping the two nodes connected. They kept dropping their peer and getting out of sync. I set the verbosity to 6 to see what I would see. It seems that one end got a ‘useless peer’ message and dropped the peer. I did some Googling and it seems this message happens if the node thinks the peer is too far out of sync to bother syncing with it anymore. Am I understanding that correctly? As this was a new network, blocks were being generated really quickly, so I can see why the nodes might quickly get out of sync. I stopped the mining on one of the two nodes and at the console kept running addPeer each time admin.peers came back empty, to keep trying to connect them. Sometimes I had to stop both mining before I could get addPeer to work. The block count is now at 3000 and the block generation is slowing down a bit, and now with only one of the two nodes mining it seems to be keeping them both in sync. As the block count goes up and the block generation slows down, I will experiment with them both mining again to see if they stay synced.

But the underlying issue seems to be that even though this is a two-node network with a static peers file and nodiscover set in the geth command, the peer-to-peer protocol does not seem to care that these nodes should NEVER be disconnected, and still throws a useless peer and disconnects them from each other. I am now wondering if this is also what is happening on the other, more established network where I reported the disconnection issue, where they lose connection every 2-4 days. I don’t know the peer-to-peer code, but if you are running a fixed network of set nodes, is there some way of telling it never to decide a node is a ‘useless peer’ and to always keep syncing them?
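The manual workaround described above (re-running addPeer whenever admin.peers comes back empty) could be scripted roughly as follows. This is only a sketch: it papers over the symptom and does nothing to stop the disconnects. admin.peers and admin.addPeer are the geth console calls already used in this thread; ensurePeers and the enode URL in the comment are hypothetical placeholders.

```javascript
// Re-add every static peer if the node currently has no peers at all.
// `admin` is the object exposed by the geth console; it is passed in as
// a parameter so the function can also be exercised with a stub.
function ensurePeers(admin, staticEnodes) {
  if (admin.peers.length > 0) {
    return []; // still connected to someone, nothing to do
  }
  var readded = [];
  for (var i = 0; i < staticEnodes.length; i++) {
    admin.addPeer(staticEnodes[i]);
    readded.push(staticEnodes[i]);
  }
  return readded; // the enode URLs we asked geth to dial again
}

// Hypothetical usage from the geth console (placeholder URL, fill in your own):
// ensurePeers(admin, ["enode://<node-id>@<ip>:30303"]);
```

The function is written in ES5 style on the assumption that older geth consoles may not accept newer JavaScript syntax.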

Thanks

Michelle

Joter271 commented 8 years ago

Is this issue solved? Our geth (1.4.11) node loses all its peers at random intervals ranging from one day to two weeks...

ZhuWeiyang commented 8 years ago

We also have this problem. Our geth (1.4.9) nodes lose some of their peers at random intervals ranging from three days to two weeks.

We set up a three-node private network with the nodiscover flag and a static-nodes.json configuration. Node 1 is used to synchronize blocks and receive requests from web3, while node 2 and node 3 are used to mine blocks.

The last time the problem occurred, we ran admin.peers on each node. The result was quite confusing:

On node 1: connected to node 2 and node 3
On node 2: connected to node 3
On node 3: connected to node 2

Node 1 stopped synchronizing blocks at that time. Node 2 and node 3 did not receive any transactions from node 1. As soon as we restarted all three nodes, everything went well.
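The one-sided view described above (node 1 lists nodes 2 and 3, but neither of them lists node 1) can be checked mechanically once the admin.peers output of each node has been reduced to a list of peer names. A minimal sketch, assuming that reduction has already been done by hand; findOneSidedLinks and the node names are made up for illustration:

```javascript
// views maps each node name to the peers that node reports as connected.
// Returns [node, peer] pairs where `node` lists `peer`, but `peer` does
// not list `node` back.
function findOneSidedLinks(views) {
  var oneSided = [];
  Object.keys(views).forEach(function (node) {
    views[node].forEach(function (peer) {
      var back = views[peer] || []; // unknown peers report nothing
      if (back.indexOf(node) === -1) {
        oneSided.push([node, peer]);
      }
    });
  });
  return oneSided;
}

// The situation reported above:
var reported = {
  node1: ["node2", "node3"],
  node2: ["node3"],
  node3: ["node2"]
};
// findOneSidedLinks(reported) -> [["node1", "node2"], ["node1", "node3"]]
```

Both one-sided links involve node 1, which matches the report that node 1 was the node that stopped synchronizing.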

How can we solve the problem? Are there any suggestions?

fjl commented 8 years ago

We cannot see what the reason for disconnect is without a log file. Using --verbosity=6 will generate too much data. My suggestion is to run geth with --vmodule=eth/*=6,p2p=6 and capture stderr to a file, then attach that file here.

ZhuWeiyang commented 8 years ago

@fjl, following your instructions, we've recorded logs. Here are some logs linked to the problem.

Node1: 172.16.128.82 log_node1.txt

Node2: 172.16.128.56 log_node2.txt

fjl commented 7 years ago

Thanks! In your log Node2 tries to connect to Node1 but Node1 thinks that a connection is already established and denies the attempt.

The interesting part is this (on Node1):

I1101 11:44:01.748290 p2p/peer.go:160] Peer cf175c5f2d0ee9e6 172.16.128.56:59229: write error: write tcp 10.10.3.11:30303->172.16.128.56:59229: write: connection reset by peer
...
I1101 11:44:07.845892 eth/downloader/downloader.go:1559] Peer cf175c5f2d0ee9e6 [hs 3833.04/s, bs 0.00/s, rs 0.00/s, ss 0.00/s, miss    0, rtt 56.638522ms]: body delivery timeout
...
I1101 11:44:13.945954 eth/downloader/downloader.go:1602] Peer cf175c5f2d0ee9e6 [hs 3833.04/s, bs 0.00/s, rs 0.00/s, ss 0.00/s, miss    0, rtt 56.638522ms]: requesting 2 body(s), first at #489424
...

This indicates that the eth protocol handler did not drop the connection after the error. If this happens again, please capture goroutine stacks. You can do this by attaching a JS console (geth attach) and executing debug.stacks().
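The pattern pointed out above (a write error on a peer, followed by the downloader still talking to that same peer) can be searched for in a captured log with a small script. A rough sketch, assuming the log lines look like the excerpt above, with "Peer <hex id>" and the literal text "write error"; the function name is made up and the matching is deliberately naive:

```javascript
// Returns the ids of peers that logged a write error and were then
// mentioned again on a later line, i.e. peers that apparently were not
// dropped after the error.
function peersUsedAfterWriteError(lines) {
  var failed = {};     // peer id -> true once a write error was seen
  var suspicious = []; // peer ids seen again after their write error
  lines.forEach(function (line) {
    var m = /Peer ([0-9a-f]+)/.exec(line);
    if (!m) {
      return; // line does not mention a peer
    }
    var id = m[1];
    if (line.indexOf("write error") !== -1) {
      failed[id] = true;
      return;
    }
    if (failed[id] && suspicious.indexOf(id) === -1) {
      suspicious.push(id);
    }
  });
  return suspicious;
}
```

Fed the three Node1 lines quoted above, this returns ["cf175c5f2d0ee9e6"]. A hit is only a hint about where to look, not proof of the bug.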

I will investigate this further.

ZhuWeiyang commented 7 years ago

Thank you for your help! This problem really happens from time to time. Next time we'll bring you the record of goroutine stacks.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

buptxiaofeng commented 6 years ago

any updates?

Ajit666 commented 6 years ago

admin.peers showing empty array.

adamschmideg commented 5 years ago

This issue is old and we didn't have enough information. If you experience it again, please open a new issue.
