ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

p2p discovery: public P2P nodes removed from boot nodes after being restarted #28528

Open aweneagle opened 1 year ago

aweneagle commented 1 year ago

System information

Geth version: latest
CL client & version: latest
OS & Version: Linux

Expected behaviour

Expected: whenever a p2p node is restarted, it can easily rejoin the boot node.

Actual behaviour

Actual: when the nodes sit behind an NLB, it is very difficult for a restarted p2p node to rejoin the boot node.

Steps to reproduce the behaviour

I built a small private Ethereum network for testing. One node acts as the boot node, hidden behind a network load balancer (NLB); two others act as p2p nodes, also behind an NLB. The network looked like this: opbnb P2P network connectivity issue (1)

Once the network was set up, it worked fine. But when I restarted the p2p nodes, the boot node removed them and they never came back!

Backtrace

Here are the debug logs from the p2p node when restarting: "started discovery service"

Screenshot 2023-11-15 12 08 44

Here are the logs from the boot node when it removes the p2p node: "dead node"

Screenshot 2023-11-15 12 07 55

And here are the configurations of the p2p node and the boot node. p2p node:

    --p2p.sync.req-resp
    --p2p.listen.ip=0.0.0.0
    --p2p.listen.tcp=9003
    --p2p.listen.udp=9003
    --p2p.priv.raw={priv key}
    --p2p.advertise.ip={public ip}

boot node:

    --p2p.listen.ip=0.0.0.0
    --p2p.listen.tcp=9003
    --p2p.listen.udp=9003
    --p2p.priv.raw={priv key}
    --p2p.advertise.ip={public ip}

We can see that the p2p node restarted successfully at "09:18:01" and was removed just 3 seconds later, at "09:18:04". The log "Removed dead node" means that three things had already happened within those 3 seconds:

  1. The boot node completed a handshake with the p2p node.
  2. The boot node added the p2p node to its table.
  3. The boot node sent a PING to the p2p node.

Eventually I found the root cause: the NLB on the p2p node side forwarded the first PING packet sent by the boot node to the previous pod IP of the p2p node. Illustrated below:

Screenshot 2023-11-15 14 35 43

Since the NLB needs some time to refresh its routing information (usually more than 3 seconds), it is very hard for the p2p node to receive the first PING from the boot node. I am wondering: should we PING more than once before concluding that the peer is not alive? For example, declare it dead only after 6 failed PINGs?

I understand that "PING only once and require a PONG within 700 milliseconds" helps ensure a high-quality network, but it is rather awkward that nodes then can't be placed behind an NLB.

fjl commented 11 months ago

The discovery protocol maintains a liveness counter for each node in the local table. For every liveness check (PING) that is successful (received PONG response), the counter is increased by one. When the node fails a check, the counter is halved. This effectively implements your suggestion: if the node has been live for a longer time, it is OK to fail a liveness check occasionally.
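The counter rule described above can be sketched in a few lines of Go. This is only an illustration of the description in this comment, not the actual geth source: the `node` type, the `recordCheck` method, and the assumption that a node is dropped once its counter reaches zero are hypothetical.

```go
package main

import "fmt"

// node is a hypothetical table entry holding only the liveness counter.
type node struct {
	livenessChecks uint
}

// recordCheck applies the rule described above: add one on a PONG,
// halve the counter on a failed check. It returns true when the counter
// has dropped to zero, which this sketch treats as "remove the node".
func (n *node) recordCheck(pong bool) (dead bool) {
	if pong {
		n.livenessChecks++
		return false
	}
	n.livenessChecks /= 2
	return n.livenessChecks == 0
}

func main() {
	// A freshly added node has no successful checks yet,
	// so a single failed PING is enough to drop it.
	fresh := &node{}
	fmt.Println(fresh.recordCheck(false)) // prints "true"

	// A node that has answered 8 checks survives an occasional failure.
	old := &node{}
	for i := 0; i < 8; i++ {
		old.recordCheck(true)
	}
	fmt.Println(old.recordCheck(false), old.livenessChecks) // prints "false 4"
}
```

Under these assumptions, a node that has just rejoined starts with a counter of zero, so it has no tolerance for a lost first PING, whereas a long-lived node can miss several in a row.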

When you reboot your node, are you ensuring the node key stays the same? P2P nodes are identified not just by their IP/port, but also by their node ID (which is derived from the node key). So when you transfer your node to a different pod, you need to ensure the node key is also carried over.