Open aweneagle opened 1 year ago
The discovery protocol maintains a liveness counter for each node in the local table. For every liveness check (PING) that is successful (received PONG response), the counter is increased by one. When the node fails a check, the counter is halved. This effectively implements your suggestion: if the node has been live for a longer time, it is OK to fail a liveness check occasionally.
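As a rough illustration of that counter rule, here is a minimal sketch in Go. The `node` type, the `recordCheck` method and the `evictable` helper are hypothetical names made up for this example, not geth's actual identifiers. The "removable once the counter is at zero after a failed check" rule is an assumption added to show why a node with no history is more fragile than a long-lived one.

```go
package main

import "fmt"

// node is a hypothetical stand-in for one entry in the local discovery table.
type node struct {
	livenessChecks uint // number of successful PING/PONG rounds, halved on failure
}

// recordCheck applies the rule described above:
// +1 for a successful check, halve the counter for a failed one.
func (n *node) recordCheck(ok bool) {
	if ok {
		n.livenessChecks++
		return
	}
	n.livenessChecks /= 2
}

// evictable reports whether the entry has no liveness history left.
// ASSUMPTION: treating "counter == 0 after a failure" as the removal
// condition is an illustration, not a statement about geth's exact logic.
func (n *node) evictable() bool {
	return n.livenessChecks == 0
}

func main() {
	// A node that has answered five checks survives several failures.
	veteran := &node{}
	for i := 0; i < 5; i++ {
		veteran.recordCheck(true)
	}
	veteran.recordCheck(false)
	fmt.Println("veteran:", veteran.livenessChecks, "evictable:", veteran.evictable())

	// A node with no history has nothing to halve.
	fresh := &node{}
	fresh.recordCheck(false)
	fmt.Println("fresh:", fresh.livenessChecks, "evictable:", fresh.evictable())
}
```

Under these assumptions, a node that comes back with a new identity starts from zero again, so a single missed PING is enough to leave it with no liveness history.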
When you reboot your node, are you ensuring the node key stays the same? P2P nodes are identified not just by their IP/port, but also by their node ID (which is derived from the node key). So when you transfer your node to a different pod, you need to ensure the node key is also carried over.
System information
Geth version: latest
CL client & version: latest
OS & Version: Linux
Expected behaviour
Expected: whenever a p2p node is restarted, it can easily rejoin the boot node.
Actual behaviour
Actual: when the nodes sit behind an NLB, it is very difficult for a restarted p2p node to rejoin the boot node.
Steps to reproduce the behaviour
I built a small private Ethereum network for testing: one node as the boot node, hidden behind a network load balancer (NLB), and two p2p nodes, also behind an NLB. The network looked like this:
Once the network was set up, it worked fine. But when I restarted the p2p nodes, they were removed by the boot node and never came back!
Backtrace
Here is the debug logs from p2p node when restarting: "started discovery service"
Here is that from boot node when removing the p2p node: "dead node"
And here are the configurations of p2p node and boot node, p2p node:
boot node:
We can see that the p2p node restarted successfully at "09:18:01" and was removed 3 seconds later, at "09:18:04". The log "Removed dead node" means that two things had been done successfully within those 3 seconds:
Finally I found the root cause: the first PING packet sent by the boot node was forwarded by the NLB on the p2p node side to the previous pod IP of the p2p node. Illustrated below:
Since the NLB needs some time to refresh its routing information (usually more than 3 seconds), it is very difficult for the p2p node to receive the first PING from the boot node. I am wondering: should we PING more than once before concluding that a peer is not alive? For example, declare it dead only after 6 failed PINGs?
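To make the proposal concrete, here is a minimal sketch in Go of "retry a few times before declaring the peer dead". It is not geth's code. `pingFunc` and `aliveAfterRetries` are hypothetical names, and the attempt count and pause are illustrative values only.

```go
package main

import (
	"fmt"
	"time"
)

// pingFunc is a hypothetical callback that sends one PING and reports
// whether a PONG arrived before the reply timeout.
type pingFunc func() bool

// aliveAfterRetries returns true as soon as one PING is answered, and
// false only after maxAttempts consecutive failures. It pauses between
// attempts so that a load balancer has time to refresh its routing.
func aliveAfterRetries(ping pingFunc, maxAttempts int, pause time.Duration) bool {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if ping() {
			return true
		}
		if attempt < maxAttempts {
			time.Sleep(pause)
		}
	}
	return false
}

func main() {
	// Simulated peer behind an NLB: the first three PINGs are routed to
	// the stale pod IP and get no answer, later ones reach the new pod.
	calls := 0
	behindStaleNLB := func() bool {
		calls++
		return calls > 3
	}

	alive := aliveAfterRetries(behindStaleNLB, 6, 10*time.Millisecond)
	fmt.Println("alive:", alive, "PINGs sent:", calls)
}
```

The trade-off is that a peer that really is dead stays in the table for up to `maxAttempts` rounds before it is dropped.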
I understand that "PING only once and require a PONG within 700 milliseconds" helps ensure high network quality, but it is rather awkward that nodes can't be placed behind an NLB.