Open cason opened 1 week ago
I've also seen CPU pegged around 100% before a node gets a first peer. I'll collect a CPU profile to see what it's doing next time I reproduce it.
I can confirm the report of delays while attempting to connect to a persistent peer.
Eventually, we just stopped configuring them, and always used seeds only.
There are reports from node operators that while the p2p layer is attempting to reconnect to a persistent peer (typically because it is unavailable, offline, etc.), the overall performance of the node degrades substantially. This is especially relevant in networks with short block times, where operators observe an increase in block times and proposers failing to get their blocks committed.
The method responsible for persistently attempting to dial a peer address is `p2p.Switch.reconnectToPeer(*NetAddress)`. There is nothing really special about it in terms of resource consumption. Its main calls are sleeps and calls to dial the peer address via `p2p.Switch.DialPeerWithAddress(*NetAddress)`, the same method used to dial any address.

The re-dialing is done using a standard (hard-coded) procedure, summarized here. There are `20` attempts with linear intervals (`5s` plus a random jitter of up to `3s`), then the intervals are exponential, increasing powers of `3s`, using the same jitter. At most `10` attempts are performed with exponential intervals, so at most `30` attempts are performed in total.

Turning the parameters used by this procedure into configuration parameters has been proposed several times by block operators.
But this issue should focus, in my opinion, on understanding the source of the overhead that has been observed.