p2p: investigate why re-dialing persistent peers consumes so many resources

cason commented 1 week ago

Node operators report that while the p2p layer is attempting to reconnect to a persistent peer (typically because the peer is unavailable, offline, etc.), the overall performance of the node degrades substantially. This is especially relevant in networks with short block times, where block times are observed to increase and proposers fail to get their blocks committed.

The method responsible for persistently attempting to dial a peer address is p2p.Switch.reconnectToPeer(*NetAddress). There is nothing really special about it in terms of resource consumption. Its main calls are sleeps and the dialing of the peer address, which goes through the same p2p.Switch.DialPeerWithAddress(*NetAddress) used to dial any address.

The re-dialing follows a standard (hard-coded) procedure, summarized here. First, there are 20 attempts at linear intervals (5s plus a random jitter of up to 3s). After that, the intervals are exponential, increasing powers of 3s, with the same jitter. At most 10 attempts are performed with exponential intervals, so at most 30 attempts are performed in total.
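To make the schedule concrete, here is a minimal Go sketch of it. This is not the actual reconnectToPeer code: the constant and function names are hypothetical, the exponent is assumed to start at 1 (3s, 9s, ..., 3^10 s), and sleeping after each failed attempt is a simplification. The dial and sleep functions are injected so the loop can be simulated without touching the network.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/rand"
	"time"
)

// Values taken from the description above; the names are hypothetical.
const (
	linearAttempts  = 20
	linearInterval  = 5 * time.Second
	backoffAttempts = 10
	backoffBase     = 3.0 // seconds; assumed exponents 1..10
	maxJitter       = 3 * time.Second
)

// baseInterval returns the interval (without jitter) associated with
// re-dial attempt n (1-based). The bool is false once attempts are exhausted.
func baseInterval(n int) (time.Duration, bool) {
	switch {
	case n < 1 || n > linearAttempts+backoffAttempts:
		return 0, false
	case n <= linearAttempts:
		return linearInterval, true
	default:
		secs := math.Pow(backoffBase, float64(n-linearAttempts))
		return time.Duration(secs * float64(time.Second)), true
	}
}

// redial calls dial up to 30 times, sleeping for the base interval plus a
// random jitter in [0, maxJitter) after every failed attempt. It returns
// the number of attempts made and whether one of them succeeded.
func redial(dial func() error, sleep func(time.Duration), rng *rand.Rand) (int, bool) {
	for n := 1; ; n++ {
		wait, ok := baseInterval(n)
		if !ok {
			return n - 1, false
		}
		if err := dial(); err == nil {
			return n, true
		}
		sleep(wait + time.Duration(rng.Int63n(int64(maxJitter))))
	}
}

// simulateUnreachable runs the loop against a peer that never answers and
// accumulates the sleep durations instead of actually sleeping.
func simulateUnreachable(seed int64) (attempts int, connected bool, total time.Duration) {
	dial := func() error { return errors.New("peer unreachable") }
	sleep := func(d time.Duration) { total += d }
	attempts, connected = redial(dial, sleep, rand.New(rand.NewSource(seed)))
	return attempts, connected, total
}

func main() {
	attempts, connected, total := simulateUnreachable(1)
	fmt.Printf("attempts=%d connected=%v simulated wait=%s\n",
		attempts, connected, total.Round(time.Second))
}
```

Under these assumptions the base intervals add up to 20 × 5s = 100s for the linear phase plus 3 + 9 + ... + 3^10 = 88,572s (roughly 24.6 hours) for the exponential phase, so the loop would spend almost all of its lifetime sleeping.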

Operators have proposed several times that the hard-coded parameters of this procedure be turned into configuration parameters.

But this issue should focus, in my opinion, on understanding the source of the overhead that has been observed.

jchappelow commented 1 week ago

I've also seen CPU pegged around 100% before a node gets a first peer. I'll collect a CPU profile to see what it's doing next time I reproduce it.

faddat commented 3 days ago

I can confirm the report of delays while attempting to connect to a persistent peer.

Eventually, we just stopped configuring them, and always used seeds only.

cometbft/cometbft #3267