IntersectMBO / ouroboros-network

Specifications of network protocols and implementations of components running these protocols which support a family of Ouroboros Consensus protocols; the diffusion layer of the Cardano Node.
https://ouroboros-network.cardano.intersectmbo.org
Apache License 2.0

Selection pressure for peers with long chainsync timeout. #4244

Open karknu opened 1 year ago

karknu commented 1 year ago

When a connection is promoted to hot, a timeout is picked at random from the array [90, 135, 180, 224, 269] (in seconds) for use by the chainsync protocol. When a gap in block production lasts longer than that timeout, the timeout triggers and the peer is demoted to cold. The idea is that any given gap in block production replaces only a subset of peers. This scheme works fine with static peers when running in non-p2p mode.

In the p2p case, the set of hot peers tends to contain more and more peers with long chainsync timeouts. Example: a node starts with 20 hot peers with the following timeouts: [4 x 90, 4 x 135, 4 x 180, 4 x 224, 4 x 269]. There is a 91s gap in block production, so the four peers with a 90s timeout are replaced by four new peers with random timeouts. This happens on all p2p nodes: timed-out peers are replaced by peers with new random timeouts.

This means that peers with large timeouts accumulate in the set of hot peers on all nodes. When a 224s gap finally happens, it isn't 20% of all peers being replaced; it could be 30% or 40%.
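The drift described above can be reproduced with a small simulation. This is only an illustrative sketch, not code from ouroboros-network: the function names, the pool of 1000 hot-peer slots, and the exponential gap model with a 20s mean are all assumptions made for the example. Only the timeout array comes from the issue.

```python
import random

TIMEOUTS = [90, 135, 180, 224, 269]  # seconds, taken from the issue
LONG = {224, 269}                    # what this sketch counts as "long"


def long_share(peers):
    """Fraction of hot peers whose timeout is one of the two longest."""
    return sum(t in LONG for t in peers) / len(peers)


def simulate_fixed_timeouts(num_peers=1000, num_gaps=20_000,
                            mean_gap=20.0, seed=1):
    """Replay block-production gaps against per-connection timeouts.

    Assumptions (not from the issue): gaps are exponentially distributed
    with mean `mean_gap` seconds, and every timed-out peer is replaced
    immediately by a new hot peer with a fresh random timeout.
    """
    rng = random.Random(seed)
    # Start from an even split, as in the 4 x 90, 4 x 135, ... example.
    peers = [TIMEOUTS[i % len(TIMEOUTS)] for i in range(num_peers)]
    for _ in range(num_gaps):
        gap = rng.expovariate(1.0 / mean_gap)
        if gap <= TIMEOUTS[0]:
            continue  # shorter than every timeout: nobody is demoted
        # Peers whose timeout expired are replaced; the rest keep theirs.
        peers = [rng.choice(TIMEOUTS) if t < gap else t for t in peers]
    return long_share(peers)


if __name__ == "__main__":
    print("initial share of long timeouts: 0.40")
    print(f"share after simulation: {simulate_fixed_timeouts():.2f}")
```

Because short timeouts are replaced far more often than long ones, the share of long timeouts ends up above the initial 40% under these assumptions. The exact figure depends on the assumed gap distribution.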

Instead of using a constant timeout for the lifetime of the connection, it would be better if the chainsync protocol picked a random timeout each time it prepares to wait for the peer to present it with a new tip.
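A minimal sketch of the per-wait draw, to show why it removes the bias. The function names and the pool size are hypothetical and this is not the actual chainsync implementation. Since every wait draws a fresh timeout, the chance that a peer times out depends only on the length of the gap, not on how long the connection has already survived.

```python
import random

TIMEOUTS = [90, 135, 180, 224, 269]  # seconds, taken from the issue


def times_out(gap_seconds, rng):
    """Draw a fresh timeout for this wait and report whether it expires.

    Hypothetical helper: models a single wait for the peer's next tip.
    """
    timeout = rng.choice(TIMEOUTS)
    return timeout < gap_seconds


def dropped_fraction(gap_seconds, num_peers=10_000, seed=1):
    """Fraction of hot peers demoted by one gap under per-wait draws."""
    rng = random.Random(seed)
    dropped = sum(times_out(gap_seconds, rng) for _ in range(num_peers))
    return dropped / num_peers


if __name__ == "__main__":
    for gap in (50, 91, 225, 300):
        print(f"gap {gap:>3}s: {dropped_fraction(gap):.2%} of peers dropped")
```

With this scheme a 91s gap demotes roughly one peer in five no matter how many gaps came before it, because no peer carries a long timeout from one wait to the next.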

karknu commented 1 year ago

For the non-p2p case, a connection with an expired short timeout is replaced with a new connection to the same peer with a new random timeout. This means that the drive to end up with connections with long chainsync timeouts is present in the non-p2p case too.

coot commented 1 year ago

Nice discovery! Yes, we should indeed draw the timeout in the StMustReply state of the chain-sync mini-protocol.