followLatest catchup algorithm does not work for small peer sets

urtho commented 2 months ago

Status

When 3.25 follower node runs with small peer set (eg testnet or a colocated relay) the algorithm often aborts sync without immediate resync trigger causing 15 second long pause in tip following for conduit.

Expected

Sync should not run abort when following tip with small peers sets.

Solution

Maybe peers should not be downranked when they return 404 in followLatest mode? 3.21 does not seem to experience this issue but 3.25 does - but that is just an observation without a proper test.

urtho commented 2 months ago

{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635669): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.328379Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.328493Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635668): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.370168Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.370267Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).sync","level":"info","line":732,"msg":"Catchup Service: finished catching up, now at round 41635667 (previously 41635667). Total time catching up 1.936764984s.","name":"","time":"2024-07-06T14:32:16.370325Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).periodicSync","level":"info","line":660,"msg":"It's been too long since our ledger advanced; resyncing","name":"","time":"2024-07-06T14:32:33.371646Z"}

example logs from a testnet 3.25 follower

urtho commented 2 months ago

Also - IMHO there should be no scenario that causes the number of peers to go to zero.

gmalouf commented 1 month ago

@urtho before we take any action on this, some of this behavior was introduced in #5836

urtho commented 1 month ago

3.25 follower just hiccups on small peer sets.

I am running smoothly with this dynamic backoff, where pc is peer count across enabled peerSelectors.

    if s.followLatest {
        bo := max(followLatestBackoff, time.Duration(2000/pc)*(time.Millisecond))
        time.Sleep(bo)
    }

algorand / go-algorand