urtho opened 2 months ago
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635669): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.328379Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.328493Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":343,"msg":"fetchAndWrite(41635668): Could not fetch: no block available for given round (attempt 11)","name":"","time":"2024-07-06T14:32:16.370168Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).fetchAndWrite","level":"debug","line":309,"msg":"fetchAndWrite: was unable to obtain a peer to retrieve the block from","name":"","time":"2024-07-06T14:32:16.370267Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).sync","level":"info","line":732,"msg":"Catchup Service: finished catching up, now at round 41635667 (previously 41635667). Total time catching up 1.936764984s.","name":"","time":"2024-07-06T14:32:16.370325Z"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).periodicSync","level":"info","line":660,"msg":"It's been too long since our ledger advanced; resyncing","name":"","time":"2024-07-06T14:32:33.371646Z"}
Example logs from a testnet 3.25 follower.
Also - IMHO there should be no scenario that causes the number of peers to go to zero.
@urtho before we take any action on this, some of this behavior was introduced in #5836
3.25 follower just hiccups on small peer sets.
I am running smoothly with this dynamic backoff, where `pc` is the peer count across enabled peer selectors:

```go
if s.followLatest {
	// Back off at least followLatestBackoff, longer when few peers are available.
	bo := max(followLatestBackoff, time.Duration(2000/pc)*time.Millisecond)
	time.Sleep(bo)
}
```
Status
When a 3.25 follower node runs with a small peer set (e.g. testnet, or a single colocated relay), the algorithm often aborts sync without an immediate resync trigger, causing a 15-second pause in tip following for Conduit.
Expected
Sync should not abort when following the tip with small peer sets.
Solution
Perhaps peers should not be downranked when they return 404 in followLatest mode? 3.21 does not seem to experience this issue but 3.25 does, though that is just an observation without a proper test.