fix: adding retry mechanism to fetch global snapshots from another peer if it fails during rollback

IPadawans commented 2 months ago

Changes

Adds a retry mechanism (up to 10 attempts) for when a global snapshot is not found on a peer during rollback.
Previously this error aborted the metagraph restart whenever we fetched from a peer that did not have the global snapshot for the requested hash.
I've also added more logs to make the traversal easier to follow.
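
For illustration only, here is a minimal, language-agnostic sketch of the retry idea described above, written in Python rather than the project's Scala. It is not the code in this PR: `fetch_snapshot_with_retry`, the `fetch` callback, and the round-robin peer selection are all hypothetical names and choices made for the example. Only the limit of 10 attempts and the logging of each failure come from the description.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger("rollback")

MAX_ATTEMPTS = 10  # the "10 times" from the PR description


def fetch_snapshot_with_retry(
    snapshot_hash: str,
    peers: Sequence[str],
    fetch: Callable[[str, str], Optional[dict]],  # hypothetical: returns None when the peer lacks the snapshot
    max_attempts: int = MAX_ATTEMPTS,
) -> dict:
    """Ask peers in turn for a snapshot until one has it or attempts run out."""
    if not peers:
        raise ValueError("no peers to fetch from")

    for attempt in range(1, max_attempts + 1):
        # Assumption: rotate through the peer list so a retry hits a different peer.
        peer = peers[(attempt - 1) % len(peers)]
        snapshot = fetch(peer, snapshot_hash)
        if snapshot is not None:
            return snapshot
        logger.warning(
            "attempt %d/%d: peer %s does not have snapshot %s",
            attempt, max_attempts, peer, snapshot_hash,
        )

    # Only give up (and fail the rollback) after every attempt has been used.
    raise LookupError(
        f"snapshot {snapshot_hash} not found after {max_attempts} attempts"
    )
```

The point of the sketch is that a single peer missing the snapshot is logged and skipped, and the restart fails only once all attempts are exhausted.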

Tests

It's working as expected

marcinwadon commented 2 months ago

As I think about it - maybe the issue is not with the fetch snapshot process and your fix is just a shortcut, and the issue may be with the node status itself. There should be no possibility that a node is Ready and does not answer correctly with snapshots. Or maybe we fetch from non-ready or unresponsive peers? Have you checked that first?

AlexBrandes commented 2 months ago

> As I think about it - maybe the issue is not with the fetch snapshot process and your fix is just a shortcut, and the issue may be with the node status itself. There should be no possibility that a node is Ready and does not answer correctly with snapshots. Or maybe we fetch from non-ready or unresponsive peers? Have you checked that first?

In the cases that Marcus and I looked at together, the nodes were in Ready state but stuck at an old ordinal, likely out of consensus but not removed from the network. The requests for snapshots were returning 404. Requests for /global-snapshots/latest (as a debug step) were succeeding, but the ordinal was not increasing.

I would defer to your judgement overall here but I would suggest we still include fallback behavior like this even if it should be impossible for a node to be in Ready state and not respond correctly. A retry with the issue logged would let us recover gracefully in the case of some future unknown bug even if we fix the root cause of this one.
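
As a rough illustration of the debug step mentioned above (polling /global-snapshots/latest and seeing that the ordinal does not move), a check like the following could flag a stuck peer. This is a hypothetical Python sketch, not project code: `is_stuck`, the `latest_ordinal` callback (assumed to wrap the HTTP request), and the default sample count and interval are all assumptions.

```python
import time
from typing import Callable


def is_stuck(
    latest_ordinal: Callable[[], int],  # hypothetical: returns the ordinal from /global-snapshots/latest
    checks: int = 3,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the peer's latest ordinal never increases across `checks` samples."""
    first = latest_ordinal()
    for _ in range(checks - 1):
        sleep(interval_s)
        if latest_ordinal() > first:
            return False  # the peer is still advancing
    return True
```

A peer for which this returns True matches the symptom we saw: it answers requests and reports Ready, but its ordinal is frozen.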

marcinwadon commented 2 months ago

I have opened #925 that uses cats-retry, so I'm closing this PR.

Constellation-Labs / tessellation

fix: adding retry mechanism to fetch global snapshots from another peer if it fails during rollback #924
