Closed IPadawans closed 2 months ago
Thinking about it more: maybe the issue is not with the snapshot fetch process, and your fix is just a shortcut around a problem with the node status itself. It should not be possible for a node to be Ready and still fail to respond correctly to snapshot requests. Or maybe we fetch from non-ready or unresponsive peers? Have you checked that first?
In the cases that Marcus and I looked at together, the nodes were in Ready state but stuck at an old ordinal, likely out of consensus but not removed from the network. The requests for snapshots were 404ing. Requests for /global-snapshots/latest (as a debug step) were succeeding, but the ordinal was not increasing.
I would defer to your judgement overall here, but I would suggest we still include fallback behavior like this, even if it should be impossible for a node to be in Ready state and not respond correctly. A retry with the issue logged would let us recover gracefully from some future unknown bug, even after we fix the root cause of this one.
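To make the suggestion concrete, here is a minimal sketch of the fallback I have in mind, written in plain Scala with no retry library. All names here (`Peer`, `Snapshot`, `fetchWithFallback`, and the `fetch` and `log` parameters) are hypothetical and only for illustration; they are not the project's actual API. It tries each peer in turn, logs every failure, and returns the first snapshot it gets.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical types for illustration only; not the project's real model.
final case class Peer(id: String)
final case class Snapshot(ordinal: Long)

object SnapshotFallback {

  /** Try each peer in order. Every failure is logged and the next peer is tried.
    * Returns the first snapshot obtained, or Left with all collected error
    * messages if every peer failed.
    */
  def fetchWithFallback(
      peers: List[Peer],
      ordinal: Long,
      fetch: (Peer, Long) => Try[Snapshot], // assumed fetch function, e.g. an HTTP call
      log: String => Unit
  ): Either[List[String], Snapshot] = {

    @tailrec
    def loop(remaining: List[Peer], errors: List[String]): Either[List[String], Snapshot] =
      remaining match {
        case Nil =>
          // No peers left: report every failure in the order it happened.
          Left(errors.reverse)
        case peer :: rest =>
          fetch(peer, ordinal) match {
            case Success(snapshot) =>
              Right(snapshot)
            case Failure(e) =>
              val msg = s"peer ${peer.id} failed for ordinal $ordinal: ${e.getMessage}"
              log(msg) // keep the issue visible instead of swallowing it
              loop(rest, msg :: errors)
          }
      }

    loop(peers, Nil)
  }
}
```

With a Ready-but-stuck peer first in the list, the call logs the 404 and still succeeds from the next peer, so the failure stays visible in the logs without blocking recovery.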
I have opened #925 that uses cats-retry, so I'm closing this PR.
Changes
Tests
It's working as expected