Constellation-Labs / tessellation

Monadic execution contexts for topology organization
Apache License 2.0
49 stars 28 forks source link

fix: adding retry mechanism to fetch global snapshots from another peer if it fails during rollback #924

Closed IPadawans closed 2 months ago

IPadawans commented 2 months ago

Changes

Tests

It's working as expected

Captura de Tela 2024-08-10 às 09 25 57 Captura de Tela 2024-08-10 às 09 39 29
marcinwadon commented 2 months ago

As I think about it - maybe the issue is not with the fetch snapshot process and your fix is just a shortcut, and the issue may be with the node status itself. There should be no possibility that node is Ready and does not answer correctly with snapshots. Or maybe we fetch from non-ready or unresponsive peers? Have you checked that first?

AlexBrandes commented 2 months ago

As I think about it - maybe the issue is not with the fetch snapshot process and your fix is just a shortcut, and the issue may be with the node status itself. There should be no possibility that node is Ready and does not answer correctly with snapshots. Or maybe we fetch from non-ready or unresponsive peers? Have you checked that first?

In the cases that Marcus and I looked at together the nodes were in Ready state but stuck at an old ordinal, likely out of consensus but not removed from the network. The requests for snapshots were 404ing. Requests for /global-snapshots/latest (as a debug step) were succeeding but ordinal was not increasing.

I would defer to your judgement overall here but I would suggest we still include fallback behavior like this even if it should be impossible for a node to be in Ready state and not respond correctly. A retry with the issue logged would let us recover gracefully in the case of some future unknown bug even if we fix the root cause of this one.

marcinwadon commented 2 months ago

I have opened #925 that uses cats-retry, so I'm closing this PR.