automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

Sync protocol creates infinite loop if one node loses all its data. #364

Closed dan-weaver closed 3 years ago

dan-weaver commented 3 years ago

... If the other node still has what it thinks is valid sync state cached.

I created a test case here in https://github.com/automerge/automerge/pull/363

Maybe it's not a bug, but if this is something that should be handled outside of Automerge, I'm looking for guidance on how best to handle it.

ept commented 3 years ago

Hi @dan-weaver, thanks for this. The intended use of the sync API is that whenever you disconnect from a peer you use Automerge.Backend.encodeSyncState(syncState) to serialise the sync state to a byte array that can be written to disk, and when you reconnect, you use Automerge.Backend.decodeSyncState() to restore it again. You should do this saving and reloading on every disconnect/reconnect, even if the app continues running throughout this process. (I know this isn't well documented at the moment!)
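To make the recommended lifecycle concrete, here is a small self-contained sketch of the save-on-disconnect / restore-on-reconnect pattern. The `encodeSyncState`/`decodeSyncState` functions below are simplified stand-ins for `Automerge.Backend.encodeSyncState`/`decodeSyncState` (the real ones produce a compact byte array), and the `onDisconnect`/`onReconnect` handlers and `storage` map are hypothetical application code. The property being modelled is that only durable knowledge (`sharedHeads`) survives the round trip, while per-connection bookkeeping (`sentHashes`) starts fresh:

```javascript
// Simplified stand-ins for Automerge.Backend.encodeSyncState/decodeSyncState.
// Only the durable part of the sync state is persisted; ephemeral
// per-connection bookkeeping is deliberately dropped.
function encodeSyncState(syncState) {
  // Persist only what must survive a disconnect.
  return JSON.stringify({ sharedHeads: syncState.sharedHeads });
}

function decodeSyncState(bytes) {
  const durable = JSON.parse(bytes);
  // Ephemeral fields start fresh on every new connection.
  return { sharedHeads: durable.sharedHeads, sentHashes: {} };
}

// Hypothetical application-side persistence, keyed by peer.
const storage = new Map();

function onDisconnect(peerId, syncState) {
  storage.set(peerId, encodeSyncState(syncState));
}

function onReconnect(peerId) {
  const bytes = storage.get(peerId);
  return bytes ? decodeSyncState(bytes) : { sharedHeads: [], sentHashes: {} };
}

// Mid-session state: one shared head, one in-flight change recorded.
const live = { sharedHeads: ['abc123'], sentHashes: { def456: true } };
onDisconnect('peer-1', live);
const restored = onReconnect('peer-1');
console.log(restored.sharedHeads);                    // ['abc123'], durable
console.log(Object.keys(restored.sentHashes).length); // 0, reset
```

The point of doing this on every disconnect/reconnect, even within a single app run, is that any assumptions tied to the old connection are discarded rather than carried over.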

In the test in #363, if you add s1 = decodeSyncState(encodeSyncState(s1)) at the point where the data loss occurs, the infinite loop is fixed and the nodes resynchronise correctly (using the protocol's existing logic to detect and recover from data loss). In a real system, this means we are assuming that data loss on a node can only happen while it is disconnected (i.e. we assume we cannot have data loss without a crash that would be noticed as a disconnect). I am now wondering if that is really a safe assumption to make. I guess it depends on the network transport layer in use, and what its concept of a "connection" looks like.
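The following toy model illustrates why the round trip breaks the loop. It is a stand-in for Automerge's sync logic, not the real implementation: `generateSyncMessage` here is a hypothetical simplification, and the sync state is reduced to `sharedHeads` plus `sentHashes` (changes a peer believes it already delivered). If the remote peer silently loses its data while the local peer keeps stale bookkeeping, the local peer never re-offers anything; resetting the ephemeral fields, as `decodeSyncState(encodeSyncState(s))` does, lets recovery proceed:

```javascript
// Toy model, not Automerge's real message logic: offer only changes we
// think the peer doesn't already have.
function generateSyncMessage(changes, syncState) {
  return changes.filter(
    (c) => !syncState.sharedHeads.includes(c) && !syncState.sentHashes.has(c)
  );
}

// Round trip through (simplified) encode/decode: keep sharedHeads, reset
// ephemeral bookkeeping, mirroring decodeSyncState(encodeSyncState(s)).
function resetSyncState(syncState) {
  return { sharedHeads: [...syncState.sharedHeads], sentHashes: new Set() };
}

const localChanges = ['c1', 'c2'];

// Stale state: we already sent c1 and c2 on a previous connection, but the
// remote peer has since lost everything, so it still needs both.
const stale = { sharedHeads: [], sentHashes: new Set(['c1', 'c2']) };
console.log(generateSyncMessage(localChanges, stale)); // [], nothing sent, sync stalls

// Applying the reset at the data-loss point re-offers everything.
const fresh = resetSyncState(stale);
console.log(generateSyncMessage(localChanges, fresh)); // ['c1', 'c2']
```

This is why the fix works when data loss coincides with a disconnect: the reconnect path naturally performs the reset. The open question in this issue is what happens when data is lost without any disconnect being observed.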

I'm thinking maybe we should still include your patch in order to handle the edge case of a node losing data without disconnecting. I will note that your patch only detects complete data loss, and it will not handle partial data loss (where some changes are lost but some heads survive), but it's still an improvement to the robustness of the protocol. What do you think?

dan-weaver commented 3 years ago

That makes sense to me. We do attempt to follow the recommended practice of encoding/decoding and saving that sync state.

I have a number of theories for why we're running into this edge case but haven't really pinned it down yet. My best guess is that clients are failing to receive the disconnect notification, either because it's getting lost or because we have a buggy client somewhere.

I was also hoping that in some instances we could cache the syncStates only temporarily: for example, if we haven't seen a client in a long time, it might be reasonable to evict its state but still expect the protocol to work. Likewise, I'm wondering whether evicting entire documents from storage would ever be valid, e.g. if a client decides it's simply not interested in that data, only to decide later that it is. Currently that would require us to notify the other participants that we've done so, but those clients could be offline at the time.

I'm not that familiar with the wider implications of my changes here, so I'll trust your judgment on that; I just tried things until the test passed :). I do think we'd gain value from the patch, though.