automerge / automerge-repo

Data durability question #264

Open sroze opened 10 months ago

sroze commented 10 months ago

Looking at the code, I'm trying to figure out how much I can trust automerge-repo from a durability standpoint, and I'd love a few pointers on the pieces I don't understand.

  1. When peers send messages to each other, there doesn't seem to be any sort of acknowledgment. As such, is it fair to say that if a server fails to process a given message (e.g. storage fails, or there is a power outage at that moment), other peers won't know about it? If so, wouldn't they continue to send further messages, and wouldn't those fail given the de-synced heads?
  2. The Repo's use of the storage subsystem is out-of-band (in order to debounce). As such, does it mean that once a message has been sent and ack'ed by the Repo, there is a risk of data loss, because it might not have been persisted by the server?

Thank you so much 🙏

alexjg commented 9 months ago

Good questions.

  1. There isn't an acknowledgement as such, although there is logic to make sure we don't send messages unless either we have new changes or the other end has responded to the last message we sent, so we don't send messages indefinitely. More importantly, the transport is assumed to be reliable, and NetworkAdapters should emit "peer-disconnected" when a connection is lost, which notifies the Repo to stop sending messages on that connection (see the sketch just after this list). Obviously, detecting a disconnection often requires waiting for a timeout in the underlying transport.
  2. True, currently it is possible for a peer to acknowledge changes but then fall over before the change makes it to storage. In this case, when the peer comes back up again, the sender would reconnect and resend their changes (this is handled by the sync protocol).
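
To make the "peer-disconnected" point in 1 a bit more concrete, here is a rough sketch of the idea only: the adapter's job is to translate a transport-level disconnect into an event the Repo can react to. This is not automerge-repo's actual NetworkAdapter base class; the class name, constructor arguments and send method below are invented for illustration.

// Illustrative only: a made-up adapter shape, not the real automerge-repo API.
import { EventEmitter } from "node:events"

class IllustrativeWebSocketAdapter extends EventEmitter {
  constructor(socket, remotePeerId) {
    super()
    this.socket = socket
    this.remotePeerId = remotePeerId

    // Forward sync messages arriving over the wire.
    socket.on("message", (data) => this.emit("message", data))

    // When the transport drops (often only after its own timeout fires), tell
    // the Repo so it stops sending sync messages on this connection.
    socket.on("close", () => {
      this.emit("peer-disconnected", { peerId: this.remotePeerId })
    })
  }

  send(message) {
    this.socket.send(message)
  }
}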

To the second point, I have played with various implementations of "flush before ack" type things, but I'm still uncertain as to whether it's worth the performance hit and implementation complexity. One of the great things about local-first architectures is that you can be less dependent on the reliability of a single machine. Say the receiving peer is a sync server: if that peer falls over before saving to storage then, as long as some peer still has the data, they will resend it to the sync server when it comes back up. This is not to say that we shouldn't implement some kind of enhanced durability mode, but I would like to have a motivating use case for it; so far it hasn't been needed, so I don't know what the API should look like. In that vein, would you be able to describe your architecture and what kind of data loss you are concerned about?

sroze commented 9 months ago

Very clear, thank you @alexjg. Just so I understand your point though, what would happen in the following scenario:

  1. "client peer" sends an update, "server peer" receives it and stores it (successfully) in its storage.
  2. "client peer" sends another update based on 1, "server peer" receives it, but the out-of-band storage write fails (currently, the error is simply ignored).
  3. "client peer" sends another update based on 2, "server peer" receives it and stores it.

To prevent data loss during 3, the changes from both 2 & 3 need to be persisted. How would the "client peer" know that, during 3, it needs to send both the changes made in 2 & 3? If it doesn't, how are we preventing data loss? Sorry if these seem like basic questions; I'm not fully grasping the exchanges between peers.

alexjg commented 9 months ago

I'm not quite sure your scenario is correct. If there is an error in storage, the whole sync server should fall over, and then whatever service orchestration you're using would restart it. The error may still be asynchronous though, because we throttle flushes, which means the scenario might be more like this (a small sketch of this follows the list):

  1. client sends update, server receives and stores it
  2. client sends another update, server receives it
  3. sometime later server flushes, crashes, and restarts
  4. the client loses the original connection due to the restart and reconnects, which resets the sync state on the client
  5. the sync protocol negotiates what information each side has, causing the client to retransmit the second change, which was never saved
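
The key detail is that the error surfaces asynchronously. A minimal sketch of a throttled (debounced) save, not automerge-repo's actual storage code, shows why: by the time the batched write fails, the sync messages that produced it have long since been processed.

// Not the real implementation: just a debounced write, to illustrate why a
// storage error can show up well after the message that caused it.
let pending = []
let flushScheduled = false

function enqueueSave(chunk, writeToDisk) {
  pending.push(chunk)
  if (flushScheduled) return
  flushScheduled = true
  setTimeout(() => {
    flushScheduled = false
    const batch = pending.splice(0)
    // If this throws, the sender has already moved on and has no way of
    // knowing the data never reached disk. The answer above is to let the
    // crash happen and rely on the sync protocol to retransmit after restart.
    writeToDisk(batch)
  }, 100) // arbitrary debounce window for the example
}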

alexjg commented 9 months ago

This will work even if saves end up failing out of order, e.g.

  1. client sends change 1, server receives and stores
  2. client sends change 2
  3. server begins flush of change 2
  4. client sends change 3
  5. server begins flush of change 3
  6. change 3 flush succeeds
  7. flush of change 2 fails

Now we're in a situation where the server has change 1 and change 3 on disk.

  1. server restarts
  2. client restarts sync negotiation, determines that the server is missing change 2, and retransmits it (see the sketch below)
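
For completeness, here is a rough sketch of that negotiation using the low-level sync API from @automerge/automerge (initSyncState, generateSyncMessage, receiveSyncMessage). The exact return shapes may vary between versions, so treat this as an outline of the mechanism rather than a drop-in snippet: when the restarted server begins from a fresh sync state, the protocol itself works out which changes the server is missing, and the client retransmits only those.

import * as Automerge from "@automerge/automerge"

// clientDoc holds changes 1-3; serverDoc is whatever survived on disk (1 and 3).
function resync(clientDoc, serverDoc) {
  // Fresh sync states on both sides, as after a reconnect or a restart.
  let clientState = Automerge.initSyncState()
  let serverState = Automerge.initSyncState()

  for (;;) {
    // client -> server
    const [nextClientState, toServer] = Automerge.generateSyncMessage(clientDoc, clientState)
    clientState = nextClientState
    if (toServer) {
      const [nextServerDoc, nextServerState] = Automerge.receiveSyncMessage(serverDoc, serverState, toServer)
      serverDoc = nextServerDoc
      serverState = nextServerState
    }

    // server -> client
    const [updatedServerState, toClient] = Automerge.generateSyncMessage(serverDoc, serverState)
    serverState = updatedServerState
    if (toClient) {
      const [nextClientDoc, updatedClientState] = Automerge.receiveSyncMessage(clientDoc, clientState, toClient)
      clientDoc = nextClientDoc
      clientState = updatedClientState
    }

    // Stop once neither side has anything left to say.
    if (!toServer && !toClient) break
  }

  // serverDoc now contains change 2 again, without the client ever having been
  // told explicitly that the earlier flush failed.
  return [clientDoc, serverDoc]
}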

sroze commented 9 months ago

If there is an error in storage the whole sync server should fall over and then whatever service orchestration you're using would restart it

Okay, so this means that we rely on NodeJS' unhandled exceptions to crash the server, which is not NodeJS' default behaviour. I think it's very risky for the library to expect this to happen. In any reasonably high-throughput scenario, when running NodeJS APIs as Docker containers, I wouldn't expect this to be the case either: a server crash would cause a large number of requests to fail, so I would expect errors to be handled on a per-request basis.

alexjg commented 9 months ago

Interesting, my understanding was that the default behavior for uncaught exceptions in Node is to crash the process, which is implied by this documentation. Have I misunderstood, or is there perhaps a standard practice of disabling this default behavior in some environments (e.g. Docker containers)?

Regardless, I agree: crashing the whole server is undesirable if you expect transient storage failures. Loading a document into memory can be a CPU-bound task, which means that starting the server up again with all the documents that clients were connected to might take a while (we are working hard on making loading much, much faster and, further down the line, on modifying the sync and storage protocols to allow sync servers to not have the document in memory at all).

I don't think we can handle storage failure purely on a per-request or connection basis. If we catch a storage exception then all we know is that some data may have failed to save for a particular document, which means that every connection to that document may think that the server has saved data which it has not saved and so I think we would at least need to restart sync for all the clients connected to the given document. Fortunately, we do have provisions in the sync protocol for restarting sync on an existing connection by resetting the sync state on the restarting peer, so I think what we would need to do is this:

  1. Catch the storage exception and identify the documents which have been affected
  2. Stop processing incoming messages for the documents in question (actually, discard them)
  3. Reload the documents from storage
  4. Reset (encode and decode) the sync states for every peer connected to each document
  5. Generate new outgoing messages for every (peer, document) pair and resume processing sync messages for the documents (see the sketch below)
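
A rough sketch of those five steps, to make the shape of the recovery concrete. Everything marked "hypothetical" below (pauseAndDiscardIncoming, loadFromStorage, connectionsFor, conn.send) is invented for illustration and does not exist in automerge-repo; only encodeSyncState, decodeSyncState and generateSyncMessage are real calls from @automerge/automerge, used here on the assumption that round-tripping a sync state through encode/decode is what "reset" means in step 4.

import * as Automerge from "@automerge/automerge"

// Step 1 happens wherever the storage exception is caught; the affected
// documentIds are passed in here.
async function recoverFromStorageFailure(repo, affectedDocumentIds) {
  for (const documentId of affectedDocumentIds) {
    // Step 2: stop processing (and discard) incoming messages for the document.
    repo.pauseAndDiscardIncoming(documentId) // hypothetical

    // Step 3: reload the document from whatever actually made it to storage.
    const doc = await repo.loadFromStorage(documentId) // hypothetical

    for (const conn of repo.connectionsFor(documentId)) { // hypothetical
      // Step 4: reset the sync state; the encode/decode round trip drops the
      // in-flight bookkeeping while keeping what the peer is known to have.
      conn.syncState = Automerge.decodeSyncState(Automerge.encodeSyncState(conn.syncState))

      // Step 5: generate a fresh outgoing message for this (peer, document) pair.
      const [nextState, msg] = Automerge.generateSyncMessage(doc, conn.syncState)
      conn.syncState = nextState
      if (msg) conn.send(msg) // hypothetical transport call
    }

    // ...and resume processing sync messages for the document.
    repo.resumeIncoming(documentId) // hypothetical
  }
}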

This doesn't help if the storage failure is not transient, though. In that case we will just stop processing and then sit there waiting for the documents to load whilst storage continues to fail. In that situation it seems to me that it would be preferable to crash the server, so that clients and orchestration tools can at least know that something is wrong. This makes me think that if you expect transient storage errors, a better place to handle them might be in the storage adapter, where you can do some kind of backoff on failures?
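
For the transient case, a sketch of what "backoff in the storage adapter" could look like. It assumes the adapter exposes promise-returning save/load/remove/loadRange/removeRange methods, as automerge-repo's storage adapter interface does; the wrapper itself and its retry policy are illustrative, not part of the library.

// Wrap an existing storage adapter so transient failures are retried with
// exponential backoff before being allowed to escape (and crash the server).
function withRetries(adapter, { attempts = 5, baseDelayMs = 100 } = {}) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

  const retrying = async (operation) => {
    let lastError
    for (let attempt = 0; attempt < attempts; attempt++) {
      try {
        return await operation()
      } catch (err) {
        lastError = err
        await sleep(baseDelayMs * 2 ** attempt)
      }
    }
    // Persistent failure: rethrow so the caller (or the process) can fall over
    // loudly rather than silently losing data.
    throw lastError
  }

  return {
    save: (key, data) => retrying(() => adapter.save(key, data)),
    load: (key) => retrying(() => adapter.load(key)),
    remove: (key) => retrying(() => adapter.remove(key)),
    loadRange: (prefix) => retrying(() => adapter.loadRange(prefix)),
    removeRange: (prefix) => retrying(() => adapter.removeRange(prefix)),
  }
}

The wrapped adapter could then be handed to the Repo in place of the original one, keeping the crash-on-persistent-failure behaviour while absorbing brief blips.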

kid-icarus commented 9 months ago

Uncaught exceptions do crash the process; here's a little example:

setInterval(() => {
  if (Math.random() > 0.5) {
    throw new Error('Random error')
  }
  console.log('Hello')
}, 1000)

maybe there's a standard practice of disabling this default behavior in some environments (e.g. docker containers)?

In production apps, you'll often find something like:

process.on('uncaughtException', (err) => {
  // log the error
  // pipe the error to error tracking, e.g. Sentry
  // note: registering this handler overrides Node's default crash-on-uncaught
  // behaviour, so the process keeps running after the exception
})

mykola-vrmchk commented 7 months ago

I don't think we can handle storage failure purely on a per-request or connection basis. If we catch a storage exception then all we know is that some data may have failed to save for a particular document, which means that every connection to that document may think that the server has saved data which it has not saved and so I think we would at least need to restart sync for all the clients connected to the given document.

@alexjg could synchronization of an update to other peers begin only after that update has been persisted to disk? DocHandle could accept some storage callback which it would wait for before emitting the events that trigger network sync.

In such a scenario, we would simply not emit sync events on storage failure. That way no server connection would believe that unsaved data had been persisted, and we could recover gracefully from storage failures.
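
Something like the following sketch, perhaps. The DurableHandle class, the persist callback and the event names are all invented to illustrate the proposal; they are not automerge-repo APIs.

import { EventEmitter } from "node:events"

class DurableHandle extends EventEmitter {
  constructor(documentId, persist) {
    super()
    this.documentId = documentId
    this.persist = persist // e.g. async (documentId, doc) => write to storage
  }

  async applyChange(newDoc) {
    this.doc = newDoc
    try {
      // Wait for storage to confirm the write before telling anyone about it.
      await this.persist(this.documentId, newDoc)
    } catch (err) {
      // Storage failed: emit nothing that triggers network sync, so no peer
      // ever believes the server durably holds data it never stored.
      this.emit("storage-failure", { documentId: this.documentId, error: err })
      return
    }
    // Only now let the sync machinery pick the change up and fan it out.
    this.emit("change", { documentId: this.documentId, doc: newDoc })
  }
}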