sroze opened 10 months ago
Good questions.
`NetworkAdapter`s should emit `"peer-disconnected"` when a connection is lost, which will notify the `Repo` to stop sending messages on that connection. Obviously, disconnection often requires waiting for a timeout in the underlying transport.

To the second point, I have played with various implementations of "flush before ack" type things, but I'm still uncertain whether it's worth the performance hit and implementation complexity. One of the great things about local-first architectures is that you can be less dependent on the reliability of a single machine. Say the receiving peer is a sync server: if that peer falls over before saving to storage, then as long as some peer has the data, they will resend it to the sync server when it comes back up. This is not to say that we shouldn't implement some kind of enhanced durability mode, but I would like to have a motivating use case for it; so far it hasn't been needed, so I don't know what the API should look like. In that vein, would you be able to describe your architecture and what kind of data loss you are concerned about?
Very clear, thank you @alexjg. Just so I understand your point though, what would happen in the following scenario:
To prevent data loss, during 3, the changes from 2 & 3 need to be persisted. How would the client know that, during "3", the "client peer" needs to send both the changes made in 2 & 3? If it doesn't, how are we preventing data loss? Sorry if these seem like basic questions; I'm not fully grasping the exchanges between peers.
I'm not quite sure your scenario is correct. If there is an error in storage, the whole sync server should fall over, and then whatever service orchestration you're using would restart it. The error may still be asynchronous though, because we throttle flushes, which means the scenario might be more like this:
This will work even if saves end up failing out of order, e.g.
Now we're in a situation where the server has change 1 and change 3 on disk.
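A rough illustration of the timing issue (the names and shape here are hypothetical, not automerge-repo's actual flush code): with a debounced flush, several changes share one `save()` call, so its failure is only observed later, out of band.

```ts
// Several changes are coalesced into one save() call; by the time that call
// rejects, the caller has long since moved on and treated the changes as handled.
function makeDebouncedFlush(
  save: (batch: Uint8Array[]) => Promise<void>,
  delayMs = 100
) {
  let pending: Uint8Array[] = []
  let timer: ReturnType<typeof setTimeout> | undefined

  return (change: Uint8Array) => {
    pending.push(change)
    if (timer) return
    timer = setTimeout(() => {
      const batch = pending
      pending = []
      timer = undefined
      // A rejection here surfaces asynchronously, after later changes may
      // already have been queued for the next flush.
      save(batch).catch(err => console.error("flush failed", err))
    }, delayMs)
  }
}
```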
> If there is an error in storage, the whole sync server should fall over, and then whatever service orchestration you're using would restart it
Okay, so this means that we rely on Node.js' unhandled exceptions to crash the server, which is not Node.js' default behaviour. I think it's very risky for the library to expect this to happen. In any reasonably high-throughput scenario, when running Node.js APIs as Docker containers, I wouldn't expect this to be the case either: a server crash would cause a large number of requests to fail, so I would expect errors to be handled on a per-request basis.
Interesting, my understanding was that the default behavior for uncaught exceptions in Node is to crash the process, which is implied by this documentation. Have I misunderstood, or maybe there's a standard practice of disabling this default behavior in some environments (e.g. Docker containers)?
Regardless, I agree: crashing the whole server is undesirable if you expect transient storage failures. Loading a document into memory can be a CPU-bound task, which means that starting the server up again with all the documents that clients were connected to might take a bit of time (we are working hard on making loading much, much faster, and further down the line on modifying the sync and storage protocols to allow sync servers to not have the document in memory at all).
I don't think we can handle storage failure purely on a per-request or connection basis. If we catch a storage exception, then all we know is that some data may have failed to save for a particular document, which means that every connection to that document may think that the server has saved data which it has not saved, and so I think we would at least need to restart sync for all the clients connected to the given document. Fortunately, we do have provisions in the sync protocol for restarting sync on an existing connection by resetting the sync state on the restarting peer, so I think what we would need to do is this:
This doesn't help if the storage failure is not transient, though. In that case we will just stop processing and sit there waiting for the documents to load whilst storage continues to fail. In that situation it seems to me that it would be preferable to crash the server so that clients and orchestration tools can at least know that something is wrong. This makes me think that if you expect transient storage errors, a better place to handle them might be in the storage adapter, where you can do some kind of backoff on failures?
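For what it's worth, a minimal sketch of that last idea: wrap an existing storage adapter so that transient failures are retried with exponential backoff before anything bubbles up to the repo. The adapter shape below (`load`/`save`/`remove` keyed by string arrays) is an assumption for illustration, not a definitive reflection of automerge-repo's storage interface.

```ts
type StorageKey = string[]

// Assumed shape of a storage adapter; the real interface may differ.
interface StorageLike {
  load(key: StorageKey): Promise<Uint8Array | undefined>
  save(key: StorageKey, data: Uint8Array): Promise<void>
  remove(key: StorageKey): Promise<void>
}

class RetryingStorageAdapter implements StorageLike {
  constructor(private inner: StorageLike, private maxAttempts = 5) {}

  // Retry an operation with exponential backoff; rethrow once we give up,
  // so genuinely non-transient failures still surface to the caller.
  private async withRetry<T>(op: () => Promise<T>): Promise<T> {
    let delay = 100
    for (let attempt = 1; ; attempt++) {
      try {
        return await op()
      } catch (err) {
        if (attempt >= this.maxAttempts) throw err
        await new Promise(resolve => setTimeout(resolve, delay))
        delay *= 2
      }
    }
  }

  load(key: StorageKey) { return this.withRetry(() => this.inner.load(key)) }
  save(key: StorageKey, data: Uint8Array) { return this.withRetry(() => this.inner.save(key, data)) }
  remove(key: StorageKey) { return this.withRetry(() => this.inner.remove(key)) }
}
```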
Uncaught exceptions crash the process, here's a little example:

```js
setInterval(() => {
  if (Math.random() > 0.5) {
    throw new Error('Random error')
  }
  console.log('Hello')
}, 1000)
```
> maybe there's a standard practice of disabling this default behavior in some environments (e.g. Docker containers)?
In production apps, you'll often find something like:

```js
process.on('uncaughtException', (err) => {
  // log the error
  // pipe it to error tracking, e.g. Sentry
})
```
> I don't think we can handle storage failure purely on a per-request or connection basis. If we catch a storage exception, then all we know is that some data may have failed to save for a particular document, which means that every connection to that document may think that the server has saved data which it has not saved, and so I think we would at least need to restart sync for all the clients connected to the given document.
@alexjg could synchronization of an update to other peers begin only after that update is persisted on disk? `DocHandle` could accept some storage callback which it would wait on before emitting the events that trigger network sync.
In such a scenario, we would avoid emitting sync events on storage failure. Thus no server connection would learn about unsaved data, and we could gracefully recover from storage failure.
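To make the proposal concrete, here is a purely illustrative sketch (every name in it is hypothetical, and this is not how automerge-repo currently behaves): persist a change first, and only announce it to peers once the write has succeeded.

```ts
type SaveFn = (key: string[], data: Uint8Array) => Promise<void>
type AnnounceFn = (documentId: string) => void

// Persist the change, then let the network layer sync it. If persistence
// fails, peers are never told about data the server does not actually hold,
// so the sync can simply be retried later.
async function persistThenSync(
  save: SaveFn,
  announce: AnnounceFn,
  documentId: string,
  change: Uint8Array
): Promise<boolean> {
  try {
    await save([documentId, "incremental", String(Date.now())], change)
  } catch (err) {
    console.error(`save failed for ${documentId}; withholding sync`, err)
    return false
  }
  announce(documentId)
  return true
}
```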
Looking at the code, I'm trying to figure out how much I can trust automerge-repo from a durability standpoint, and I'd love a few pointers on the pieces I don't understand.
`Repo`'s use of the storage subsystem is out-of-band (in order to debounce). As such, does it mean that once the message has been sent and ack'ed by the `Repo`, there is a risk of data loss, because it might not have been persisted by the server? Thank you so much 🙏