Automerge 1.0 sync protocol

ept commented 4 years ago

Hi everyone, I would like to gauge opinions on a design decision that needs to be made for the performance branch and Automerge 1.0. It is particularly relevant for people working on networking layers for Automerge, such as Hypermerge (@pvh) and Cevitxe (@HerbCaudill).

We have long talked about using Git-style hash chaining to capture dependencies between changes (#27, #200), instead of the current system of referencing dependencies by their actorId and sequence number. This avoids bugs due to applications inadvertently reusing actorIds or sequence numbers (e.g. #234). Hash chaining is now implemented as part of the compressed binary format (#253), and it seems to be working well in terms of document storage.

However, we now need to figure out the impact on Automerge networking layers. When two peers connect to each other, or when a client connects to a server, they need to figure out which changes they need to send each other in order to get in sync. At the moment, this is typically done using vector clocks: that is, a mapping from actorIds to sequence numbers, indicating the highest sequence number we've seen from a given actorId. If you receive a vector clock and you have some changes with higher sequence numbers for a given actorId, you send those changes to the other node. This makes for nice and simple protocols. It also generalises to variants: for example, Hypermerge uses a Hypercore (an append-only log) per actorId.

I am keen to keep the core Automerge library network-protocol-agnostic, so that it can be paired up with different networking layers depending on the needs of particular applications. But we do need to figure out the core APIs that Automerge needs to provide in order to allow the networking layers to do their work. For example, #284 adds an explicit vector clock API (previously you had to mess around with the internal data structures to get the vector clock). What do we want this API to look like for Automerge 1.0?

One option is to keep vector clocks mostly as they are. This is possible because the binary format still has actorIds and sequence numbers — they are just no longer being used to track dependencies between changes.
Alternatively, we can provide a different API that networking layers can use to figure out what changes to send, which is based solely on hashes. This can be done: @heidihoward and I just published a paper containing algorithms to do just that (paper, blog post).

Advantages of vector clocks:

Existing networking layers for Automerge don't need to change. I know @HerbCaudill has advocated keeping vector clocks for this reason.
Simple algorithm, easy to understand and implement. For example, you can implement a WebSocket server in any language, without having to import the whole of Automerge, and that server can sync changes between Automerge clients (e.g. #285).

Disadvantages of vector clocks:

Vector clocks grow in size to include every actorId that has ever edited a document. If the application has many short-lived editing sessions, each of which generates a new actorId, this could become a noticeable cost. Even more so if we create new actorIds for individual fine-grained change suggestions, as @pvh has proposed. (It's possible to compress vector clocks by leaving out transitive dependencies, but then the synchronisation algorithm is no longer so simple, because it has to traverse the dependency graph.)
With a vector-clock-based sync algorithm, we have to deal with the possibility that different nodes have different changes for the same actorId and sequence number, because actorIds or sequence numbers were reused for whatever reason. For example, two users may have the same vector clock, but may not actually have the same changes. Hashes allow us to detect that this has happened, but what do we do then? Just give up and throw an exception, or move ahead and try to sync and merge them anyway?

Sequence number reuse could happen for various reasons:

A bug in the application causes it to accidentally use an old version of an Automerge object.
In a local-first application, a user restores a document file from backup, causing it to revert to an earlier state in which some changes haven't happened yet. When the user makes changes to the restored document, they end up using sequence numbers that had been used before. (This assumes the actorId is persistent across editing sessions.)
Someone deliberately creates changes with conflicting sequence numbers to mess with other users. In an open system such as a wiki, this sort of vandalism is not inconceivable.

If we are going to keep vector clocks as the basis of the sync algorithm, it might make sense at least to augment them with hashes, so that nodes can detect inconsistencies. I also think including some defenses against malicious behaviour might be wise. For example, if malicious user Mallory can sync one change with Alice, and a different change with the same actorId/seq with Bob, it would be annoying for Alice and Bob if they can now no longer sync with each other. For this reason, I think we should not just throw an exception if an inconsistency is detected, but actually try to get the nodes to sync nevertheless.

If we need to recover from such inconsistencies, the simple vector clock protocol is not sufficient. Once we have figured out that two nodes have the same vector clock, but different hashes, we have to search backwards in the hash graph until we find the place where the nodes have diverged. Doing this naively can result in waiting for a lot of round-trips.

The protocol that Heidi and I have designed addresses this problem: it allows two nodes with arbitrary hash DAGs to get in sync efficiently. In most cases, it requires only one round trip, regardless of the lengths of the hash chains, and it has minimal overheads. It has a small probability of requiring more than one round trip (in our experiments, that probability is about 3%; it can be adjusted by tuning parameters of the algorithm). If we do want to be able to detect inconsistencies and recover from them, I think this protocol is the best approach.

The question then is what the API between Automerge and the networking layers should look like. In #284 we have getClock() to get a vector clock that can be sent over the network, and getMissingChanges() where we pass in a vector clock and get back the changes that are newer than that clock. We can keep the same pattern where there is one function to get a message that you can send, and another function where you pass in that message to get a list of newer changes. The hash graph traversal parts of the protocol can then be implemented in Automerge core, whereas all the I/O is in a separate layer, as it is now.

This matter is also relevant for storage layers. For example, if you are storing changes in a database, and a client sends you their vector clock, you can do a database query on actorIds and sequence numbers to find the changes that are newer than that vector clock. If we move to a hash-chain-based protocol, these database queries become more complicated. It would still be possible for a storage server to parse the changes to determine the dependency graph (without having to necessarily run a full instance of Automerge), and to find the required changes in response to a query, but it would be a bit more complicated. We could provide a separate, smaller library that performs just this logic.

To summarise, I think the viable options are:

Continue using vector clocks as they are now, with no hash-based detection of inconsistencies, and just cross our fingers and hope that no inconsistencies are introduced, as we have been doing until now.
Use Heidi and my protocol to sync up nodes based on their hash graphs, which is resilient to arbitrarily misbehaving nodes.
Support both options 1 and 2. Not my preferred option, since more choices make things harder for users, but possible if we can't find agreement.

I think other alternatives, such as augmenting vector clocks with hashes and throwing an exception if the hashes don't match, are less desirable, because they can leave users in a state where their app stops working and they can't easily do anything about it.

If there is interest in option 2, I can write up a more detailed API spec to define the interface between core Automerge and the networking layer. Please let me know your opinions!

HerbCaudill commented 4 years ago

The only obstacles to getting rid of vector clocks seem to amount to (a) my original whinge about having to rewrite some code 😄, and (b) the slight uptick in complexity when dealing with Automerge outputs in contexts where we don't have Automerge around.

As far as (a) goes, I formally retract my objection - it's not that big a lift to rework that code, and I certainly wouldn't want this project to be locked into decisions made before version 1, just because I experimented with prerelease code.

That leaves (b) which doesn't seem like a dealbreaker either, whether at the networking level or at the storage level. In fact it seems like something that you'd really want to discourage. I really shouldn't be using my knowledge of Automerge's internals to muck about with its outputs on my own -- that's what got us in this situation in the first place. It seems much cleaner to keep that logic internal to the library.

HerbCaudill commented 4 years ago

For example, you can implement a WebSocket server in any language, without having to import the whole of Automerge, and that server can sync changes between Automerge clients (e.g. #285).

FWIW I don't know of any servers intended for use with Automerge that actually get involved in the synchronization process - Cevitxe doesn't work that way, and last time I looked Hypermerge didn't. Cevitxe's "signal server" only cares about topic IDs, which are external to Automerge; all synchronization logic resides in the clients.

pvh commented 4 years ago

How much have you considered privacy here, Martin? I'm considering whether the Bloom filters might leak data a user may not want to share, say in the case of partial synchronization.

ept commented 4 years ago

Thanks for your comments so far!

How much have you considered privacy here, Martin?

I think from a privacy point of view, there is not a big difference: both techniques expose information about who has made how many changes (and how big those changes were) over what timeframe. I am keen to improve privacy (in particular, we have a masters student who over the coming months will be exploring ways of avoiding exposing users' IP addresses during peer discovery) but I think those questions are orthogonal to this issue. Also, networking layers can add end-to-end encryption without significantly affecting the sync protocol.

pvh commented 4 years ago

I'm concerned about the inability to guarantee a single round-trip transfer of the needed data. It's hard to see how you can have efficient asynchronous (via usb-stick, or email, or whatever) data exchange.

I also think it's important to realize that no real-world automerge systems operate on synchronizing single documents. A synchronization protocol really needs to account for the fact that users are likely going to have many automerge documents that need synchronizing, particularly when you consider that hypermerge and cevitxe both encourage users to use many small documents to model state. In hypermerge we wound up exchanging lists of all the documents each peer had (using a variant of the socialist millionaire protocol) and then comparing vector clocks for shared documents.

With your new approach, I can imagine a larger Bloom filter that would allow comparing whole repositories at once, but this would elevate these side-channel concerns.

ept commented 4 years ago

I'm concerned about the inability to guarantee a single round-trip transfer of the needed data. It's hard to see how you can have efficient asynchronous (via usb-stick, or email, or whatever) data exchange.

This use case can be satisfied by copying the whole compressed Automerge document into the usb stick/email, and we can have a merge function that takes two such documents and combines their changes into one. Thus, no interactive communication is needed.

A similar approach could be taken by networking layers as well: rather than figuring out which changes to send each other, two peers could just send each other the latest state of their entire document. This is not actually as bad as it sounds, because the binary format is so compact. But repeatedly sending an entire document would not be great for real-time collaboration, so we still need a mode that sends only incremental changes.

This does suggest a new option that I didn't consider previously: we could stick with vector clocks, and augment them with hashes that allow two peers to check whether they really are in the same state. If an inconsistency is detected, the peers send each other their entire compressed documents, and merge those, thereby resolving the inconsistency. This approach still has the downside that vector clocks can get big, but it would guarantee that the protocol completes in one round trip.

I also think it's important to realize that no real-world automerge systems operate on synchronizing single documents. A synchronization protocol really needs to account for the fact that users are likely going to have many automerge documents that need synchronizing

Good point. However, do you also want to choose what to share with other users on a document-by-document basis? If so, we have to treat each document's hash graph as independent from that of any other document (we can't combine changes from multiple documents into a single graph). A networking layer can and should sync many documents over a single connection, but I'm not sure what Automerge core can do to facilitate that. Do you have any suggestions on how multi-document apps can be better supported?

pvh commented 4 years ago

This does suggest a new option that I didn't consider previously: we could stick with vector clocks, and augment them with hashes that allow two peers to check whether they really are in the same state. If an inconsistency is detected, the peers send each other their entire compressed documents, and merge those, thereby resolving the inconsistency. This approach still has the downside that vector clocks can get big, but it would guarantee that the protocol completes in one round trip.

It's not clear to me yet which approach would be better. I think a lot depends on the specifics of the environment for the application. Is this a decision with long-term consequences or one that is hard to amend?

I also think it's important to realize that no real-world automerge systems operate on synchronizing single documents. A synchronization protocol really needs to account for the fact that users are likely going to have many automerge documents that need synchronizing

Good point. However, do you also want to choose what to share with other users on a document-by-document basis? If so, we have to treat each document's hash graph as independent from that of any other document (we can't combine changes from multiple documents into a single graph). A networking layer can and should sync many documents over a single connection, but I'm not sure what Automerge core can do to facilitate that. Do you have any suggestions on how multi-document apps can be better supported?

Thus far, in Hypermerge, our approach has been to identify all "documents" with an ID, and assume that a document is a sharing scope. It would be possible for the actors in a document to be reused in other documents, but although we experimented with this in past projects the current hypermerge implementation does not really support this.

In the current hypermerge design, we assume a preference for aggressive synchronization and attempt to share all missing changes in mutually recognized documents between peers. I imagine a single bloom filter could be used for this.

ept commented 4 years ago

Is this a decision with long-term consequences or one that is hard to amend?

It has a bunch of implications for how a document is laid out internally, and if we want to change the sync protocol it will create a compatibility issue for any nodes that want to communicate while running different versions. It's probably not super hard to change in the future, but I would prefer if we could agree at least a tentative plan to factor into the roadmap for getting us to 1.0. And I see this as an opportunity to make networking layers better too!

In the current hypermerge design, we assume a preference for aggressive synchronization and attempt to share all missing changes in mutually recognized documents between peers. I imagine a single bloom filter could be used for this.

Using a single Bloom filter for multiple documents wouldn't really make any performance difference, because the Bloom filter is already sized dynamically based on the number of changes it contains. A document that has not been updated since the last sync needs 0 bits of Bloom filter space.

To sync a repository containing many documents, the simplest approach would be to send the head hashes and Bloom filter for each document ID every time a connection is established, and thereafter send any incremental changes as they appear. Since the head hashes are constant-size, and the Bloom filters take no space for a document that hasn't changed, this initial sync would take <100 bytes of network bandwidth per unchanged document. This approach would scale to thousands of documents. The protocol could further optimise this by aggregating the head hashes from multiple documents into a single hash (e.g. building a Merkle tree).

josharian commented 4 years ago

A few thoughts, with the caveats that I am new to Automerge, and currently only using it for a side project. (However, my day job is at Tailscale, a mesh VPN, which helps with some local-first software networking issues, like connectivity, NAT traversal, point-to-point security, and identity guarantees.)

Security and resilience to bugs are very important qualities. To my mind they are a compelling reason to choose option (2).

One important related question is compatibility guarantees. Will the network layer be stable after 1.0? And will it be recorded independently as a spec, as opposed to just "whatever the implementation does"?

Pros to compatibility guarantees:

No need to work around version skew. For peer to peer communications, this is a significant factor, both because of the cost and testing cost, and because you can end up burning 1 RTT doing version negotiation. With a server in the middle, it can do on-the-fly translation, which is still unfortunate, but a bit less so.
People can confidently write new server implementations. (I'd be reasonably likely to write a Go implementation, assuming moderate spec complexity.)

Cons:

Loss of flexibility.
Extra short-term work writing a spec and (ideally) language-agnostic test fixtures.
The (clever) Bloom filter implementation doesn't have many miles on it yet, which makes it riskier to bake in for the long term.

One last desideratum. From my perspective, it would be nice (but not required) for a networking layer to support a store-and-forward/mailbox style server implementation to enable collaboration among peers who are not necessarily online at the same time, but without storing the entire document long term on the server. (For optimal privacy, messages could even be encrypted.) Peers could optimistically post new changes to the server, to be broadcast to all other peers who connect within some time window, and also be able to request missing changes since state x and have them show up at some point later once the peers with any missing changes since x have checked in. From cursory consideration, the Bloom filter implementation is slightly worse in this regard than vector clocks, for two reasons. (1) There's no way to simply broadcast new changes that you know no one else has, because they are newly created locally. (2) It sometimes requires multiple RTTs, and in a store-and-forward world, latency can be measured in days. Adding a side API to support calculating, serializing, broadcasting, and applying specific new changes seems fairly unproblematic. Possible multiple RTTs might just be a fact of life, and a cost worth paying for bug resilience and security.

ept commented 4 years ago

Thanks @josharian for your thoughtful comments.

my day job is at Tailscale

Ah, nice! I enjoyed your colleague's write-up of NAT traversal.

Will the network layer be stable after 1.0? And will it be recorded independently as a spec, as opposed to just "whatever the implementation does"?

The data format will be stable and documented. The network layer will continue to be separate from Automerge core, since people may want to use WebSocket, or WebRTC, or XMPP, or email, or a USB drive, or a variety of other forms of communication. But the idea is that there is one stable data format for changes and documents, and this can be transmitted over any type of network.

support a store-and-forward/mailbox style server […] There's no way to simply broadcast new changes that you know no one else has, because they are newly created locally.

Broadcasting new changes should be possible in any case, regardless of whether we use vector clocks or hash chains. After doing Automerge.change you get the change encoded as a byte array, which you can disseminate through any protocol. The difference between vector clocks and hash chains only shows up when establishing a connection between two nodes that have been disconnected for a while, and now they need to figure out how to get in sync.

It sometimes requires multiple RTTs, and in a store-and-forward world, latency can be measured in days

Rather than using the Bloom filter protocol end-to-end from client to client, it might be better to use the protocol to sync up one client with the server, and a separate run of the protocol to sync a different client with the server. This does mean that the server needs to speak the Bloom filter protocol, but it doesn't need to necessarily run a full instance of Automerge. This approach would even work in conjunction with end-to-end encryption, as long as we keep the dependency hashes unencrypted, and only encrypt the rest of each change message.

pvh commented 4 years ago

Hello and welcome, @josharian! I'd love to talk more about networking some time too.

As noted up-thread, the synchronization protocol makes most sense if you think about it as an optimization for live connections rather than a requirement.

In the worst case, send/fetch everything and then merge results. That said, I like @HerbCaudill's "what would it take to reach an inconceivably small risk of inaccurate results" exercise, and I have not yet seen any real-world use-cases where exchange costs are dominated by vector clock size in the current model.

ept commented 4 years ago

Here's a first proposal for what a sync API might look like:

Automerge.getCurrentVersion(doc) returns hashes of the most recent changes (may return multiple hashes if the most recent changes are concurrent). These unambiguously identify the current state of the document. If two nodes return the same result from this function, their documents will be in the same state. And conversely, two nodes that have applied the same set of changes (even in different order) will return the same result from this function.
Automerge.syncRequest(doc, lastSync) starts a sync with another node. lastSync is the document version (as returned by getCurrentVersion) when we last synced with this node (can be omitted if we have not synced with this node before). Returns the message to send to the other node as an Uint8Array. The document is not modified.
Automerge.syncResponse(doc, message) processes a message received from another node during sync. Returns an object {doc, response} where doc is the updated state of the document, and response is a message to send back to the other node (as an Uint8Array, or null if the sync is complete and no further message needs to be sent).

All the details of Bloom filters and hashes etc. are encapsulated in the messages; there is no need to expose these in the API. The changes are also encoded in the messages. A networking layer can sync many documents over a single connection by getting requests and responses for each document.

The above sync functions are for the case where two nodes have been disconnected for a while, and need to figure out what to send to each other to get consistent. In addition to the above, we will still have getChanges() and applyChanges() as they exist today (except using Uint8Arrays instead of JSON). These are good for real-time collaboration: whenever you make a change locally, use getChanges() to get an encoding of the change you just made, send it over the network, and every other node calls applyChanges() with that change when it is received. Moreover, we'll need a function for merging entire documents, as discussed in the thread.

This sync API would be quite simple to use, but I fear it may be too high-level for some use cases. For example, when syncing a repository with many documents, we may want to ensure that only documents that are modified need to be loaded from disk into memory, and unchanged documents do not need to be loaded from disk. I'm not sure how best to support that kind of pattern. Though that's probably something we can come back to later.

pvh commented 3 years ago

Currently, Hypermerge doesn't need to load documents at all to handle synchronization. If we know the target clock, we simply request all missing data. I think this property is extremely valuable, and agree that the proposed interface doesn't provide enough structure to its consumer.

Also, as you note, this API is not really a good fit for synchronizing multiple documents at once, which all of our applications have done.

I wonder whether this API could be improved simply by creating less opaque types for the input and output? For example, could the input be not a message but a "list of heads" and the response be a list of the changes that (probably) need sending?

HerbCaudill commented 3 years ago

This API makes a lot of sense to me. It seems to me that it gives a multi-document repo everything it needs to sync up.

Currently, Hypermerge doesn't need to load documents at all to handle synchronization. If we know the target clock, we simply request all missing data.

@pvh I'm not clear why this would be any different - instead of storing each document's clock, you store its hash. You can still sync a bunch of documents at once by collecting a bunch of sync messages into a single message, and on the other end collect all the change sets into a single response.

pvh commented 3 years ago

I can calculate vector clock differences trivially and without loading all the data into memory. It seems to me that in order to figure out what data I need to send I need to instantiate the document and ask automerge to traverse the full history, building the bloom filter. Maybe those can be saved and incrementally updated?

lightsprint09 commented 3 years ago

Can we close this?

ept commented 3 years ago

@lightsprint09 Yes! I will close it now, since it's implemented in #339. Thanks for the reminder.

automerge / automerge-classic

Automerge 1.0 sync protocol #290