automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

Sync protocol implementation (second version) #332

Closed. ept closed this 3 years ago.

ept commented 3 years ago

Here is a second attempt at the sync protocol discussed in #290, superseding the first attempt in #324, after discussion with @pvh @orionz @alexjg. It addresses several issues with the first protocol, in particular, making the API more functional (avoiding side-effects) and giving the networking layer more control over encoding and scheduling of communication.

There is now only one type of message, a SyncMessage with the following structure:

interface SyncMessage {
  heads: Hash[]
  need: Hash[]
  have: SyncHave[]
}

interface SyncHave {
  lastSync: Hash[]
  bloom: Uint8Array
}

type Hash = string // 64-digit hex string

When a node sends one of these messages, it has the following meaning: heads contains the hashes of the current heads of the sender's document; need lists the hashes of any changes that the sender knows it is missing and would like to receive; and have summarises the changes that the sender already has (explained in more detail below).

When two nodes connect to each other, they initially both send each other a SyncMessage to inform the other about their current state. If the two nodes are already in sync, there are no further messages. If the nodes need to send each other changes, they do so in response to this initial message, and the changes are also accompanied by another SyncMessage. Most of the changes are sent in this first response, but in some edge cases there can be a bit more back and forth of changes and SyncMessages until we have silence.

Whenever a node updates its document (because it updated the document itself, or because it received changes from another node), it informs all the nodes it is connected to by sending them each a SyncMessage, in exactly the same way as the initial message on connection. This triggers a fresh run of the protocol. Local changes can immediately be sent to all connected nodes, while changes received from another node should be exchanged through this protocol to avoid having a node unnecessarily receive the same changes several times from different nodes. If a document update happens in the middle of a sync protocol run, that's fine too.
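
In code, that notification step might look like the following sketch, using the backend functions introduced below (connectedPeers and peer.lastSync are assumptions about the networking layer's own bookkeeping, not part of the Automerge API):

// Whenever the local document changes (either by local edits or by applying
// changes received from a peer), restart the protocol with every connected peer:
for (const peer of connectedPeers) {
  peer.send(Backend.encodeSyncMessage(Backend.syncStart(state, peer.lastSync)))
}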

Automerge implements the protocol with the following backend functions, four of which are new:

// New functions
function decodeSyncMessage(bytes: Uint8Array): SyncMessage
function encodeSyncMessage(message: SyncMessage): Uint8Array
function syncStart(state: BackendState, lastSync?: Hash[]): SyncMessage
function syncResponse(state: BackendState, message: SyncMessage): [SyncMessage, Uint8Array[]]

// Existing functions (getMissingDeps has been changed to take some extra arguments)
function getHeads(state: BackendState): Hash[]
function getMissingDeps(state: BackendState, changes?: Uint8Array[], heads?: Hash[]): Hash[]

Note that none of these functions modify the document state: they are all side-effect-free. The only effect on a document happens when the recipient of some changes takes them and calls Automerge.applyChanges() to apply them to a document.
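
To illustrate the intended call pattern, here is a rough sketch of a single protocol round (Backend, state, lastSync, socket, and receivedBytes are stand-ins for whatever the application provides; as discussed below, a real implementation must also suppress duplicate messages):

// On connecting to a peer, announce our current state
// (lastSync is remembered from a previous session, or undefined on first contact):
socket.send(Backend.encodeSyncMessage(Backend.syncStart(state, lastSync)))

// On receiving a SyncMessage from the peer, compute a response.
// None of these calls mutate the backend state:
const message = Backend.decodeSyncMessage(receivedBytes)
const [reply, changes] = Backend.syncResponse(state, message)
socket.send(Backend.encodeSyncMessage(reply))
for (const change of changes) socket.send(change)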

Using the protocol does require the networking layer to keep some additional state: in particular, it must accumulate any changes that are received (using getMissingDeps() to identify any missing changes), and update the document when the set of changes is complete. It also needs to remember the heads from the last completed sync, and it needs to take some steps to avoid duplicate messages (since the functions above are stateless, they don't know whether an identical change or SyncMessage has already been sent previously). The intention is that, with the API provided, this becomes a fairly simple state machine.

A simple example of such a state machine appears in the class SyncPeer at the top of test/sync_test.js, and this state machine is used extensively in the tests. I have not made the state machine part of Automerge's public API, because I suspect that different networking layers will want to implement the protocol state machine slightly differently (e.g. a client-server WebSocket implementation may well have different needs from a peer-to-peer implementation). My hope is that the API in this PR is general enough to satisfy all those use cases, but also powerful enough that networking layers don't need to do too much extra work.
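
For a sense of its shape, here is a heavily condensed sketch of such a state machine (hypothetical names; Backend stands for the module exposing the functions above, and the real SyncPeer in test/sync_test.js handles more details, such as suppressing duplicate messages):

class MiniSyncPeer {
  constructor(state, send, apply) {
    this.state = state     // Automerge backend state
    this.send = send       // transmits a Uint8Array to the peer
    this.apply = apply     // applies a batch of changes (e.g. via Automerge.applyChanges())
                           // and returns the updated backend state
    this.lastSync = []     // heads at the last completed sync with this peer
    this.pending = []      // received changes whose dependencies are not yet satisfied
  }

  connect() {  // called when the connection is established
    this.send(Backend.encodeSyncMessage(Backend.syncStart(this.state, this.lastSync)))
  }

  onSyncMessage(bytes) {  // called when the peer sends us a SyncMessage
    const [reply, changes] = Backend.syncResponse(this.state, Backend.decodeSyncMessage(bytes))
    this.send(Backend.encodeSyncMessage(reply))
    for (const change of changes) this.send(change)
  }

  onChange(changeBytes) {  // called when the peer sends us a change
    this.pending.push(changeBytes)
    if (Backend.getMissingDeps(this.state, this.pending).length === 0) {
      // all dependencies are satisfied, so the changes can now be applied
      this.state = this.apply(this.pending)
      this.pending = []
      this.lastSync = Backend.getHeads(this.state)
    }
  }
}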

Some discussion points:

Internally, each entry of the have field of a SyncMessage contains two things: lastSync are the heads at the time of the last successful sync with the recipient of the message, and bloom is a Bloom filter summarising all of the changes that the sender of the message has added since then. The details are explained in a paper. The API exposes Bloom filters only as byte arrays without any further structure, since the manipulation and interpretation of Bloom filters is completely encapsulated in the API.

syncResponse() returns all changes that it knows about and that satisfy at least one of the following conditions:

  1. the change's hash appears in the need field of the received message;
  2. the Bloom filter in the received have field indicates that the recipient does not have the change;
  3. the change depends on another change that satisfies one of these conditions.

I believe that changes received from another node should not be applied to the document until all dependencies are satisfied. There are two reasons for this:

  1. A missing dependency (a "hole") may never be filled in. Say you partially sync a change history with another node, but that node disconnects before all of the dependencies are filled in; moreover, that node never comes back again and never syncs with any other node either. We will then forever have this incomplete change history, as no node will ever be able to fill in the hole; we will forever keep asking for the missing change in the need field, but the request will never be satisfied. Having such eternal holes also makes it difficult to determine when the sync with another node has completed. Perhaps we could give up on holes after a timeout, but that would make the protocol more complicated.
  2. In the case where one node sends us an incomplete change history and then disconnects, it may seem that, for efficiency, we should keep the partial change history we received from that node and try to fill in the holes by syncing with another node. Even when this is possible, because another node has the missing changes, it doesn't work well with the Bloom filter sync protocol. The reason is that if one change is not present in the Bloom filter, we assume that all changes that depend on it are also not present. This assumption is important for getting good performance from the Bloom filters, since it drastically reduces the impact of false positives. With this rule about dependencies (see the sketch below), any changes that depend on the hole will get re-sent anyway, even if they are present in the Bloom filter, so keeping the partial history does not actually avoid re-sending changes we already have.
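
To make the dependency rule concrete, here is a sketch of the selection logic that syncResponse() encapsulates (hypothetical helper names: bloomMayContain queries the received Bloom filter, and need is the received need field):

// changes must be iterated in causal order, dependencies before dependents
function selectChangesToSend(changes, bloomMayContain, need) {
  const sending = new Set()
  for (const change of changes) {
    const dependsOnSent = change.deps.some(dep => sending.has(dep))
    // send a change if it was explicitly requested, if the recipient appears
    // not to have it, or if it depends on another change we are already sending
    if (need.includes(change.hash) || !bloomMayContain(change.hash) || dependsOnSent) {
      sending.add(change.hash)
    }
  }
  return changes.filter(change => sending.has(change.hash))
}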

Nevertheless, when performing a large sync, it would be annoying if we can't resume a partially completed sync that got interrupted due to an unreliable network connection, or if we can't display any data in an app until it has fully synced. To mitigate this, I think we could break a large sync into several smaller syncs, performed sequentially; then not much is lost if we disconnect during one of these smaller syncs. For example, each document or each head could be a separate sync. If there is a really long change history for a document, we could even break it up into chunks (say, sync 1MB of changes at a time, rather than jumping all the way to the heads in one sync).

pvh commented 3 years ago

A thought on "unfillable holes": perhaps these could be garbage collected during a save() compaction? This would let them hang around for a little while in the hope of being restored, but at whatever interval a developer chooses to run a compaction and save() the file, they'd be polished away.

ept commented 3 years ago

@pvh With "garbage collecting holes", do you mean discarding all changes that depend transitively on the hole? That is possible, and in fact I believe the current save() already does this, since it only stores changes that are reachable from the heads, and changes with missing dependencies don't count towards the heads (they are only enqueued but not applied to the document). However, I'm still doubtful that it's a good idea to rely on this behaviour.

orionz commented 3 years ago

What about the idea of resolving holes by having applyChanges reject changes with missing deps (and specify which holes)?

orionz commented 3 years ago

Doing so means we can remove the getMissingChanges API call and push the responsibility for hole management onto the developer.

ept commented 3 years ago

What about the idea of resolving holes by having applyChanges reject changes with missing deps (and specify which holes)?

I think that would be a good idea; we've previously had cases of people getting confused when they called applyChanges and the document did not change because a dependency was missing. Making this an explicit error would be a better developer experience than silently buffering the change.

I'm not sure what the type signature for applyChanges should look like, though. We want it to either return a success status with an updated document, or an error status with a list of missing hashes. JavaScript exceptions are pretty bad so I don't really want to use an exception for the error status. Maybe we could make it return a kind of pseudo-tagged union type like this (in TypeScript types):

function applyChanges<T>(doc: Doc<T>, changes: Uint8Array[]): Success<T> | MissingDeps

interface Success<T> {
  status: 'success'
  doc: Doc<T>
}

interface MissingDeps {
  status: 'missingDeps'
  missingDeps: Hash[]
}
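
A caller would then narrow on the status tag, along these lines (a hypothetical usage sketch; requestFromPeers stands in for application code, and doc is assumed to be declared with let):

const result = Automerge.applyChanges(doc, changes)
if (result.status === 'success') {
  doc = result.doc
} else {
  // result.status is 'missingDeps': ask our peers for the changes with these hashes
  requestFromPeers(result.missingDeps)
}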

What do you think?

pvh commented 3 years ago

@ept a couple of minor questions/ideas

My focus right now has been on taking the example code in sync_test.js and trying to implement it in my little testbed application, and I think I'm getting close. You can see that implementation here: https://github.com/pvh/automerge-demo. Fair warning: it is very WIP. While I'm trying to only commit working versions, it sort of grows bloat and then reconciles it again, and I might have hand-edited my Automerge tree without pushing those changes publicly once or twice. If it doesn't work, that's absolutely my fault, and I'll aim to fix it before calling it a day here.

dan-weaver commented 3 years ago

I think that would be a good idea; we've previously had cases of people getting confused when they called applyChanges and the document did not change because a dependency was missing. Making this an explicit error would be a better developer experience than silently buffering the change.

This may be a quirk of our application, but this would currently create race conditions for us. We rely on being able to listen for changes immediately when a signal comes in, and can then rest assured that it's OK for our naive sync protocol to finish delivering the entire doc in its own time.

Possibly that won't be a legitimate problem once the sync protocol has landed, though?

ept commented 3 years ago

It feels weird that the messages don't have the type baked into the binary for the wire protocol. Should we include a leading byte / bit that determines the message type?

@pvh They do have a leading byte (0x42, chosen arbitrarily) to indicate the message type. Happy to pick a different number if you don't like that one.

Should we add a more generic {encode/decode}Message pair of functions? (It could use that first byte to decide what message type it has.)

Could do, although since we only have one message type at the moment, that seems a bit redundant.
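
If we did want one, it would presumably be little more than a dispatch on the leading type byte, along the lines of this sketch (assuming 0x42 remains the only defined type):

function decodeMessage(bytes: Uint8Array): SyncMessage {
  if (bytes[0] === 0x42) return decodeSyncMessage(bytes)
  throw new RangeError(`unknown message type: ${bytes[0]}`)
}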

We rely on being able to listen for changes immediately when a signal comes in, and can then rest assured that it's OK for our naive sync protocol to finish delivering the entire doc in its own time.

@dan-weaver You could solve this by buffering changes until all the dependencies are satisfied. Something like this:

let receivedChanges = [], doc = Automerge.init()

function onReceiveChanges(changes) {
  receivedChanges = receivedChanges.concat(changes)
  const result = Automerge.applyChanges(doc, receivedChanges)
  if (result.status === 'success') {
    doc = result.doc
    receivedChanges = []
  } else if (result.status === 'missingDeps') {
    // do nothing, try again on next call to onReceiveChanges()
  } else {
    throw new Error(`Unexpected result status: ${result.status}`)
  }
}

…which would give you the current behaviour.

pvh commented 3 years ago

treatment of change messages vs. sync messages feels inconsistent

  - asking developers to manually check getMissingDeps feels incorrect
  - waiting for a sync message to arrive before applying changes feels wrong too (particularly for single change messages) -> why shouldn't we apply all applicable changes and only publish the patch at a "suitable time"?
  - waiting to update sync state until full sync is complete feels incorrect -> if I am getting 10,000 changes from a peer, how do I know how many changes I'm waiting on for a progress bar?

Other areas: naming & types, network protocol, febe protocol, sync state, changes.

There are probably a few other smaller pieces but this covers a lot of it.

arj03 commented 3 years ago

This looks very interesting.

I was wondering if you have thought about extracting this functionality into a separate library so it can be reused in other projects? It seems that the API should be general enough.

Local changes can immediately be sent to all connected nodes, while changes received from another node should be exchanged through this protocol to avoid having a node unnecessarily receive the same changes several times from different nodes.

I was wondering if you have considered using a spanning-tree-like protocol such as plumtree, epidemic broadcast trees or Brisa for replicating changes among multiple nodes? See this for more information.

ept commented 3 years ago

Hi @arj03,

I was wondering if you have thought about extracting this functionality into a separate library so it can be reused in other projects? It seems that the API should be general enough.

Potentially, although it does make some assumptions that may be specific to Automerge, such as the API provided for traversing the hash graph. My inclination would be to keep this functionality as part of the Automerge core for now, and we can consider splitting it out later if there is demand.

I was wondering if you have considered using a spanning-tree-like protocol such as plumtree, epidemic broadcast trees or Brisa for replicating changes among multiple nodes?

I've looked at a few gossip / p2p overlay networks; our intention is that the API provided by Automerge core should be general enough to be used for many different types of network topology. There can then be several different networking layers (some using client-server WebSockets, some using p2p protocols, etc.) that build upon the same core API. Apps can then choose a networking layer that is appropriate for their use case.

arj03 commented 3 years ago

Yeah, I guess it makes sense to keep it in automerge, especially as you fine-tune the algorithm so that it fits the problem at hand as well as possible, instead of trying to solve many different problems all at once.

I mentioned the articles because they deal specifically with the timeout problem and with how to handle data from multiple peers. It is good to know you are aware of the existing literature in that space. What I'm trying to say is that it seems like a separate concern from the main algorithm; then again, in automerge it might make sense to keep those two aspects very close, because an end user doesn't really care about those details.

Anyway just wanted to provide some input, what you are doing here is really exciting!

lightsprint09 commented 3 years ago

Can we close this?

ept commented 3 years ago

@lightsprint09 Yes. Closing this because it is superseded by #339.