automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

Efficient encoding of a batch of changes #309

Open · Gozala opened this issue 3 years ago

Gozala commented 3 years ago

My current understanding is that the new Automerge provides an Automerge.save(doc) API that performs maximum compression, producing bytes that encode the full history of the document and can be shared with collaborators for merging.

Another alternative is to share individual changes with collaborators so they can apply them one by one.
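
For concreteness, here is a minimal sketch of those two existing options (assuming the automerge 1.x binary API; the document and field names are made up):

```js
const Automerge = require('automerge')

let doc = Automerge.init()
doc = Automerge.change(doc, d => { d.cards = [] })
doc = Automerge.change(doc, d => { d.cards.push({ title: 'hello' }) })

// Option 1: ship the whole compressed document (full history, one blob).
const snapshot = Automerge.save(doc)          // Uint8Array
const replicaA = Automerge.load(snapshot)

// Option 2: ship individual change blobs and apply them one by one.
const changes = Automerge.getAllChanges(doc)  // one Uint8Array per change
// 1.x returns [newDoc, patch]; older 0.x releases returned the document directly.
const [replicaB] = Automerge.applyChanges(Automerge.init(), changes)
```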

I think it would be great to substitute the "individual change" blobs with a batched "individual changes" blob because:

ept commented 3 years ago

Hi @Gozala, thank you for this suggestion. It's already the case that a single Automerge change bundles together any number of individual updates — whatever is made within the scope of a single call to Automerge.change. So I guess what you want is an additional level of batching that allows the results of several changes to be combined into a single message, which is more compact than the individual changes concatenated.
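
As a small illustration of that point (a sketch against the 1.x API; the field names are made up): all of the updates made inside one Automerge.change callback end up in a single change blob.

```js
const Automerge = require('automerge')

let doc = Automerge.init()
doc = Automerge.change(doc, 'set up board', d => {
  d.title = 'retrospective'       // several individual updates...
  d.columns = ['good', 'bad']
  d.cards = []
})

// ...but only one change was produced for the whole callback.
console.log(Automerge.getAllChanges(doc).length)  // 1
```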

The question is then what should happen with the hash graph in which each change references the hashes of its predecessors. If we batch together several changes, should those changes retain their original hashes, or should the batched changeset have a new hash? My opinion is that it would be better to retain the original hashes, so that two nodes agree on what the latest change hashes are, regardless of whether they synced through individual changes or through batched changes. (For example, git packfiles batch together several commits without changing their commit hashes. Also, the current compressed document format in Automerge has this property: it allows us to reconstruct the original change history, byte-for-byte identical to the original, so that the original hash graph can be recomputed.)
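
To make the hash graph concrete, here is a sketch that inspects it; it assumes Automerge.decodeChange is available at the top level of the 1.x binary release (it may live elsewhere depending on the version):

```js
const Automerge = require('automerge')

let doc = Automerge.init()
doc = Automerge.change(doc, d => { d.count = 1 })
doc = Automerge.change(doc, d => { d.count = 2 })

for (const bin of Automerge.getAllChanges(doc)) {
  const { hash, deps } = Automerge.decodeChange(bin)
  // The second change's deps contain the first change's hash.
  console.log(hash, '<-', deps)
}
```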

Assuming that we want the original change hashes to be preserved by batching, this implies that we cannot simply change the current "individual change" format to a plural "changes" format, because that would break the hashes. Instead, we would need to introduce a third data format that batches together multiple changes (in addition to the individual change and the whole compressed document). This batched format would need to preserve enough information to reconstruct its original constituent changes, allowing the hashes to be recomputed and checked.

At the moment, nodes that want to sync up have two options: either send a log of individual changes, or send the entire compressed document (which can be merged by the recipient). Sending the compressed document is preferable as soon as the log of individual changes becomes larger than the compressed document. A third data format, that batches a set of changes in a way that is more compact than the log of individual changes, is only worthwhile if it significantly reduces the bandwidth or CPU cost compared to the two existing options.
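
That decision rule can be sketched in user space; pickSyncPayload below is a hypothetical helper, not part of the Automerge API:

```js
const Automerge = require('automerge')

function pickSyncPayload(doc, lastSyncedDoc) {
  const changes = Automerge.getChanges(lastSyncedDoc, doc)  // binary changes since the last sync
  const changesSize = changes.reduce((n, c) => n + c.byteLength, 0)
  const snapshot = Automerge.save(doc)
  // Prefer the change log while it is smaller than the compressed document.
  return changesSize <= snapshot.byteLength
    ? { kind: 'changes', payload: changes }
    : { kind: 'document', payload: snapshot }
}
```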

My focus so far has been on making the whole compressed document as compact as possible, so that the option of sending the whole document is a reasonable one. Nevertheless, there may be workloads in which both existing options are too expensive, and in that case it would make sense to have a third format that batches a set of changes. However, before going to the effort of implementing such a format, I would want to see experimental data showing on what sort of workloads this format would be beneficial, and by how much.

I will leave this issue open because I think it's an interesting one. If you have a workload on which the existing options are bad, and in which batching changes would help, I'd love to hear further details.

Gozala commented 3 years ago

Thanks @ept for the thoughtful and detailed explanation. I now understand what you meant by a third format. I do have a follow-up question, though, regarding this segment:

The question is then what should happen with the hash graph in which each change references the hashes of its predecessors. If we batch together several changes, should those changes retain their original hashes, or should the batched changeset have a new hash?

Could those hashes be derived rather than encoded into the batch? What I was imagining (very likely due to lack of insight) is that individual changes could be packed, then unpacked & hashed locally to arrive at the same hashes that would otherwise be referred to. I was assuming that is more or less what document compression does.

Either way it clearly introduces a layer of complexity, so gathering evidence supporting the cause is indeed the most reasonable approach here.

Nevertheless, there may be workloads in which both existing options are too expensive, and in that case it would make sense to have a third format that batches a set of changes.

My desire was driven less by optimizing a specific use case and more by having a clear mental model that fits ledger-style update distribution. For example, what I was hoping to do is use the IPFS network for replication. In that case the application would create a DAG in which each head node contains links to:

a. Batch of changes since the last revision
b. Snapshot of the current state
c. Previous head

Then IPNS (which is like a git branch, except that the name is a public key, and you can point it at any content hash given the private key) could be used to remap the name to a new head when one is published. The idea here is that collaborating nodes:

  1. If online, receive pubsub messages when a new head is published (or pull manually when they come back online)
  2. Look at the head and decide to either:
    1. Fetch the new snapshot
    2. Fetch the new changes (which may imply multiple round-trips)
  3. Then apply either the snapshot or the batched changes to the local replica

In such a design, publishing each change individually seems like a bad idea (though maybe it is not), and something like a batch of changes seems like a better fit.
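
Roughly, the head record I have in mind looks like the sketch below (plain object only; the IPFS/IPNS plumbing is omitted, the field names are hypothetical, and in the real DAG a/b/c would be links to separate blocks rather than inlined bytes):

```js
const Automerge = require('automerge')

function makeHead(doc, lastPublishedDoc, previousHeadCid) {
  return {
    // a. batch of binary changes since the last published revision
    changes: Automerge.getChanges(lastPublishedDoc, doc),
    // b. compressed snapshot of the current state
    snapshot: Automerge.save(doc),
    // c. link to the previous head (a CID in the real DAG)
    previous: previousHeadCid,
  }
}

// A collaborator that resolves the new head can then pick: small gap ->
// apply head.changes; large gap or no local replica -> load head.snapshot.
```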

Which is what led me to think that Automerge.diff(a, b) might be a better API choice: even if it does not do any clever compaction, it forces a library user to think about when to calculate the delta (on each change, or less frequently), as opposed to having to recognize that in some cases you'd be better off distributing changes in batches.
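
In user space, something close to that delta already exists: Automerge.getChanges(a, b) between the last published state and the current state, taken at whatever frequency the application chooses. A sketch (lastPublished is application state, not an Automerge concept):

```js
const Automerge = require('automerge')

let lastPublished = Automerge.init()
let doc = lastPublished
doc = Automerge.change(doc, d => { d.items = ['a'] })
doc = Automerge.change(doc, d => { d.items.push('b') })

// Compute the delta at publish time rather than on every change:
const delta = Automerge.getChanges(lastPublished, doc)  // array of binary changes
// ...publish `delta`, then remember the published state...
lastPublished = doc
```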

With all that said, I think doing user-space experiments first is a very reasonable approach to take here.

ept commented 3 years ago

Could those hashes be derived rather than encoded into the batch?

Yes, and the compressed document format does exactly what you suggest: it doesn't store hashes of each individual change, but when decompressing it reconstructs the original changes and recomputes their hashes. Storing hashes would totally destroy the compression: in the text editing benchmark I've been using, the document compresses to about 1 byte per change. If we also stored a 32-byte hash for each change, that would immediately make the document 33 times larger!
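
A quick way to see this property (a sketch; as far as I know Automerge.getHeads is the 1.x call for the current head hashes): the heads survive a save/load round trip unchanged, even though no hashes are stored in the compressed document.

```js
const Automerge = require('automerge')

let doc = Automerge.init()
doc = Automerge.change(doc, d => { d.text = 'hello' })
doc = Automerge.change(doc, d => { d.text = 'hello world' })

const before = Automerge.getHeads(doc)
const after  = Automerge.getHeads(Automerge.load(Automerge.save(doc)))
// The head hashes are recomputed from the reconstructed changes, not stored.
console.log(JSON.stringify(before) === JSON.stringify(after))  // true
```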

My desire was driven less by optimizing a specific use case and more by having a clear mental model that fits ledger-style update distribution.

That makes sense. I think the current model fits what you want pretty well: you store a snapshot (using whole-document compression) plus a log of individual changes that have happened since that snapshot. Every time the size of the log of changes exceeds the size of the snapshot, you fold the log into the snapshot and produce a new compressed snapshot. This gives you a nice and simple append-only-log model for incremental updates, and also an efficient way for clients to catch up if they have been disconnected for a while.
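
A sketch of that policy (the store shape and helper names are made up; the Automerge calls are the real 1.x API as far as I know):

```js
const Automerge = require('automerge')

function newStore(doc) {
  // Hypothetical store: one compressed snapshot plus a log of binary changes.
  return { snapshot: Automerge.save(doc), log: [] }
}

function appendChange(store, doc, changeBytes) {
  // `doc` already includes the change; `changeBytes` could come from
  // Automerge.getLastLocalChange(doc) or from a remote peer.
  store.log.push(changeBytes)
  const logSize = store.log.reduce((n, c) => n + c.byteLength, 0)
  if (logSize > store.snapshot.byteLength) {
    // Fold the log into a fresh snapshot and start a new, empty log.
    store.snapshot = Automerge.save(doc)
    store.log = []
  }
}
```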

You can use Automerge.save() to get the snapshot, and Automerge.getChanges() to get the log of individual changes (all returned as byte arrays). If writing to a file, you can simply concatenate these things; the data format contains length tags that allow it to be cleanly parsed after concatenation. Concatenation is therefore the simplest way of batching individual changes. Of course it will be less compact than a dedicated format for a batch of updates, but I think it would be a viable way of getting started.
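
A sketch of the file layout this describes (assuming, as stated above, that the concatenated chunks parse cleanly; persist/restore are hypothetical helpers):

```js
const Automerge = require('automerge')
const fs = require('fs')

function persist(path, doc, snapshotDoc) {
  // Snapshot bytes followed by the raw change bytes made since that snapshot.
  const parts = [Automerge.save(snapshotDoc), ...Automerge.getChanges(snapshotDoc, doc)]
  fs.writeFileSync(path, Buffer.concat(parts.map(p => Buffer.from(p))))
}

function restore(path) {
  // Relies on the length-tagged chunk format to split the concatenation apart.
  return Automerge.load(new Uint8Array(fs.readFileSync(path)))
}
```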