automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License
14.75k stars 466 forks source link

A few questions/thoughts on the design in the performance branch #414

Closed dvc94ch closed 1 year ago

dvc94ch commented 3 years ago
  1. the use of json for communicating between frontend and backend and binary encoding in the backend

I think this might make sense if you assume js is used as the frontend language, the json implementation in most browsers is highly optimized and written in c++. For every other language at least using something like cbor seems more sensible.

  1. use of hashes for encoding dependencies

that is 32 bytes of incompressible data per dependency. I think it is possible to build efficient replicated streams (hypercore or blake-streams [0] for example) which don't require content addressing. A tuple (actor_id/public_key, stream_id, offset) would also work. The public_key/stream_id can be indexed and the resulting (index, offset) tuple would be highly compressible requiring only a few bytes.

  1. is the actual crdt documented?

I'd really like to understand how it behaves and what kind of weirdness I can expect when designing an application using this crdt.

ept commented 3 years ago

Hi @dvc94ch, thanks for your questions.

the use of json for communicating between frontend and backend and binary encoding in the backend

For connecting the Rust backend to the Swift frontend we're already exploring using MessagePack instead of JSON. Experiments will show which encoding is the best one to use here.

use of hashes for encoding dependencies, that is 32 bytes of incompressible data per dependency

In the compressed document representation we do compress dependencies, using a simple scheme: we put all the changes in some order, number them sequentially using that order, and then express dependencies using a small integer indicating the position of the dependency in that list. When checking the hashes we recompute the hash of each change.

BLAKE3 streams are cool, but I'm not sure they really help for Automerge, since they are built on the assumption that all the input being hashed is in a linear order. In Automerge, we get branching and merging due to concurrency, so we don't have the same kind of linear input. You do get parts of the graph that are linear, and in principle BLAKE3 streaming could be used here, but it would be a lot of additional complexity and I suspect the gain would be limited.

Other features of BLAKE3 are appealing, such as its good SIMD support, but I'm not sure this is enough to justify changing the hash function.

is the actual crdt documented?

Not very well right now, sorry. But we are planning to write a paper about Automerge soon, which will discuss the internals in detail. In brief: list/text insertions are handled using RGA; every list element and every key in a map is a multi-value register with a default value picked through last-writer-wins; nested objects are represented by giving each object a unique ID, and the parent object contains a reference to the child object.

dvc94ch commented 3 years ago

You do get parts of the graph that are linear, and in principle BLAKE3 streaming could be used here, but it would be a lot of additional complexity and I suspect the gain would be limited.

I think that is missing the point. Each replica (or actor) has it's own stream which is identified by a public key and a numeric stream id. There are some downsides to this approach, which I hope can be mitigated or improved upon in the future [0]. The basic idea here is to let the networking layer take care of authentication instead of embedding uncompressible hashes. This means that if the transport layer screws up it's security properties the replicas will diverge.

dvc94ch commented 3 years ago

Anyway, started working on an experimental db using blake-streams and automerge. It should illustrate how it is intended to work, just need to figure out how to convince automerge to accept my changes. Currently fails with "missing object id". The relevant code is here [0].

ept commented 1 year ago

Closing this issue as it has been quiet for a long time; please re-open if there is still something to do here.