Repository API proposal

At the moment Automerge only provides an API for an in-memory data structure, and leaves all I/O (persistence on disk and network communication) as an "exercise for the reader". The sync protocol attempts to provide an API for network sync between two nodes (without assuming any particular transport protocol), but experience has shown that users find the sync protocol difficult to understand, and easy to misuse (e.g. knowing when to reset the sync state is quite subtle; sync with more than one peer is also error-prone).

I would like to propose a new API concept for Automerge, which we might call "repository" (or "database"?). It should have the following properties:

A repository is not tied to any particular storage or networking technology, but it should provide interfaces that make it easy to plug in storage (e.g. embedded DB like IndexedDB or SQLite, remote DB like Postgres or MongoDB, or local filesystem) and network transport (e.g. WebSocket, WebRTC, IPFS/libp2p, plain HTTP) libraries. These should be able to interoperate across platforms, so you can have e.g. an iOS app (using local filesystem storage) syncing its state with a server written in Python (which stores its state in Postgres). We should provide some integrations for commonly used protocols, but also allow customisation.
The basic data model of storage should read and write byte arrays corresponding to either changes or compressed documents. The repository should automatically determine when to take a log of changes and compact it into a document. The repository may also choose to create and maintain indexes for faster reads.
The basic model of networking is either a one-to-one connection-oriented link between two nodes running the sync protocol (e.g. WebSocket, WebRTC), or a best-effort multicast protocol (e.g. BroadcastChannel for tabs within the same browser, a pub-sub messaging system, or a gossip protocol). Multiple protocols may be used at the same time (e.g. gossip/pub-sub for low-latency updates, and one-to-one sync to catch up on updates that a node missed while it was offline).
A repository is a collection of many Automerge documents, and all the storage and networking adapters should be multi-document out of the box. The collection of docs in a repo might be too big to load into memory. A repository should be able to efficiently sync the entire collection without loading it into memory: only reading/writing the necessary docs and changes in the storage layer, and sending them over the network in the form they are stored, without instantiating all the in-memory CRDT data structures for those documents.
The application should be able to choose which documents in the collection to load into memory, and when to free them again. Even if we rely on APIs such as WeakRefs to free Wasm memory when it is no longer needed, it probably makes sense for the application to explicitly signal when to load and free a document: in JS, loading would be async (since e.g. IndexedDB APIs are async), and so any function that might need to load a document would also have to be async. If the app specifies when to load and free a document, then the API for accessing a loaded document can be sync.
There should be an API where applications can plug in their access control policy, so that for example, they sync the entire repository contents when talking to another device belonging to the same user, but only sync specifically shared documents when talking to another user. The policy also determines from which remote nodes changes should be accepted, and from which they are ignored.
The repository should also provide a way of communicating ephemeral state, such as user presence (who is currently online), cursor positions, and position updates during drag & drop (notifying users every time a dragged object is moved by a pixel, many times per second). This is information that does not need to be persisted in the CRDT, but it does need to be sent to users who are currently online, and since we're already managing the network communication in the repository, it makes sense to also include the channel for ephemeral updates here.
The repository also provides a good place to register observers (callback functions that are invoked when a document changes). The current observable API is strange because you register the observer on a document, which is an immutable object, and it doesn't really make sense to observe something that is immutable. It makes much more sense for the callback to be registered on the repository.
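
To make the access control property above a little more concrete, here is a minimal TypeScript sketch of what a pluggable policy might look like. Every name in it (`Peer`, `AccessPolicy`, `ownerOrSharedPolicy`, `docsToSync`) is hypothetical and not part of Automerge; it only illustrates the "sync everything with my own devices, sync only shared documents with other users" rule described above.

```typescript
// Hypothetical types for illustration only; none of this is Automerge API.
interface Peer {
  nodeId: string;
  userId: string;
}

interface AccessPolicy {
  // May this document be sent to the peer?
  shouldShare(peer: Peer, docId: string): boolean;
  // Should changes to this document coming from the peer be applied?
  shouldAccept(peer: Peer, docId: string): boolean;
}

// Policy: a peer owned by the local user sees everything; other users only
// see (and may only change) documents explicitly shared with them.
function ownerOrSharedPolicy(
  localUserId: string,
  sharedWith: Map<string, Set<string>>, // docId -> userIds it is shared with
): AccessPolicy {
  const allowed = (peer: Peer, docId: string): boolean => {
    if (peer.userId === localUserId) return true;
    const users = sharedWith.get(docId);
    return users !== undefined && users.has(peer.userId);
  };
  return { shouldShare: allowed, shouldAccept: allowed };
}

// The repository would consult the policy before syncing with a peer.
function docsToSync(policy: AccessPolicy, peer: Peer, docIds: string[]): string[] {
  return docIds.filter((docId) => policy.shouldShare(peer, docId));
}
```

In this sketch the same predicate is used for both directions; an application could equally return different functions for sharing and accepting.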

Some thoughts on what the repository API might look like:

When the app starts up, create a repository object and register the storage library you want to use. When any sort of network link is established, you also register it with the repository; when it disconnects, it automatically unregisters itself from the repository. The repository object is typically a singleton that exists for the lifetime of the app process.
To create a document, instead of calling Automerge.init() you would call repository.create(). The new document would automatically be given a unique docId.
To load an existing document, instead of calling Automerge.load() and passing in a byte array, you would call await repository.load(docId), which loads the document with the given docId from the registered storage library.
Instead of calling Automerge.change(doc, callback) to make a change, call repository.change(docId, callback), which automatically writes the new change to persistent storage and sends it via any network links that are registered on the repository. The callback can be identical to the current Automerge API.
A loaded document is an immutable object, just like in the current API. Updates to a document result in a new document object that structure-shares any unchanged parts with the previous object. This allows the document to play nicely with React. The current state of a loaded document is available through e.g. repository.get(docId).
Instead of handling incoming changes from another user in application code and calling Automerge.applyChanges() to update the document, the repository automatically receives incoming changes via its registered network links. The application should register an observer, e.g. using repository.observeChanges(callback), to re-render the UI whenever a document changes.
We remove the current sync protocol API, and fold its functionality into the repository and its network interface instead. We might keep the current getChanges/applyChanges API for potential advanced use cases that are not satisfied by the repository API, but the expectation would be that most app developers use the repository API.
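
As a rough illustration of the shape of this API, here is a minimal in-memory sketch in TypeScript. It is not Automerge: documents are plain frozen objects rather than CRDTs, there is no storage or networking, and `load` is left out because it would be async. The method names `create`, `change`, `get`, and `observeChanges` follow the ones floated above; the class layout, the `Doc` type, and the docId format are assumptions made for the example.

```typescript
// Stand-in document type: a plain, shallowly frozen object (not a CRDT).
type Doc = { readonly [key: string]: unknown };
type ChangeFn = (draft: { [key: string]: unknown }) => void;
type Observer = (docId: string, doc: Doc) => void;

class Repository {
  private docs = new Map<string, Doc>();
  private observers: Observer[] = [];
  private nextId = 0;

  // Creates an empty document and returns its newly assigned docId.
  create(): string {
    const docId = `doc-${this.nextId++}-${Math.random().toString(36).slice(2, 10)}`;
    const empty: Doc = {};
    this.docs.set(docId, Object.freeze(empty));
    return docId;
  }

  // Synchronous access to the current state of a loaded document.
  get(docId: string): Doc {
    const doc = this.docs.get(docId);
    if (doc === undefined) throw new Error(`document ${docId} is not loaded`);
    return doc;
  }

  // Applies the callback to a mutable copy and stores the result as a new
  // immutable object. A JSON copy is used for brevity, so unlike the real
  // proposal nothing is structure-shared with the previous version.
  change(docId: string, callback: ChangeFn): Doc {
    const draft: { [key: string]: unknown } = JSON.parse(JSON.stringify(this.get(docId)));
    callback(draft);
    const next: Doc = Object.freeze(draft);
    this.docs.set(docId, next);
    // A real repository would also persist the change and send it to peers here.
    for (const observer of this.observers) observer(docId, next);
    return next;
  }

  // Registers a callback for every document change; returns an unsubscribe function.
  observeChanges(callback: Observer): () => void {
    this.observers.push(callback);
    return () => {
      this.observers = this.observers.filter((o) => o !== callback);
    };
  }
}
```

Because `get` hands out a new object after every change, a UI layer can compare references to decide whether to re-render.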

Need to do some further thinking on what the APIs for storage and networking interfaces should look like.

One inspiration for this work is @localfirst/state, but I envisage the proposed repository API having a deeper integration between storage and sync protocol than is possible with the current Automerge API, in order to make sync of large document sets as efficient as possible.

Feedback on this high-level outline very welcome. If it seems broadly sensible, we can start designing the APIs in more detail.

As someone who doesn't spend much time writing JS I'm intrigued by the more protocol-ey things here. Here are a few thoughts that occur to me:

Currently the sync protocol doesn't support multiple documents. Would this proposal be looking to extend the protocol to cover multiple documents?
Making the repository a central part of the API seems like an opportunity to move the management of causal delivery out of the automerge document and into the repository. This would make the behaviour of applyChanges more predictable (right now it can be tricky to figure out which automerge change caused an observed state change) and would also make things more efficient in the case of changes which need to be delivered to multiple documents, but it would be a breaking change to the current API.
Would it be at all interesting to support cryptographic authn/authzn in the repository protocol? I am currently working on a system where every change is signed with respect to a PKI. If the storage model is just change or document byte arrays then I would have to separately ship signatures around, it would be nice if these things could be included in the protocol. Likewise, I'm interested in building systems which use something like Fission's UCAN to determine whether a change is authorized, in this case I would like to attach a proof that a change is authorized to the change itself. This seems like it is probably a case of just allowing arbitrary extra data to be sent in sync messages for each change?
Regarding access control - is there a mechanism in the sync protocol to say "don't send me this change"? If my access control policy disallows a particular change I would quite like to be able to say "don't send me any descendants of this change" so that I don't waste bandwidth on changes I can never apply.

Would this proposal be looking to extend the protocol to cover multiple documents?

Yes. In the simplest case, this could use essentially the current sync protocol, with each message tagged with the docId it refers to. In the common case where most docs are unchanged since the last sync, this would involve the peers exchanging heads hashes for each document. For large collections of docs, sending a hash per doc could be a bit inefficient; if we want to optimise that case, we could aggregate heads hashes of different documents in a Merkle tree. Specifically, I think a Merkle search tree would work well here.
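
To illustrate the idea of aggregating heads hashes, here is a minimal TypeScript sketch of a two-level hash tree with a fixed number of buckets. This is deliberately simpler than a Merkle search tree, and it uses a non-cryptographic FNV-1a hash purely so that the example is self-contained; a real protocol would want a cryptographic hash. All names here are hypothetical.

```typescript
// docId -> heads hashes of that document (hypothetical representation).
type HeadsByDoc = Map<string, string[]>;

// 32-bit FNV-1a, used only as a dependency-free stand-in for a real hash.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

const BUCKETS = 16;

// Documents are assigned to buckets by a hash of their docId.
function bucketOf(docId: string): number {
  return parseInt(fnv1a(docId), 16) % BUCKETS;
}

interface Summary {
  root: string; // hash over all bucket hashes
  buckets: string[]; // one hash per bucket
}

function summarise(heads: HeadsByDoc): Summary {
  const entries = new Map<number, string[]>();
  for (const [docId, docHeads] of heads) {
    const bucket = bucketOf(docId);
    const list = entries.get(bucket) ?? [];
    // Sort heads so the summary does not depend on their order.
    list.push(`${docId}:${docHeads.slice().sort().join(",")}`);
    entries.set(bucket, list);
  }
  const buckets: string[] = [];
  for (let i = 0; i < BUCKETS; i++) {
    buckets.push(fnv1a((entries.get(i) ?? []).sort().join("|")));
  }
  return { root: fnv1a(buckets.join("")), buckets };
}

// Returns the buckets for which peers still need to exchange per-document heads.
function differingBuckets(local: Summary, remote: Summary): number[] {
  if (local.root === remote.root) return []; // common case: nothing changed
  const result: number[] = [];
  local.buckets.forEach((hash, i) => {
    if (hash !== remote.buckets[i]) result.push(i);
  });
  return result;
}
```

In the common case where nothing has changed, the peers compare a single root hash; otherwise only the documents in the differing buckets need the per-document exchange.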

move the management of causal delivery out of the automerge document and into the repository

Maybe… I think it's nice that if you currently have an Automerge file containing a bunch of changes, you can just load it and it will load correctly even if there are duplicates and changes out of causal order (so you can concatenate files without having to think about it too hard), so I think it still makes sense to have causal ordering facilities at the document level. But we could say that the queueing of changes that are not yet causally ready could happen at the repository level.
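
A minimal sketch of what repository-level queueing might look like, assuming each change carries its own hash and the hashes of its dependencies. The `Change` shape and the `CausalQueue` class are hypothetical and are not Automerge's actual types. Changes whose dependencies have not all been seen are held back, and duplicates are dropped.

```typescript
// Hypothetical change shape: a hash plus the hashes it causally depends on.
interface Change {
  hash: string;
  deps: string[];
}

class CausalQueue {
  private applied = new Set<string>();
  private pending = new Map<string, Change>();
  private deliver: (change: Change) => void;

  // `deliver` is called once per change, in an order that respects deps.
  constructor(deliver: (change: Change) => void) {
    this.deliver = deliver;
  }

  receive(change: Change): void {
    // Ignore duplicates, whether already applied or still waiting.
    if (this.applied.has(change.hash) || this.pending.has(change.hash)) return;
    this.pending.set(change.hash, change);
    this.flush();
  }

  // Number of changes still waiting for a missing dependency.
  get pendingCount(): number {
    return this.pending.size;
  }

  // Repeatedly deliver every pending change whose deps are all applied.
  private flush(): void {
    let progressed = true;
    while (progressed) {
      progressed = false;
      for (const change of Array.from(this.pending.values())) {
        if (change.deps.every((dep) => this.applied.has(dep))) {
          this.pending.delete(change.hash);
          this.applied.add(change.hash);
          this.deliver(change);
          progressed = true;
        }
      }
    }
  }
}
```

The document-level loader could keep its own tolerance for duplicates and out-of-order changes; this queue would only decide when a change is handed to a document.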

Would it be at all interesting to support cryptographic authn/authzn in the repository protocol?

That would be nice; we'd want to do it in some way that doesn't assume one particular scheme, but gives apps the freedom to extend the protocol with whatever scheme they need. Maybe this could be done with a kind of "middleware" API that can add arbitrary additional data to each protocol message, similarly to what some web frameworks do?
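
One way such a "middleware" hook might look is sketched below in TypeScript. The message shape, the `Middleware` interface, and `MiddlewareChain` are all hypothetical, and the "signature" is a toy tag with no cryptographic value; it only shows where an app-defined scheme could attach extra data to outgoing messages and reject incoming ones.

```typescript
// Hypothetical protocol message with a slot for app-defined extra data.
interface SyncMessage {
  docId: string;
  payload: Uint8Array;
  extensions: { [name: string]: string };
}

interface Middleware {
  outgoing?(msg: SyncMessage): SyncMessage;
  incoming?(msg: SyncMessage): SyncMessage | null; // null means "reject"
}

class MiddlewareChain {
  private middlewares: Middleware[] = [];

  use(middleware: Middleware): this {
    this.middlewares.push(middleware);
    return this;
  }

  // Runs every outgoing hook before the message is put on the wire.
  send(msg: SyncMessage): SyncMessage {
    let current = msg;
    for (const mw of this.middlewares) {
      if (mw.outgoing) current = mw.outgoing(current);
    }
    return current;
  }

  // Runs every incoming hook; stops as soon as one rejects the message.
  receive(msg: SyncMessage): SyncMessage | null {
    let current: SyncMessage | null = msg;
    for (const mw of this.middlewares) {
      if (current === null) break;
      if (mw.incoming) current = mw.incoming(current);
    }
    return current;
  }
}

// Toy example only: NOT real cryptography. A real app would plug in its own
// signature or capability scheme here.
function toySigner(secret: string): Middleware {
  const tag = (msg: SyncMessage): string => `${secret}:${msg.docId}:${msg.payload.length}`;
  return {
    outgoing: (msg) => ({ ...msg, extensions: { ...msg.extensions, sig: tag(msg) } }),
    incoming: (msg) => (msg.extensions["sig"] === tag(msg) ? msg : null),
  };
}
```

The repository itself would not need to understand the contents of `extensions`; it would only carry them alongside each message.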

is there a mechanism in the sync protocol to say "don't send me this change"?

Not currently, but that seems like a reasonable extension to consider.

Other ideas that have come up:

As part of the multi-doc sync protocol it might also be useful to include support for syncing arbitrary binary blobs. That way, you could have e.g. a rich text editor that supports embedded images, and the image data could be synced over the same network link as the CRDT data.
We have long talked about wanting to support Git-style branching and merging workflows, and the repository seems like a good place for managing those branches. One option would be to make each branch a separate docId, but ensure those branches share the underlying storage so that we don't needlessly duplicate stuff. Better might be to make branches explicit in the API, so to get the latest version of a doc you could call repository.get(docId, branchId), with a default branchId for single-branch docs. The repository can then have methods for creating, merging, and deleting branches, and the sync protocol needs to be extended to handle branches.
Deleting documents in the repository also needs to be supported, and the sync protocol needs to be extended to propagate such deletions.

Something that I've been working on in my own project is migrations. That has been historically challenging and I wonder what opportunities thinking about this abstraction can provide us with to address this issue. Some of the most challenging aspects are not storage/sync related, but it's possible that migrations that cross document boundaries might want to be at the level of "repository".

You're right, @scotttrinh, we're certainly going to have to deal with the problem at some point but I think it's quite important to maintain some good separation of concerns and not just have this class take on either all the features or all the scope of the various pieces we're missing. Our next research project is going to be Cambria-adjacent, I think, but we haven't nailed down the scope yet so I'm not sure where it will lead.

I believe the first step is to get a simple multi-document class implemented which can connect to a single storage engine and to a single network. That should give us a good starting point to expand from. I want to be a bit cautious about pre-designing the class too much because my experience using the storage adapters in the dat ecosystem was that they were closely tied to a particular kind of storage engine and poorly suited to others.

I plan to put together something rudimentary over the next few days (time allowing) for some initial feedback. We'll need it for our next project anyway.

The new document would automatically be given a unique docId.

How would that play out with RDBMS that generate their IDs? Sure, it should be possible to add an extra column and index it, but that sounds a bit wasteful as unique IDs are guaranteed.

How would that play out with RDBMS that generate their IDs?

In systems where you have a single authoritative DB server it might make sense to let that server generate the IDs, but in general, in a decentralised system it needs to be possible for clients to generate their own IDs without depending on a particular server for ID assignment.

Different systems will definitely want to assign names differently -- for example, IPFS has content addresses and hypercores use a signing public key. In the sketch @HerbCaudill and I put together over the weekend we let you provide an ID-generator as an argument to the Repo API, but it may just be better to have users provide IDs at document creation time. (I think that's something we'll want to feel out.)
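
The two options mentioned here can coexist in one signature. The following TypeScript sketch is purely illustrative: `SketchRepo`, `IdGenerator`, and `randomHexId` are hypothetical names, not the API of the prototype, and `Math.random` is used only to keep the example dependency-free.

```typescript
type IdGenerator = () => string;

// Default generator: 32 random hex characters. Math.random keeps the sketch
// dependency-free; a real implementation would want a cryptographic source.
const randomHexId: IdGenerator = () => {
  let id = "";
  for (let i = 0; i < 32; i++) id += Math.floor(Math.random() * 16).toString(16);
  return id;
};

// Hypothetical repo that only tracks which docIds exist.
class SketchRepo {
  private docIds = new Set<string>();
  private generateId: IdGenerator;

  // Option 1: an ID generator is supplied when the repo is constructed.
  constructor(generateId: IdGenerator = randomHexId) {
    this.generateId = generateId;
  }

  // Option 2: the caller supplies the ID at document creation time.
  // If no ID is given, the repo falls back to its generator.
  create(docId: string = this.generateId()): string {
    if (this.docIds.has(docId)) throw new Error(`docId already in use: ${docId}`);
    this.docIds.add(docId);
    return docId;
  }
}
```

A content-addressed or public-key-based system could pass its own generator, while a system with a single authoritative database could pass server-assigned IDs to `create` directly.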

Heya @ept! We met at HYTRADBOI and you pointed me towards this issue about the repository API. I work at https://github.com/athensresearch/athens, and formerly I was at https://roamresearch.com/. Athens works as an optimistically updated shared document, whose operations are partially CRDT-like. We talked about how Athens might use CRDTs instead of its custom data structures.

I've been thinking a lot about this since HYTRADBOI, and especially this repository proposal. A lot of what's described here comprises the core complexity in our product and it would be great to move that out.

But it also strikes me as interesting that the document sync in Athens is in fact a partial implementation of this repository API, albeit not over CRDTs.

The basic data model of storage should read and write byte arrays corresponding to either changes or compressed documents. The repository should automatically determine when to take a log of changes and compact it into a document. The repository may also choose to create and maintain indexes for faster reads.

Athens synchronizes append-only event logs. This is somewhat straightforward because of the append-only nature of them - you basically always want to go to the tip. Events are forwarded to clients to bring them up to date. But when clients start, they get a snapshot / materialized view of the document so they don't have to load all the events.

CRDTs themselves function differently, and offer a richer change model than an event log. But for the purpose of storing, loading, and syncing changes for a single CRDT instance, I think it is fundamentally the same as an append-only event log:

changes arrive in a certain order
the (not necessarily ordered?) set of all changes determines the CRDT state
loading all the changes in an event log should restore that CRDT to the saved state
snapshots can be used to speed up the loading process

In fact, I think that is what the matrix-crdt provider for Yjs does. The core difference being that matrix-crdt is an event log first, and only a CRDT second, meaning that all changes go to the event log and only then to the CRDT clients.

I think this isn't great for communicating between CRDT instances, since CRDTs provide a richer sync model than event logs. But it seems good for a repository model, where the goal is to store and restore a given document, and then leave cross-node sync for the document to do.

@filipesilva Yes, Automerge changes essentially form an event log. However, there are some important optimisations, because we allow every single keystroke to be a separate event (for real-time collaboration), which means that you can accumulate hundreds of thousands of events over the history of a single document. Storing each event individually would mean the history quickly grows into the megabytes.

Automerge puts a lot of effort into compressing that history so that it can be stored and transmitted over the network efficiently. When you do Automerge.save(), it actually contains the full event log, but for typical text editing patterns it takes less than 1 byte per keystroke. We're planning to also add an incremental compression step so that you can take a block of events (a section of the history) and store them in a single compressed blob.

It would be possible to use an append-only storage and networking model, but it wouldn't be able to take advantage of this compression. With this repository API we're trying to set things up so that apps can easily take advantage of Automerge's compression.
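
As a rough illustration of how a storage layer could fold a block of changes into a single blob, here is a minimal TypeScript sketch. All names (`StorageAdapter`, `MemoryStorage`, `CompactingStore`, the `compactAfter` threshold) are hypothetical. The `concatCompact` function is a stand-in that just concatenates bytes, where a real implementation would produce a compressed Automerge document. The interface is synchronous only to keep the sketch short; an adapter for IndexedDB or Postgres would be async.

```typescript
// Hypothetical storage interface: everything stored is an opaque byte array.
interface StorageAdapter {
  appendChange(docId: string, change: Uint8Array): void;
  loadChanges(docId: string): Uint8Array[];
  saveDocument(docId: string, doc: Uint8Array): void; // also clears the change log
  loadDocument(docId: string): Uint8Array | undefined;
}

// In-memory adapter standing in for IndexedDB, SQLite, the filesystem, etc.
class MemoryStorage implements StorageAdapter {
  private changes = new Map<string, Uint8Array[]>();
  private documents = new Map<string, Uint8Array>();

  appendChange(docId: string, change: Uint8Array): void {
    const log = this.changes.get(docId) ?? [];
    log.push(change);
    this.changes.set(docId, log);
  }
  loadChanges(docId: string): Uint8Array[] {
    return this.changes.get(docId) ?? [];
  }
  saveDocument(docId: string, doc: Uint8Array): void {
    this.documents.set(docId, doc);
    this.changes.delete(docId); // the log is now folded into the document
  }
  loadDocument(docId: string): Uint8Array | undefined {
    return this.documents.get(docId);
  }
}

type Compact = (doc: Uint8Array | undefined, changes: Uint8Array[]) => Uint8Array;

// Stand-in for real compression: concatenates the old blob and the new changes.
const concatCompact: Compact = (doc, changes) => {
  const parts = doc ? [doc, ...changes] : changes;
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
};

// Writes changes to the log and compacts once the log reaches a threshold.
class CompactingStore {
  private storage: StorageAdapter;
  private compact: Compact;
  private compactAfter: number;

  constructor(storage: StorageAdapter, compact: Compact, compactAfter = 3) {
    this.storage = storage;
    this.compact = compact;
    this.compactAfter = compactAfter;
  }

  writeChange(docId: string, change: Uint8Array): void {
    this.storage.appendChange(docId, change);
    const log = this.storage.loadChanges(docId);
    if (log.length >= this.compactAfter) {
      const doc = this.compact(this.storage.loadDocument(docId), log);
      this.storage.saveDocument(docId, doc);
    }
  }
}
```

The threshold here is a simple change count; a real repository might instead decide based on byte size or elapsed time.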

Okay, just a few notes from my ongoing work on this. The first big change from how this is proposed is the relationship between the network and the repository and the repository and the sync engine. In my initial prototype these were all coupled together as described above.

Unfortunately, when assessing how to integrate such an object into existing applications (or proposed ones we discussed) it became clear that having the network and synchronizer embedded entirely inside the Repo makes it tricky to extend the system or make varying decisions about how these systems should interact.

My new prototype decouples these systems. The Repo is now a relatively small object that allows listeners to be notified when documents are created or loaded and returns handles to the documents it tracks so that interested parties can track their changes. In addition to the Repo, there are also Networking, Synchronization, and Storage subsystems that can interact with the repo in different ways depending how you want to put your application together. (I will include one or two packagings of these ideas to make them easy to consume.)
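
To give a feel for this kind of decoupling, here is a small TypeScript sketch. It is not the prototype's actual code or API: `MiniRepo`, `DocHandle`, `onDocument`, and `attachStorage` are hypothetical names, and documents are plain values rather than Automerge documents. The point is only that a subsystem such as storage can live outside the repo and attach itself through listeners and handles.

```typescript
type Listener<T> = (value: T) => void;

// A handle lets interested parties read a document and follow its changes.
class DocHandle<T> {
  readonly docId: string;
  private value: T;
  private listeners: Listener<T>[] = [];

  constructor(docId: string, initial: T) {
    this.docId = docId;
    this.value = initial;
  }
  get(): T {
    return this.value;
  }
  onChange(listener: Listener<T>): void {
    this.listeners.push(listener);
  }
  change(update: (current: T) => T): void {
    this.value = update(this.value);
    for (const listener of this.listeners) listener(this.value);
  }
}

// The repo only tracks handles and announces new documents.
class MiniRepo<T> {
  private handles = new Map<string, DocHandle<T>>();
  private listeners: Listener<DocHandle<T>>[] = [];

  onDocument(listener: Listener<DocHandle<T>>): void {
    this.listeners.push(listener);
  }
  create(docId: string, initial: T): DocHandle<T> {
    const handle = new DocHandle(docId, initial);
    this.handles.set(docId, handle);
    for (const listener of this.listeners) listener(handle);
    return handle;
  }
  find(docId: string): DocHandle<T> | undefined {
    return this.handles.get(docId);
  }
}

// A "storage subsystem" that lives outside the repo and only uses its events.
function attachStorage<T>(repo: MiniRepo<T>, saved: Map<string, T>): void {
  repo.onDocument((handle) => {
    saved.set(handle.docId, handle.get());
    handle.onChange((value) => {
      saved.set(handle.docId, value);
    });
  });
}
```

A networking or synchronization subsystem could attach in the same way, which is what makes it possible to package the pieces differently for different applications.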

Finding these APIs and picking a comfortable idiomatic JS style is an ongoing process and I'm not entirely happy with where I'm at right now, but if you're interested in following along I'm occasionally pushing my work-in-progress implementation here. Feedback is welcome but I have a pretty clear vision of what needs doing at the moment.

This is an interesting proposal. We’ve been working on an application that uses automerge as one of the core data structures, so I thought some perspective from our experience might be helpful. We implemented some of the same functionality for our app. For example, we abuse Redux as a repository and it manages applying changes to in memory automerge objects, somewhat like the repository.change() proposal. I don't yet have specific recommendations, but I wanted to bring up a few more details that I think may help inform this proposal.

Even if some of these details can be abstracted away by the API, it may be useful for backend implementers to take note. To start with a concrete scenario: suppose the frontend boots up by asking the backend “show me 50 of my most recent items”. How does that query, or the results of that query, interact with this repository API?

For context, our application is a record management tool. Think of it as sitting between a spreadsheet and a complicated case management system. (I’ll try to remember to link to it here when we do our soft release in a few weeks). We want to support realtime collaboration as well as offline use, but the more common usage is expected to be online, in a browser, with the server helping to provide query/search functionality as well as access controls like most webapps (the server is like a big trusted client, so all those features are designed to work locally offline as well, though that’s not fully implemented yet).

Here are some thoughts I have given our experience.

Most applications are probably working with lists of items.
- Those lists are produced by search infrastructure or database queries. That functionality needs to store and index a “rendered” view of an automerge object, even if they’re also storing the actual serialized automerge data.
- Depending on the use cases it may be more efficient to ship those serialized automerge documents along with the query results, as opposed to querying for them separately.
- Since the backend needs to work with “rendered” data, it’s likely that the frontend just needs to display that data directly without needing to deserialize the automerge documents. In this case it’s more efficient to hold off deserializing until the user actually wants to edit the object.
- If the backend is applying user changes for its own rendering purposes and then updating the query results in real time, then it still might be better for the backend to just ship the new query results (ideally diffing in some way), rather than the frontend listening to changes to all the query results and applying them itself. (again saving on the need to deserialize the objects)
- The backend could index new objects that change query results. The frontend doesn’t know about those objects yet and so can’t listen to those changes anyway.
Most “documents” are actually smallish and won’t have a ton of edits
- Automerge is a really useful data structure. I suspect that most UX will be better served by many smaller objects rather than one large object. This might even end up being true in applications that look like classic word processors.
- N.B. This is a good reason why some abstraction for undo-history across objects is more important than undo-history within a single object.
- If most documents are small then it’s plausible that the “best” way to sync is to just transfer the whole document and merge it locally rather than bothering with diffs.
- BUT, in a live keystroke-for-keystroke multiplayer session just listening to a log of change objects makes the most sense.
- Because they’re small, it might make sense for the serialized objects to just transfer along with query results.
- To support all the functionality the backend may store the “rendered” object alongside the serialized object. (e.g. in the same database row)
In a live editing environment it’s not uncommon for every keystroke to be sent to the other clients individually in real time.
- The metadata around transferring and storing the change is going to be larger than the keystroke and doesn’t benefit from all the data format optimizations that happen inside automerge.
- In practice you probably need to store the change somewhere at least for a little bit to account for the time between the backend receiving the change, applying the change, serializing the object, and writing it to durable storage.
- Storing those change objects probably needs to carry the same indexing overhead as full documents (e.g. only certain users are probably authorized to listen to certain changes, or a user might query for a certain set of changes).
- If a user’s changes are being sent to a backend you probably want to implement the optimization of the user not getting their own changes back again from the backend, or at the very least not applying their own incoming changes again. (unless they’re reloading all the data again due to an app refresh)
- So, when looking up a document a new client will actually need to load the latest serialized data plus any changes that haven’t been applied yet.
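
The last two points can be sketched in TypeScript as follows. This is an illustration under simplifying assumptions, not a description of any real system: the "document" is just a list of lines standing in for a serialized Automerge document, and `LoggedChange`, `Snapshot`, `loadLatest`, and `IncomingFilter` are hypothetical names.

```typescript
// Hypothetical change record kept until it is folded into a snapshot.
interface LoggedChange {
  hash: string;
  text: string;
}

// Stand-in for a serialized document plus the hashes it already contains.
interface Snapshot {
  includedHashes: string[];
  lines: string[];
}

// A new client loads the latest snapshot plus any changes not yet applied to it.
function loadLatest(snapshot: Snapshot, pending: LoggedChange[]): Snapshot {
  const included = new Set(snapshot.includedHashes);
  const lines = snapshot.lines.slice();
  for (const change of pending) {
    if (included.has(change.hash)) continue; // already folded into the snapshot
    included.add(change.hash);
    lines.push(change.text);
  }
  return { includedHashes: Array.from(included), lines };
}

// Tracks which changes this client has already applied, so that echoes of its
// own changes (and other duplicates) coming back from the backend are skipped.
// After an app refresh the filter starts empty, so everything is applied again.
class IncomingFilter {
  private seen = new Set<string>();

  recordLocal(change: LoggedChange): void {
    this.seen.add(change.hash);
  }

  shouldApply(change: LoggedChange): boolean {
    if (this.seen.has(change.hash)) return false;
    this.seen.add(change.hash);
    return true;
  }
}
```

Deduplicating by change hash rather than by author means the same filter also covers changes that arrive twice over different routes.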

I can go into more detail about data structures and whatnot that we settled on, but this is already long enough. Hopefully the above summary is helpful.

Thanks for the comments, Rob. I think I agree with most of this though I want to be cautious not to set the expectation that we'll Solve All The Things with this one patch. A lot of this perspective is pretty high-level, and I'd be curious to hear in a more holistic sense what your current biggest experienced pain points are.

Hello, what is the status of this? I saw this repository https://github.com/pvh/automerge-repo and it's fairly active. How usable is it and would it stay as a separate package to automerge or is the idea to merge it in as a part of automerge?

The repo will be a separate repo but will likely move to the Automerge namespace. Initial adoption is welcome, but expect bugs and breaking changes as the system settles.

At this point I suspect what's there is significantly easier to work with than implementing things yourself. The documentation is pretty thin, but the automerge-repo-react-demo should be pretty easy to follow.

No npm packages yet but soon.

Feel free to hit me up on Slack for a chat if you have questions. I'm going to be on Central US time (GMT-5) for a week or so.

automerge / automerge-classic

Repository API proposal #486