automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

Repository API proposal #486

Open ept opened 2 years ago

ept commented 2 years ago

At the moment Automerge only provides an API for an in-memory data structure, and leaves all I/O (persistence on disk and network communication) as an "exercise for the reader". The sync protocol attempts to provide an API for network sync between two nodes (without assuming any particular transport protocol), but experience has shown that users find the sync protocol difficult to understand, and easy to misuse (e.g. knowing when to reset the sync state is quite subtle; sync with more than one peer is also error-prone).

I would like to propose a new API concept for Automerge, which we might call "repository" (or "database"?). It should have the following properties:

Some thoughts on what the repository API might look like:

Need to do some further thinking on what the APIs for storage and networking interfaces should look like.

One inspiration for this work is @localfirst/state, but I envisage the proposed repository API having a deeper integration between storage and sync protocol than is possible with the current Automerge API, in order to make sync of large document sets as efficient as possible.

Feedback on this high-level outline very welcome. If it seems broadly sensible, we can start designing the APIs in more detail.

alexjg commented 2 years ago

As someone who doesn't spend much time writing JS, I'm intrigued by the more protocol-ey things here. Here are a few thoughts that occur to me:

ept commented 2 years ago

Would this proposal be looking to extend the protocol to cover multiple documents?

Yes. In the simplest case, this could use essentially the current sync protocol, with each message tagged with the docId it refers to. In the common case where most docs are unchanged since the last sync, this would involve the peers exchanging heads hashes for each document. For large collections of docs, sending a hash per doc could be a bit inefficient; if we want to optimise that case, we could aggregate heads hashes of different documents in a Merkle tree. Specifically, I think a Merkle search tree would work well here.
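To make the "exchange heads hashes for each document" step concrete, here is a minimal sketch of how two peers might work out which documents actually need a sync round. Every name in it (`DocId`, `Heads`, `TaggedSyncMessage`, `docsNeedingSync`) is hypothetical and not part of the Automerge API; it compares heads directly and leaves out the Merkle tree optimisation.

```typescript
// Hypothetical types: none of these names come from Automerge itself.
type DocId = string;
type Heads = string[]; // hashes of the latest changes in one document

// A sync message tagged with the document it refers to.
interface TaggedSyncMessage {
  docId: DocId;
  payload: Uint8Array;
}

// Order-insensitive fingerprint of a set of heads.
function headsKey(heads: Heads): string {
  return heads.slice().sort().join(",");
}

// Return the docIds whose heads differ between the two peers (or that only
// one peer knows about). Only these need per-document sync messages.
function docsNeedingSync(
  local: Map<DocId, Heads>,
  remote: Map<DocId, Heads>
): DocId[] {
  const ids = new Set<DocId>();
  local.forEach((_heads, id) => ids.add(id));
  remote.forEach((_heads, id) => ids.add(id));

  const out: DocId[] = [];
  ids.forEach((id) => {
    const l = local.get(id);
    const r = remote.get(id);
    if (l === undefined || r === undefined || headsKey(l) !== headsKey(r)) {
      out.push(id);
    }
  });
  return out.sort();
}
```

With one hash per document this costs a message entry per doc even when nothing changed, which is the inefficiency the Merkle tree idea would address.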

move the management of causal delivery out of the automerge document and into the repository

Maybe… I think it's nice that if you currently have an Automerge file containing a bunch of changes, you can just load it and it will load correctly even if there are duplicates and changes out of causal order (so you can concatenate files without having to think about it too hard), so I think it still makes sense to have causal ordering facilities at the document level. But we could say that the queueing of changes that are not yet causally ready could happen at the repository level.
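As a rough illustration of what repository-level queueing could look like, the sketch below holds back changes until all of their dependencies have been applied, and silently drops duplicates. The `Change` shape and the `CausalQueue` class are assumptions made for this example, not Automerge types.

```typescript
// Hypothetical change shape: a hash plus the hashes it causally depends on.
interface Change {
  hash: string;
  deps: string[];
}

class CausalQueue {
  private applied = new Set<string>();
  private pending = new Map<string, Change>();
  private apply: (change: Change) => void;

  constructor(apply: (change: Change) => void) {
    this.apply = apply;
  }

  // Accept a change in any order; duplicates are ignored.
  receive(change: Change): void {
    if (this.applied.has(change.hash) || this.pending.has(change.hash)) return;
    this.pending.set(change.hash, change);
    this.flush();
  }

  // Number of changes still waiting for a missing dependency.
  get pendingCount(): number {
    return this.pending.size;
  }

  // Repeatedly apply every pending change whose deps are all applied.
  private flush(): void {
    let progressed = true;
    while (progressed) {
      progressed = false;
      for (const change of Array.from(this.pending.values())) {
        if (change.deps.every((dep) => this.applied.has(dep))) {
          this.pending.delete(change.hash);
          this.applied.add(change.hash);
          this.apply(change);
          progressed = true;
        }
      }
    }
  }
}
```

The document-level loader can keep its own tolerance for duplicates and out-of-order changes; a queue like this would only decide when a change is handed to the document.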

Would it be at all interesting to support cryptographic authn/authzn in the repository protocol?

That would be nice; we'd want to do it in some way that doesn't assume one particular scheme, but gives apps the freedom to extend the protocol with whatever scheme they need. Maybe this could be done with a kind of "middleware" API that can add arbitrary additional data to each protocol message, similarly to what some web frameworks do?
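One possible shape for such a "middleware" API is sketched below, purely as an illustration. `ProtocolMessage`, `Middleware`, and the shared-token example are all hypothetical; the token comparison stands in for whatever real authentication scheme an app would plug in and is not meant as a secure design.

```typescript
// Hypothetical message shape with a free-form slot for middleware data.
interface ProtocolMessage {
  docId: string;
  payload: Uint8Array;
  extra: Record<string, string>;
}

// A middleware may annotate outgoing messages and veto incoming ones.
interface Middleware {
  outgoing(msg: ProtocolMessage): ProtocolMessage;
  incoming(msg: ProtocolMessage): ProtocolMessage | null; // null = reject
}

// Toy example: attach a shared token and reject messages without it.
function sharedTokenMiddleware(token: string): Middleware {
  return {
    outgoing: (msg) => ({ ...msg, extra: { ...msg.extra, token } }),
    incoming: (msg) => (msg.extra["token"] === token ? msg : null),
  };
}

// Pass an outgoing message through every middleware in order.
function runOutgoing(chain: Middleware[], msg: ProtocolMessage): ProtocolMessage {
  return chain.reduce((m, mw) => mw.outgoing(m), msg);
}

// Pass an incoming message through the chain; stop at the first rejection.
function runIncoming(chain: Middleware[], msg: ProtocolMessage): ProtocolMessage | null {
  let current: ProtocolMessage | null = msg;
  for (const mw of chain) {
    if (current === null) return null;
    current = mw.incoming(current);
  }
  return current;
}
```

The core protocol would never need to interpret `extra`; it would only carry it alongside each message.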

is there a mechanism in the sync protocol to say "don't send me this change"?

Not currently, but that seems like a reasonable extension to consider.

ept commented 2 years ago

Other ideas that have come up:

scotttrinh commented 2 years ago

Something that I've been working on in my own project is migrations. That has been historically challenging and I wonder what opportunities thinking about this abstraction can provide us with to address this issue. Some of the most challenging aspects are not storage/sync related, but it's possible that migrations that cross document boundaries might want to be at the level of "repository".

pvh commented 2 years ago

You're right, @scotttrinh: we're certainly going to have to deal with that problem at some point. But I think it's quite important to maintain a good separation of concerns, and not to have this class take on all the features or all the scope of the various pieces we're missing. Our next research project is going to be Cambria-adjacent, I think, but we haven't nailed down the scope yet, so I'm not sure where it will lead.

I believe the first step is to get a simple multi-document class implemented which can connect to a single storage engine and to a single network. That should give us a good starting point to expand from. I want to be a bit cautious about pre-designing the class too much because my experience using the storage adapters in the dat ecosystem was that they were closely tied to a particular kind of storage engine and poorly suited to others.

I plan to put together something rudimentary over the next few days (time allowing) for some initial feedback. We'll need it for our next project anyway.

LiraNuna commented 2 years ago

The new document would automatically be given a unique docId.

How would that play out with RDBMS that generate their IDs? Sure, it should be possible to add an extra column and index it, but that sounds a bit wasteful as unique IDs are guaranteed.

ept commented 2 years ago

How would that play out with RDBMS that generate their IDs?

In systems where you have a single authoritative DB server it might make sense to let that server generate the IDs, but in general, in a decentralised system it needs to be possible for clients to generate their own IDs without depending on a particular server for ID assignment.

pvh commented 2 years ago

Different systems will definitely want to assign names differently -- for example, IPFS has content addresses and hypercores use a signing public key. In the sketch @HerbCaudill and I put together over the weekend we let you provide an ID-generator as an argument to the Repo API, but it may just be better to have users provide IDs at document creation time. (I think that's something we'll want to feel out.)
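The two options can coexist, as the following sketch tries to show: the constructor takes an optional ID generator, and `create` also accepts an explicit ID. `IdAssigningRepo`, `IdGenerator`, and `randomHexId` are made-up names for illustration and do not reflect the actual prototype; the default generator uses `Math.random`, which is fine for a demo but not a substitute for a proper UUID or key-derived ID.

```typescript
// Hypothetical: a function that produces a fresh document ID.
type IdGenerator = () => string;

// Demo-only random ID (32 hex characters by default); not cryptographically strong.
function randomHexId(bytes: number = 16): string {
  let out = "";
  for (let i = 0; i < bytes; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

class IdAssigningRepo {
  private docs = new Map<string, object>();
  private generateId: IdGenerator;

  // Option 1: the app supplies an ID generator once, up front.
  constructor(generateId: IdGenerator = () => randomHexId()) {
    this.generateId = generateId;
  }

  // Option 2: the caller supplies an ID at creation time; otherwise
  // fall back to the generator.
  create(explicitId?: string): string {
    const id = explicitId ?? this.generateId();
    if (this.docs.has(id)) {
      throw new Error(`docId already exists: ${id}`);
    }
    this.docs.set(id, {});
    return id;
  }

  has(id: string): boolean {
    return this.docs.has(id);
  }
}
```

Either way, ID assignment happens on the client with no round trip to a server, which is the property a decentralised setup needs.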

filipesilva commented 2 years ago

Heya @ept! We met at HYTRADBOI and you pointed me towards this issue about the repository API. I work at https://github.com/athensresearch/athens, and formerly I was at https://roamresearch.com/. Athens works as an optimistically updated shared document, whose operations are partially CRDT-like. We talked about how Athens might use CRDTs instead of its custom data structures.

I've been thinking a lot about this since HYTRADBOI, and especially this repository proposal. A lot of what's described here comprises the core complexity in our product and it would be great to move that out.

But it also strikes me as interesting that the document sync in Athens is in fact a partial implementation of this repository API, albeit not over CRDTs.

The basic data model of storage should read and write byte arrays corresponding to either changes or compressed documents. The repository should automatically determine when to take a log of changes and compact it into a document. The repository may also choose to create and maintain indexes for faster reads.

Athens synchronizes append-only event logs. This is somewhat straightforward because of the append-only nature of them - you basically always want to go to the tip. Events are forwarded to clients to bring them up to date. But when clients start, they get a snapshot / materialized view of the document so they don't have to load all the events.
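The snapshot-plus-log pattern, and the quoted idea of a storage layer that decides when to compact, might look roughly like this. It is a sketch under assumptions: `SketchStorage`, `StoredDoc`, and the threshold are invented here, and `concatBytes` is only a placeholder compactor standing in for real Automerge compression.

```typescript
type Bytes = Uint8Array;

// What is persisted per document: an optional compacted snapshot plus the
// changes appended since that snapshot was taken.
interface StoredDoc {
  snapshot: Bytes | null;
  changes: Bytes[];
}

// Placeholder compactor: plain concatenation, NOT real compression.
function concatBytes(snapshot: Bytes | null, changes: Bytes[]): Bytes {
  const parts = snapshot === null ? changes : [snapshot, ...changes];
  const total = parts.reduce((n, p) => n + p.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}

class SketchStorage {
  private docs = new Map<string, StoredDoc>();
  private compact: (snapshot: Bytes | null, changes: Bytes[]) => Bytes;
  private threshold: number;

  constructor(
    compact: (snapshot: Bytes | null, changes: Bytes[]) => Bytes = concatBytes,
    threshold: number = 100
  ) {
    this.compact = compact;
    this.threshold = threshold;
  }

  // Append one change; fold the log into the snapshot once it gets long.
  appendChange(docId: string, change: Bytes): void {
    const doc: StoredDoc = this.docs.get(docId) ?? { snapshot: null, changes: [] };
    doc.changes.push(change);
    if (doc.changes.length >= this.threshold) {
      doc.snapshot = this.compact(doc.snapshot, doc.changes);
      doc.changes = [];
    }
    this.docs.set(docId, doc);
  }

  // A starting client reads the snapshot, then replays the short tail.
  load(docId: string): StoredDoc {
    const doc = this.docs.get(docId);
    return doc
      ? { snapshot: doc.snapshot, changes: doc.changes.slice() }
      : { snapshot: null, changes: [] };
  }
}
```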

CRDTs themselves function differently, and offer a richer change model than an event log. But for the purpose of storing, loading, and syncing changes for a single CRDT instance, I think it is fundamentally the same as an append-only event log:

In fact, I think that is what the matrix-crdt provider for Yjs does. The core difference being that matrix-crdt is an event log first, and only a CRDT second, meaning that all changes go to the event log and only then to the CRDT clients.

I think this isn't great for communicating between CRDT instances, since CRDTs provide a richer sync model than event logs. But it seems good for a repository model, where the goal is to store and restore a given document, and then leave cross-node sync for the document to do.

ept commented 2 years ago

@filipesilva Yes, Automerge changes essentially form an event log. However, there are some important optimisations, because we allow every single keystroke to be a separate event (for real-time collaboration), which means that you can accumulate hundreds of thousands of events over the history of a single document. Storing each event individually would mean the history quickly grows into the megabytes.

Automerge puts a lot of effort into compressing that history so that it can be stored and transmitted over the network efficiently. When you do Automerge.save(), the result actually contains the full event log, but for typical text editing patterns it takes less than 1 byte per keystroke. We're planning to also add an incremental compression step so that you can take a block of events (a section of the history) and store them in a single compressed blob.

It would be possible to use an append-only storage and networking model, but it wouldn't be able to take advantage of this compression. With this repository API we're trying to set things up so that apps can easily take advantage of Automerge's compression.

pvh commented 2 years ago

Okay, just a few notes from my ongoing work on this. The first big change from how this is proposed is the relationship between the network and the repository and the repository and the sync engine. In my initial prototype these were all coupled together as described above.

Unfortunately, when assessing how to integrate such an object into existing applications (or proposed ones we discussed) it became clear that having the network and synchronizer embedded entirely inside the Repo makes it tricky to extend the system or make varying decisions about how these systems should interact.

My new prototype decouples these systems. The Repo is now a relatively small object that allows listeners to be notified when documents are created or loaded, and returns handles to the documents it tracks so that interested parties can track their changes. In addition to the Repo, there are also Networking, Synchronization, and Storage subsystems that can interact with the repo in different ways depending on how you want to put your application together. (I will include one or two packagings of these ideas to make them easy to consume.)
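For readers trying to picture the decoupled design, here is a minimal sketch of a small repo that only tracks documents, hands out handles, and notifies listeners. The names `MiniRepo`, `DocHandle`, `onDocument`, and `onChange` are guesses made for illustration and are not taken from the work-in-progress implementation; documents are plain objects rather than Automerge docs.

```typescript
// Stand-in for an Automerge document in this sketch.
type Doc = Record<string, unknown>;

// A handle lets interested parties read a document and follow its changes.
class DocHandle {
  readonly docId: string;
  private doc: Doc;
  private changeListeners: Array<(doc: Doc) => void> = [];

  constructor(docId: string, initial: Doc) {
    this.docId = docId;
    this.doc = initial;
  }

  value(): Doc {
    return this.doc;
  }

  onChange(listener: (doc: Doc) => void): void {
    this.changeListeners.push(listener);
  }

  // Replace the document with an updated copy and notify listeners.
  change(update: (doc: Doc) => Doc): void {
    this.doc = update(this.doc);
    for (const listener of this.changeListeners) listener(this.doc);
  }
}

// The repo knows nothing about networking, sync, or storage; those
// subsystems attach themselves through onDocument and onChange.
class MiniRepo {
  private handles = new Map<string, DocHandle>();
  private documentListeners: Array<(handle: DocHandle) => void> = [];

  onDocument(listener: (handle: DocHandle) => void): void {
    this.documentListeners.push(listener);
  }

  create(docId: string, initial: Doc = {}): DocHandle {
    const handle = new DocHandle(docId, initial);
    this.handles.set(docId, handle);
    for (const listener of this.documentListeners) listener(handle);
    return handle;
  }

  find(docId: string): DocHandle | undefined {
    return this.handles.get(docId);
  }
}
```

A storage or synchronization subsystem would register with `onDocument`, then subscribe to each handle's `onChange`, so different applications can wire the pieces together differently.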

Finding these APIs and picking a comfortable idiomatic JS style is an ongoing process and I'm not entirely happy with where I'm at right now, but if you're interested in following along I'm occasionally pushing my work-in-progress implementation here. Feedback is welcome, but I have a pretty clear vision of what needs doing at the moment.

rongoro commented 2 years ago

This is an interesting proposal. We’ve been working on an application that uses Automerge as one of the core data structures, so I thought some perspective from our experience might be helpful. We implemented some of the same functionality for our app. For example, we abuse Redux as a repository, and it manages applying changes to in-memory Automerge objects, somewhat like the repository.change() proposal. I don't yet have specific recommendations, but I wanted to bring up a few more details that I think may help inform this proposal.

Even if some of these details can be abstracted away by the API, it may be useful for backend implementers to take note. To start with a concrete scenario: suppose the frontend boots up by asking the backend “show me 50 of my most recent items”. How does that query, or the results of that query, interact with this repository API?

For context, our application is a record management tool. Think of it as sitting between a spreadsheet and a complicated case management system. (I’ll try to remember to link to it here when we do our soft release in a few weeks.) We want to support realtime collaboration as well as offline use, but the more common usage is expected to be online, in a browser, with the server helping to provide query/search functionality as well as access controls, like most webapps. The server is like a big trusted client, so all those features are designed to work locally offline as well, though that’s not fully implemented yet.

Here are some thoughts I have given our experience.

I can go into more detail about data structures and whatnot that we settled on, but this is already long enough. Hopefully the above summary is helpful.

pvh commented 2 years ago

Thanks for the comments, Rob. I think I agree with most of this though I want to be cautious not to set the expectation that we'll Solve All The Things with this one patch. A lot of this perspective is pretty high-level, and I'd be curious to hear in a more holistic sense what your current biggest experienced pain points are.

LiraNuna commented 1 year ago

Hello, what is the status of this? I saw this repository https://github.com/pvh/automerge-repo and it's fairly active. How usable is it and would it stay as a separate package to automerge or is the idea to merge it in as a part of automerge?

pvh commented 1 year ago

The repo will be a separate repo but will likely move to the Automerge namespace. Initial adoption is welcome, but expect bugs and breaking changes as the system settles.

At this point I suspect what's there is significantly easier to work with than implementing things yourself. The documentation is pretty thin, but the automerge-repo-react-demo should be pretty easy to follow.

No npm packages yet but soon.

Feel free to hit me up on Slack for a chat if you have questions. I'm going to be on Central US time (GMT-5) for a week or so.

P

