automerge / automerge-repo

MIT License

On Peers in Automerge-Repo: Design #337

Open pvh opened 2 months ago

pvh commented 2 months ago

As Automerge Repo becomes more mature, we're starting to feel some pain around some underlying confusion in the set of primitives we're exposing. I'm going to start by describing what I think are the actual primitives and then get into the problems.

Repo: A repo is an instance of a class that has storage and network adapters and stuff loaded into memory. Importantly, there can be many Repos sharing a single storage location. This happens if you have many browser tabs sharing an IndexedDB, or multiple servers sharing a backend storage pool like a Redis or S3 instance.

Storage: A store (reached via a StorageAdapter) contains a bunch of data, and is identified by a storage ID. (This last part is new in the last few releases.)

Document & Heads: A document is a collection of interdependent changes, and the heads are the equivalent of a git commit hash. The difference between a commit hash and a heads array is that after concurrent edits there can be multiple values. We can treat a set of heads as a precise specifier of a point in time.
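Concretely, two heads arrays name the same point in time when they contain the same hashes, regardless of order. A minimal sketch of that comparison (the `Heads` type and `headsEqual` helper here are illustrative, not existing automerge-repo API):

```typescript
// A heads array: the hashes of the current "tip" changes of a document.
// After concurrent edits there can be more than one hash, unlike a git commit.
type Heads = string[]

// Two heads arrays denote the same point in time iff they contain the
// same set of hashes, so compare them order-insensitively.
function headsEqual(a: Heads, b: Heads): boolean {
  if (a.length !== b.length) return false
  const sortedA = [...a].sort()
  const sortedB = [...b].sort()
  return sortedA.every((hash, i) => hash === sortedB[i])
}
```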

Peer: A peer is another Repo running somewhere on the internet and identified by a peerId. Many peers may share the same storage, and so we key the saved SyncStates stored by Automerge-Repo's sync protocol against the underlying storage ID. When syncing with multiple peers on the same storage we still send each one sync messages containing all the changes (and they may later write them wastefully multiple times to storage) because storage is passive: it does not notify its users of changes.

SyncState: A sync state maps a storageId to metadata about the last sync we had with it for a particular document. When a Peer begins to sync a particular Document within a Repo, we first check in our local storage to see if we already have a SyncState for them: if we do, we may be able to skip a lot of work!
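To make the keying concrete, here is a sketch of persisted sync metadata keyed by (storageId, documentId). The shapes and class names are hypothetical; automerge-repo's real SyncState carries more than this:

```typescript
type StorageId = string
type DocumentId = string

// Illustrative record: the last heads we believe the remote store had.
interface SyncStateRecord {
  lastSyncedHeads: string[]
}

class SyncStateStore {
  private records = new Map<string, SyncStateRecord>()

  private key(storageId: StorageId, documentId: DocumentId): string {
    return `${storageId}/${documentId}`
  }

  // On reconnect, a cached record lets us resume where we left off
  // instead of doing a full resync.
  load(storageId: StorageId, documentId: DocumentId): SyncStateRecord | undefined {
    return this.records.get(this.key(storageId, documentId))
  }

  save(storageId: StorageId, documentId: DocumentId, record: SyncStateRecord): void {
    this.records.set(this.key(storageId, documentId), record)
  }
}
```

Keying by storageId rather than peerId is what lets a fresh connection to a different peer on the same store skip work.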

RemoteHeads: Introduced to support the SyncIndicator in Tiny-Essay-Editor, this is an API that allows forwarding the current heads for a storageId both to the front-end of the application (to be able to ask "am I in sync with this particular storage location"?) as well as to allow forwarding of this information between intermediaries. We can ask "does anyone know if Peter has downloaded that data yet?" or "have the changes I wrote locally reached a remote storage server that I'm not directly connected to?" This is extremely useful but the API we have is not very ergonomic.
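A rough model of what RemoteHeads tracks, as a sketch: per storage ID, the latest heads we have heard about for each document, so the front-end can ask "is that store caught up with me?" (all names here are illustrative, not the actual API):

```typescript
type StorageId = string
type DocumentId = string
type Heads = string[]

class RemoteHeadsTracker {
  private heads = new Map<StorageId, Map<DocumentId, Heads>>()

  // Record heads reported for a storage ID, possibly relayed by an intermediary.
  update(storageId: StorageId, documentId: DocumentId, heads: Heads): void {
    if (!this.heads.has(storageId)) this.heads.set(storageId, new Map())
    this.heads.get(storageId)!.set(documentId, heads)
  }

  // "Am I in sync with this particular storage location?"
  isCaughtUp(storageId: StorageId, documentId: DocumentId, localHeads: Heads): boolean {
    const remote = this.heads.get(storageId)?.get(documentId)
    if (remote === undefined) return false
    const canonical = (h: Heads) => [...h].sort().join(",")
    return canonical(remote) === canonical(localHeads)
  }
}
```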

Okay: that's the lay of the land. What's the problem? Why am I writing all this?

I think RemoteHeads has exposed a pretty fundamental shortcoming of the current architecture: Peers aren't represented consistently anywhere in the system (there is no Peer class, for example).

There are some interesting technical questions here: how should we store/represent peers we're not currently connected with (or never have been)? If a stateless front-end process wants to know if its edits are stored on the sync server, how can it conveniently tell? This also ties into the question of when it's safe to shut down a one-off CLI process.

One sketch might look like this:

const handle = repo.find(docUrl);

if (handle.peers.length === 0) {
  console.log("o solo mio")
}

if (handle.peers.every(p => p.syncState === SYNCED)) {
  console.log("All's well that ends well.")
}

The problem here is that this design doesn't really account for known stores that aren't currently online (connected via one or more peers) or that you aren't directly connected to. We have a more complicated API for that which I don't really want to reiterate here.

So the question is: how to square this circle? Short of making storage "active" (which would rule out the use of quite a few useful storage systems) we probably need to continue synchronizing with multiple peers sharing a storage pool.

Another approach might be to lean more heavily on the idea of the storageId being the "real" peer, and what I currently think of as "peers" being more like "sessions".
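Sketching that framing as data: the durable identity is the store, and each live network connection is just a session onto it. A store we know about but have zero sessions to is exactly the "known but offline" case the handle-based API above can't express. Every name here is hypothetical:

```typescript
type StorageId = string
type PeerId = string

// A live network connection to some repo fronting a store.
interface Session {
  peerId: PeerId
  connectedAt: number
}

// The durable identity: a storage location, plus whatever sessions
// currently reach it. Zero sessions = known store, currently offline.
interface KnownStore {
  storageId: StorageId
  sessions: Session[]
  lastKnownHeads?: string[]
}

function isOnline(store: KnownStore): boolean {
  return store.sessions.length > 0
}
```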

More thought is yet required, but I wanted to get these notes down in case anyone else was already looking into this. I think my feeling is that we could clean up these APIs as a small release after 2.2 goes out the door.

heckj commented 2 months ago

re: storage being passive - I was thinking of trying to shim in something (when I'm further along on the implementation) to allow the storage to not be passive. In particular, I can watch for and get notifications of filesystem changes, which I could use as a "something changed" signal (for example, if someone updated the file on Dropbox and it sync'd across).
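One way to shim that in: a storage wrapper that emits a change event when an external watcher (fs.watch, a cloud-sync hook) notices the backing store changed underneath the repo. A minimal sketch, with a hypothetical interface rather than the real automerge-repo StorageAdapter:

```typescript
import { EventEmitter } from "node:events"

// A storage shim that is *not* passive: it can announce external changes.
class WatchableStorage extends EventEmitter {
  private data = new Map<string, Uint8Array>()

  save(key: string, value: Uint8Array): void {
    this.data.set(key, value)
  }

  load(key: string): Uint8Array | undefined {
    return this.data.get(key)
  }

  // Called by an external watcher (fs.watch, Dropbox sync hook, ...) when
  // the backing store was mutated by someone other than this repo.
  notifyExternalChange(key: string): void {
    this.emit("change", key)
  }
}
```

A repo could then subscribe with `storage.on("change", key => ...)` and reload the affected document instead of waiting for a sync message.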

I haven't fully implemented the remote-heads bits in my work, but the hooks are there. I honestly hadn't been sure what the purpose for them was or how they were used; it was on my reading list to sit down and trace the JavaScript code to find out, so you short-circuited that by quite a bit by putting this together (thank you! - kind of like finding a SyncState in local storage).

re: Peers, right now I'm not storing or tracking anything beyond what IS connected right now, so no historical tracking of peers. That said, I have expanded on what a peer is for the peer-to-peer networking pieces that I'm in the midst of writing: in addition to a peerId, I've added a human-friendly name that can be included with it. I'm sticking to UUIDs for the peer IDs, although I accept any string. I'd like to have some confidence that they have real uniqueness properties, though; I'm just "hoping" right now, and it would be an annoying bug if they didn't.

Along those lines, I'd really like to see us firm up any constraints/expectations on PeerId in terms of uniqueness guarantees, if not specific format.

I'm also applying "peer" to the owner of a repo, since a repo can have multiple network adapters, to handle the scenario where you might be connected to a peer over two different networks, and I want to avoid the duplication of tracking. So I'm somewhere between "peer as owner" and "peer as session", which probably isn't a great place to land.

The only issues I saw with perhaps using the storage Id as a "true peer" are:

1) you don't HAVE to have a storage adapter, and at that point there's no storage Id, which throws a monkey wrench into a lot of this - what's the fallback there? Or do we say we always have a storage Id, even if it's ephemeral/in-memory only, like a repo with no storage provider?

2) there's no clear communication channel upward for a storage system today to say "I just got this update, please apply it to your in-memory representations". Other than an in-memory test storage provider, I haven't written any providers for the swift/Apple native platform pieces yet, where I might tackle the signal back up from storage to the repo, letting it know there's been an external change.
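On point 1, the "always have a storage Id, even if ephemeral" option could be as simple as minting a throwaway ID at repo startup when no storage adapter supplied one. A sketch with a hypothetical helper, not automerge-repo API:

```typescript
import { randomUUID } from "node:crypto"

// Fallback for repos with no storage adapter: mint an ephemeral storage ID
// so the rest of the system can still key sync metadata by storageId.
// Ephemeral IDs would never be persisted, so peers shouldn't cache
// SyncStates against them across restarts.
function resolveStorageId(persisted?: string): { id: string; ephemeral: boolean } {
  if (persisted !== undefined) return { id: persisted, ephemeral: false }
  return { id: randomUUID(), ephemeral: true }
}
```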

HerbCaudill commented 2 months ago

If I was starting from scratch this is how I'd maybe set it up: