Channel concept, granularity of documents and dependency tracking

ept commented 7 years ago

We have previously discussed a few things we'd like to be able to do with Automerge documents:

Forking documents (like forking in github/branching in git), with the choice of whether or not to remain subscribed to upstream documents
Allowing a forked document to be merged back into its upstream document
Tentatively applying a bunch of changes to a local copy of a document, discarding them again if the result is not as desired, or publishing the merged result if the result is good
Template documents that set up some basic scaffolding required for an application, and concrete documents that "inherit" from the template
A mechanism for consuming feeds and merging them into a single document — think RSS feeds, or feeds of currency exchange rates, or suchlike

I had a call with @pvh to discuss how best to implement these concepts in Automerge/MPL. The following are some notes on what we discussed.

As a first step, hash-chaining to encode the dependency graph (#28, like parent commit hashes in git) seems like a good idea. To find all changes that have gone into a document, start with one or more HEAD commit hashes, and traverse the dependency graph. When two communicating nodes have made concurrent changes, they won't know about each other's heads, so they'll need to run a multi-round protocol to figure out their latest common ancestor hash, much like git (#27).
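To make the hash-chaining idea concrete, here is a minimal sketch in Python. It is not Automerge's actual change format: `change_hash`, `ChangeStore` and the JSON encoding are all assumptions made up for illustration. Each change is identified by a hash over its payload and the hashes of its dependencies, and the history of a document is whatever is reachable from its heads.

```python
import hashlib
import json


def change_hash(deps, payload):
    """Hash a change together with the hashes of the changes it depends on."""
    body = json.dumps({"deps": sorted(deps), "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ChangeStore:
    """Hypothetical in-memory store of hash-chained changes."""

    def __init__(self):
        self.changes = {}  # hash -> (tuple of dependency hashes, payload)

    def add(self, deps, payload):
        for dep in deps:
            if dep not in self.changes:
                raise KeyError(f"unknown dependency {dep}")
        h = change_hash(deps, payload)
        self.changes[h] = (tuple(sorted(deps)), payload)
        return h

    def history(self, heads):
        """Return every change hash reachable from the given heads."""
        seen = set()
        stack = list(heads)
        while stack:
            h = stack.pop()
            if h in seen:
                continue
            seen.add(h)
            stack.extend(self.changes[h][0])
        return seen


store = ChangeStore()
root = store.add([], "create document")
a = store.add([root], "edit on node A")   # concurrent with b
b = store.add([root], "edit on node B")   # concurrent with a
merged = store.add([a, b], "merge")

# Changes that a peer whose only head is `a` is still missing:
missing = store.history([merged]) - store.history([a])
```

With the whole graph in one place, the shared history of two heads is just the intersection of their `history` sets. The multi-round protocol mentioned above is needed precisely because neither node has the other's half of the graph, and this sketch does not attempt to model that.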

This leaves the question of how you find out about changes to a document. Our proposal is to separate it into two concepts:

A channel is a network abstraction for pub/sub. A change is published to a channel, and a node can subscribe to any number of channels. A channel has a unique identifier, e.g. a UUID. Channels are probably not visible to the end user, but only an internal abstraction.
A document is a set of channels, and it incorporates all changes that appear in any of its channels. A document may exist only on one node, and is not necessarily shared with other nodes. To share a document, its set of channels should perhaps be written to a filesystem CRDT?
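A rough sketch of how these two concepts might fit together, in Python. The class names, the callback-based subscription and the in-memory change lists are assumptions for illustration only; a real channel would be a network abstraction rather than a local object.

```python
import uuid


class Channel:
    """Hypothetical pub/sub channel identified by a UUID."""

    def __init__(self):
        self.id = str(uuid.uuid4())
        self.changes = []       # everything published so far
        self.subscribers = []   # callbacks taking (channel_id, change)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, change):
        self.changes.append(change)
        for callback in list(self.subscribers):
            callback(self.id, change)


class Document:
    """A document is a set of channels and incorporates all of their changes."""

    def __init__(self):
        self.channels = {}   # channel id -> Channel
        self.changes = []    # changes in the order this node received them

    def add_channel(self, channel):
        if channel.id in self.channels:
            return
        self.channels[channel.id] = channel
        self.changes.extend(channel.changes)   # catch up on earlier changes
        channel.subscribe(self._on_change)

    def _on_change(self, channel_id, change):
        if channel_id in self.channels:
            self.changes.append(change)
```

For example, a document that has been given two channels sees changes published to either of them, but nothing published to a third channel it does not include.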

The features outlined above can all be implemented using those two concepts:

Forking a document means publishing any future changes to that document to a new channel. The document can remain subscribed to the upstream channels (continuing to incorporate upstream changes), or unsubscribe from the upstream channels.
Merging a forked document back upstream would mean adding a change with a dependency on the fork to the upstream channel. (I guess that implies that a change's dependencies do not have to be in the same channel, but could be in any channel?)
Tentatively applying changes could mean adding the experimental channel to the set of channels for a document; as long as no change depending on the experimental channel is published to the document's other channels, it can be reverted by simply removing the experimental channel again.
Template documents: if document B inherits from template A, then B includes A's channels as well as its own.
Likewise, merging feeds into a single document is easily achieved by having each feed in a separate channel, and having the document refer to all of them (so the document is the union of the changes in all of the channels).
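Several of these features reduce to operations on a document's set of channels, which the following Python sketch tries to show. Everything here is hypothetical (`CHANNELS`, `Doc`, `fork`, `aggregate`), changes are plain strings, and dependencies between changes are ignored, so merging a fork back upstream and the "unsubscribe from upstream" variant of forking are not modelled.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)
CHANNELS = {}   # channel id -> list of changes published to that channel


def new_channel():
    channel_id = f"ch{next(_ids)}"
    CHANNELS[channel_id] = []
    return channel_id


@dataclass
class Doc:
    write_to: str                                  # channel this doc publishes to
    subscribed: set = field(default_factory=set)   # additional channels it reads

    def publish(self, change):
        CHANNELS[self.write_to].append(change)

    def changes(self):
        """Union of the changes in all of the document's channels."""
        ids = self.subscribed | {self.write_to}
        return {change for cid in ids for change in CHANNELS[cid]}


def fork(doc):
    """Fork (or inherit from a template): write to a new channel while
    staying subscribed to all of the original document's channels."""
    return Doc(write_to=new_channel(), subscribed=doc.subscribed | {doc.write_to})


def aggregate(feed_channels):
    """A document that merges several feeds, one channel per feed."""
    return Doc(write_to=new_channel(), subscribed=set(feed_channels))


def try_out(doc, experimental_channel):
    """Tentatively include an experimental channel in the document."""
    doc.subscribed.add(experimental_channel)


def revert(doc, experimental_channel):
    """Discard the tentative changes by dropping the channel again."""
    doc.subscribed.discard(experimental_channel)
```

In this model a fork keeps seeing upstream changes while upstream never sees the fork's changes, and a template relationship is the same operation under a different name.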

ept commented 7 years ago

Addendum: Once we have security features (encryption, authentication, access control), I imagine that channels would also be the unit at which permissions would be handled. For example, a user may have read/write access to the channels for their own documents, but read-only access to the channel for a template document. A user can create new channels just for themselves, or choose to grant other users read/write/admin permissions on their channels. From a crypto perspective, a channel would probably be the granularity at which the group key exchange happens.
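As a very rough illustration of per-channel permissions, here is a Python sketch. The permission levels follow the read/write/admin wording above, but the `ChannelACL` class and the rule that only admins may grant access are assumptions; nothing here covers encryption or key exchange.

```python
from enum import IntEnum


class Permission(IntEnum):
    NONE = 0
    READ = 1
    WRITE = 2
    ADMIN = 3


class ChannelACL:
    """Hypothetical access-control list attached to a single channel."""

    def __init__(self, owner):
        # Assumption: the creator of a channel starts out as its admin.
        self.grants = {owner: Permission.ADMIN}

    def level(self, user):
        return self.grants.get(user, Permission.NONE)

    def grant(self, granter, user, permission):
        if self.level(granter) < Permission.ADMIN:
            raise PermissionError(f"{granter} may not grant access")
        self.grants[user] = permission

    def can_read(self, user):
        return self.level(user) >= Permission.READ

    def can_write(self, user):
        return self.level(user) >= Permission.WRITE
```

A template channel would then grant most users `Permission.READ` only, while each user holds `Permission.ADMIN` on the channels they created themselves.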

ept commented 5 years ago

Last October I wrote up a related discussion as a separate document, after a discussion with Peter van Hardenberg and Jeff Peterson. Copying it here for future reference…

We are building applications on Automerge in which the user’s workspace consists not of one big Automerge document, but many separate, small documents (under the slogan “everything is a document”). This approach has a number of advantages:

Each document has a distinct identity that can be represented as a URL, which can be shared with others, and which can be embedded in another document in order to link from one document to another.
Sharing and collaboration can happen at the granularity of a single document, or of a document and all documents it transitively references. This is useful since in many applications, a user may not want to share their entire workspace with others, but rather share at a more granular level.

However, we have been thinking about whether this structure is really the best one. The boundaries between documents could be shifted in two ways:

On the one hand, the notion of a document could be expanded to include the entire workspace, provided that we have a mechanism for sharing just a subset of a document.
On the other hand, documents could become more fine-grained; in the most extreme case, every document would contain only a single object (a map, list, or text object), and one document would be nested within another by having the parent object reference the document URLs of the child objects. This is actually quite like how documents work now: parent objects reference the objectIds of their children, and the distinction between objectIds and document URLs is mostly an incidental aspect of our current implementation.
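The fine-grained end of this spectrum can be sketched in Python as follows. The `doc:` URL prefix, the `store` dictionary and the `materialize` function are invented for illustration and are not how Automerge represents document URLs. Each "document" holds a single object, and a parent refers to its children by URL.

```python
PREFIX = "doc:"   # hypothetical URL scheme for single-object documents


def materialize(url, store, path=()):
    """Expand a document and everything it transitively references.

    Links that would form a cycle, or that point at documents we do not
    have, are left in place as URLs.
    """
    if url in path or url not in store:
        return url
    return _expand(store[url], store, path + (url,))


def _expand(value, store, path):
    if isinstance(value, str) and value.startswith(PREFIX):
        return materialize(value, store, path)
    if isinstance(value, list):
        return [_expand(item, store, path) for item in value]
    if isinstance(value, dict):
        return {key: _expand(item, store, path) for key, item in value.items()}
    return value


store = {
    "doc:board": {"title": "Board", "cards": ["doc:card1", "doc:card2"]},
    "doc:card1": {"text": "first card", "board": "doc:board"},
    "doc:card2": {"text": "second card", "author": "doc:unknown-user"},
}
board = materialize("doc:board", store)
```

Seen this way, a reference to a child URL plays the same role as a reference to a child objectId, and sharing "a document and everything it references" is the set of URLs visited during the traversal.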

These two approaches are actually quite similar: both remove the current grouping of a bunch of objects into a document, which seems like a somewhat arbitrary structure to impose on an application’s data. There are a few reasons why we might want to get rid of the document concept as it currently exists:

We might want to perform changes that span several documents (or rather, span several of the groupings we currently call documents), for example to move some data from one card to another. Currently, dependency tracking (vector clocks) is performed at the granularity of a document. This is perhaps too broadly drawn, since it creates dependencies from one object’s changes to changes to unrelated objects that happen to be in the same document. In fact, currently all operations are local to a single object, and so it may actually be sufficient to track dependencies on a per-object basis. (Note, however, that if we introduce a move operation, those operations will involve multiple objects, so it will no longer be the case that all operations are local to a single object.)
The distinction between objectIds and document URLs introduces an unnecessary barrier to accessing the data. If we eliminate this distinction, we could more easily imagine performing joins (in the relational database sense) across different documents, for example joining a reference to a user ID with the user’s identity document, in order to render something in the UI with a user avatar (obtained from the identity document).
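The difference between per-document and per-object dependency tracking can be illustrated with a small Python sketch. The change layout (`actor`, `seq`, `object`, `deps`) is a simplification chosen for this example, not Automerge's wire format. Alice edits card1 and then card2, and a receiver happens to get the second change first.

```python
def is_ready(deps, clock):
    """True once the clock has reached every (actor, seq) in deps."""
    return all(clock.get(actor, 0) >= seq for actor, seq in deps.items())


def record(change, clock):
    """Note in the given clock that the change has been applied."""
    actor = change["actor"]
    clock[actor] = max(clock.get(actor, 0), change["seq"])


# Per-document tracking: one sequence per actor for the whole document, so
# the edit to card2 depends on the unrelated, earlier edit to card1.
doc_changes = [
    {"actor": "alice", "seq": 1, "object": "card1", "deps": {}},
    {"actor": "alice", "seq": 2, "object": "card2", "deps": {"alice": 1}},
]

# Per-object tracking: each object has its own sequence and clock, so the
# edit to card2 has no dependency on card1.
obj_changes = [
    {"actor": "alice", "seq": 1, "object": "card1", "deps": {}},
    {"actor": "alice", "seq": 1, "object": "card2", "deps": {}},
]


def can_apply_per_document(change, doc_clock):
    return is_ready(change["deps"], doc_clock)


def can_apply_per_object(change, object_clocks):
    return is_ready(change["deps"], object_clocks.get(change["object"], {}))
```

A receiver with an empty clock has to buffer the card2 change under per-document tracking, but can apply it immediately under per-object tracking. A move operation spanning two objects would need dependencies in both objects' clocks, which this sketch does not cover.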
