automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License
14.75k stars 466 forks source link

Feature request: Pruning support #422

Closed Cypher1 closed 3 years ago

Cypher1 commented 3 years ago

Wondering if Automerge is considering / interested in adding support for pruning / Garbage collection etc.

There's significant size issues for some use cases (our app has issues with storing data in lists/array style data and is currently debating changing our storage backend to get more storage (mainly) for Automerge CRDTs).

If this isn't supported we can find work arounds, but it'd be great to at least hear about the road map.

Thanks for the CRDTs :)

ept commented 3 years ago

Hi @Cypher1, this is something we've been considering, and it has been discussed on issue #51.

Are you finding excessive storage use even with Automerge's new (compressed) binary storage format? I'd be interested to hear more about your workload and why it is causing the files to be so big. Generally my tendency is to first make it efficient to store history, and only to delete history if storing it is untenable. Pruning/garbage collecting history is possible, but it introduces edge cases that require a lot of care (e.g. some nodes may no longer be able to sync with a node that has deleted its history).

Cypher1 commented 3 years ago

Thanks @ept The binary compression does help. I think we're can do a bit more on our side too, changing the way that our changes are applied.

Currently (rough understanding from a short chat with my tech lead) we're doing things like replacing a string, rather than modifying it. My TL is looking into that, but another engineer has also gotten us more storage space, so we're fine for the short term.

Thanks again, J

ept commented 3 years ago

Okay, good to hear that. In general, in Automerge, it's best to make your updates as fine-grained as possible. Not only will this be more efficient, it also enables better merging behaviour. For example, if two users replace an entire string, the merged result will be only one of the two users' strings, and the other will be discarded. If you instead structure the update as inserting or deleting individual characters in the text, then those edits can be merged.

corwin-of-amber commented 2 years ago

Since you were asking for use cases: I have a collaborative editor using OT that has the operations stored in an Automerge list and the users' cursors in a dictionary. The operations are (mostly) append-only, but cursors move around a lot. When a new collaborator joins, sending all the past cursors take seconds (!) even for a tiny document. Clearly the cursors are a non-critical part and removing them from the history should not prohibit merging. So, can anything be done about it? At least, if it's not removed from history, can they somehow be skipped when communicating with new peers?

BTW document snapshots using the binary format in Automerge 1.0.x are awesome.

pvh commented 2 years ago

Just a note that I generally use cursors as my go-to example of "data that is useful to synchronize but I wouldn't store in a CRDT". Is there some reason you're persisting it?

corwin-of-amber commented 2 years ago

Good point! My first approach was to dispatch the cursors using messages. And I would probably go that way. But when I thought about it a bit more, it's not that clear-cut. I totally agree with you: it is a case of data that does not need to be persisted yet is "awfully convenient" to have in the CRDT, for several reasons:

This can perhaps be solved by exposing some kind of key-value store that would be multiplexed over the same connection that carries the Automerge sync messages. It seems that, at least in this case, this is perfectly doable completely separately from Automerge; but it could also be a good idea to reuse some of the Automerge machinery.

ept commented 2 years ago

Thanks for your use case. Compacting the history would be one way of doing this, but I think it makes sense to also support ephemeral data as part of the repository API (#486, work in progress). This approach would send ephemeral updates over the same network connection as persistent updates, and we can let it preserve causal ordering as well.

corwin-of-amber commented 2 years ago

Cool! Thanks @ept, this looks like an immensely useful proposal (I have struggled with precisely such things as the multi-peer connection myself and I'm not 100% sure that I got it right), and the ephemeral state item hits the spot. Looking forward to it.

pvh commented 2 years ago

So, anecdotally I agree with your various concerns but have found "just push it over the messaging channel" has worked just fine in all our prototypes.

One other approach I have considered but we have not experimented with is creating non-persisted edits. The approach (which would require changes to the Automerge API) would be to create an actor which is synchronized but not written to disk and which other persisted actors do not depend upon.

This would give you what you want but at the cost of a bunch of extra conceptual and API complexity, and in our discussiong about the idea it hasn't been clear that it's worth the confusion (how could you tell which data was durable versus not?) when in our experience "just send it as a message" has actually worked just fine.

Would love to hear your take on this. No promises or anything, but outside perspectives are welcome.

corwin-of-amber commented 2 years ago

A side channel for sending cursor data is totally acceptable, but "just push it over to the messaging channel" does not work at all; even in my current implementation, I had to be extra careful about when the text editing frontend applies the changes relative to when it processes and displays the cursors. When I failed to do it, the result was inaccurate and erratic cursor positioning, and some cursors not appearing at all because the location ended up being outside the active document or has just been gobbled up by the user's deletion operation.

It seems a reasonable assumption that the messages can also be synchronized with the document revisions, in the same way that changes are ordered among themselves. But it's definitely not going to be the naïve solution; these messages need to refer to a revision, and the client must wait for the revision to arrive. If the revision occurs in the past, then the client has to adjust the cursor positioning – although, arguably, this is not as critical when cursor values represent coordinates post-change, because that will make this event pretty rare (only when edits arrive at the exact same time slot).

Here is a suggestion (which perhaps can already be implemented on top of the existing API): hold two documents, where one is persisted and holds the edit history, and the other is a key-value store of cursor values. The second document can be re-created at any time as long as currently connected clients are notified and switch over to the new copy. The missing piece of the puzzle is then how to ensure causal consistency between two documents; something that @ept's proposal seems to address as well.

pvh commented 2 years ago

That's great input, thanks.