Proposal for a new Automerge.TextSpan type

automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.

http://automerge.org/

MIT License

14.75k stars 467 forks source link

Proposal for a new Automerge.TextSpan type #239

Open nornagon opened 4 years ago

nornagon commented 4 years ago

In order to support the Ordered Array of Formatting Spans (OAFS) model for Rich Text, it's necessary to maintain a list of references into an Automerge.Text object, along with the formatting attributes to be applied to the given range. For illustration, consider represent the string "Gandalf the Grey", where the string "Gandalf" is bold:

{
  "text": "Gandalf the Grey",
  "attributes": [
    { "from": 0, "to": 7, "attributes": { "bold": true } }
  ]
}

However, in order to make sure the references remain consistent across operations, instead of using index references, one should store the ID of the referenced character. IDs for individual characters are not currently exposed by Automerge, and further, IDs would need to be translated back to string offsets in order to be meaningful to a client.

Spans would also be useful for representing and transforming user cursors/selections.

API

Creating a TextSpan would be done from the corresponding Text object:

const text = new Automerge.Text('Gandalf the Grey')
const span = text.getSpan(0, 7)

Spans are Automerge objects, so they can be stored in a document:

const doc = Automerge.from({
  text: new Automerge.Text('Gandalf the Grey'),
  attributes: []
})
doc.change(d => {
  const span = d.text.getSpan(0, 7)
  d.attributes.push({ span, attributes: { bold: true } })
})

Spans can be inspected, and have "start" and "end" properties that are the calculated offsets into the Text object which they reference:

doc.attributes[0].start === 0
doc.attributes[0].end === 7

When the associated Text has characters inserted or removed, Span objects will stay up-to-date:

const text = new Automerge.Text('Gandalf the Grey')
const doc = Automerge.from({
  text,
  span: text.span(0, 7)
})

const doc2 = doc.change(d => {
  d.text.insertAt(3, 'non')
})

doc2.span.end === 10

nornagon commented 4 years ago

Hm, one potential complication with implementing this by storing IDs to the beginning/end of the span is that the IDs would need to be updated when the characters to which they refer are deleted.

For instance:

Gandalf the Grey
    ^              here's my pointer
  |___|            delete these 5 characters
Ga the Grey        new string
 ^                 pointer should end up here

ept commented 4 years ago

Thank you for the proposal. We definitely need something along these lines, and I hope we can work out a good API together.

First, a simplification suggestion: rather than a span (which consists of two positions: start and end), would it be sufficient to have a "cursor" or "pointer" or "reference" (whatever we want to call it) type? Then a span could use two of those cursors, while use cases that require only a single position can just use one.

Second, whatever we do should fit within Automerge's immutable model. Creating a cursor will require a state change in the document (because Automerge will have to keep track of the cursor so that it can update its index when remote users change the document, and that will have to be recorded in a data structure somewhere), so the function for creating a cursor will have to return a new document object.

Another thing to decide is: should a cursor be a persistent object that is recorded as part of the Automerge state and replicated to other documents, or should it just exist locally in memory? If in memory, we could still allow it to be sent to another user out-of-band, e.g. so that you can visualise other users' cursor positions in a text editor. To send cursor positions, the API would need to let you get the elemId of the character referenced by the cursor, and at the other end allow you to create a cursor given such an elemId.

As I noted in my comment on #240, translating between list indexes and elemIds will require a round-trip through the backend, so the API will need to allow for that asynchronicity. We can still have a synchronous convenience API for users who are just using a single thread.

nornagon commented 4 years ago

Thank you for the proposal. We definitely need something along these lines, and I hope we can work out a good API together.

First, a simplification suggestion: rather than a span (which consists of two positions: start and end), would it be sufficient to have a "cursor" or "pointer" or "reference" (whatever we want to call it) type? Then a span could use two of those cursors, while use cases that require only a single position can just use one.

Yeah, I think that makes sense. My thought with going for "Span" was that a "cursor" could be represented by a span where the start & end point to the same place. But I think you're right that the single-reference version is the more versatile API.

Second, whatever we do should fit within Automerge's immutable model. Creating a cursor will require a state change in the document (because Automerge will have to keep track of the cursor so that it can update its index when remote users change the document, and that will have to be recorded in a data structure somewhere), so the function for creating a cursor will have to return a new document object.

Hm, I'm not sure I follow. If a cursor is merely an ID (i.e. the identifier of a particular character), then creating one shouldn't necessitate a document change.

Another thing to decide is: should a cursor be a persistent object that is recorded as part of the Automerge state and replicated to other documents, or should it just exist locally in memory? If in memory, we could still allow it to be sent to another user out-of-band, e.g. so that you can visualise other users' cursor positions in a text editor. To send cursor positions, the API would need to let you get the elemId of the character referenced by the cursor, and at the other end allow you to create a cursor given such an elemId.

@pvh had some thoughts here about sharing cursors by committing them to the document using special actorIds that were "ephemeral", i.e. whose changes would not be saved to disk. However, I think it would also make sense to be able to manipulate fully out-of-document references.

As I noted in my comment on #240, translating between list indexes and elemIds will require a round-trip through the backend, so the API will need to allow for that asynchronicity. We can still have a synchronous convenience API for users who are just using a single thread.

For references which are stored in a document, would it make sense to do that translation eagerly, i.e. at the time the patch is applied, rather than lazily when the reference is requested? I think an async API for this would be a significant usability impediment. (For out-of-document references, if automerge supports them, it seems fine to provide an async API, since there's no way that the backend could know about them.)

pvh commented 4 years ago

If it is synchronized, replicated and has a history, it is a change to the document and we should model it as one. If we want "ephemerality" that is orthogonal to cursor-ness and we can handle that separately. :) I don't have strong feelings about the API at this point.

ept commented 4 years ago

Actually, yes — for the OAFS model, the cursors have to be part of the persistent state of the document. We can handle ephemeral state separately, as @pvh proposed. So this suggests an API like:

const doc = Automerge.change(d => {
  d.text = new Automerge.Text('Gandalf the Grey')
  d.attributes = []
  d.attributes.push({
    start: d.text.getCursorAt(0), // alternative: new Automerge.Cursor(d.text, 0)
    end: d.text.getCursorAt(7),
    attributes: { bold: true }
  })
})
doc.attributes[0].start.index // 0
doc.attributes[0].start.elemId // available after round-trip through backend

Does that look reasonable? Do you want to have a go at implementing it, or should I?

nornagon commented 4 years ago

I think you meant to include const doc2 = doc.change(...); doc2.attributes[0]...?

What happens if you call getCursorAt outside of a change block?

I'm happy to take a stab at implementing it, though I'm not too familiar with automerge's internals. I might have to DM you a question from time to time 😅

ept commented 4 years ago

Sorry, I fixed the code snippet.

What happens if you call getCursorAt outside of a change block?

I think the cursor object should be immutable, just like all other parts of an Automerge document. Therefore, I suggest that when you call getCursorAt you get an Automerge.Cursor object representing that position at that point in time, but that cursor will not be updated as the document changes.

I might have to DM you a question from time to time

No problem!

geoffreylitt commented 3 years ago

I'm trying to build a text editor with Automerge that supports inline comments referencing spans in the text. For my use case I think I could just use marker objects to denote where comments start/end (because comments don't combine with each other like formatting rules do), but it still seemed slightly more natural to store the cursors outside the text itself, which brought me to this issue.

If no one else has worked on this, I'd be quite interested in attempting an implementation, but I have a few questions. Sorry if these don't make sense as I'm still learning the codebase.

First, @ept I was curious if you had thoughts on how to efficiently keep cursors updated? Assuming no changes to the backend, it seems to me that the simplest approach would be something like: each cursor stores an elem ID. Whenever the text changes, search for the elemID in the new elems list, and save that new index in the cursor. But this is O(n), as discussed in #198. I could imagine fancier approaches like "when 3 characters get inserted before the cursor, increment its index by 3" -- that sounds error-prone given my lack of knowledge, but maybe could work?

Second, for the JSON serialization of the cursor in the doc, would it make sense to just expose the integer index as the only information and keep the elem ID as a private implementation detail? Or is the ID considered part of the public interface?

Third -- are there any examples of existing classes in the codebase that would most closely resemble Cursor? Maybe a simple object like Counter?

ept commented 3 years ago

Hi @geoffreylitt, sorry for the delayed reply, I've been on vacation. Great to hear that you're looking into this.

I also like the approach of having a cursor store an elemID, but this raises the challenge of updating the index when the text changes, as you correctly point out. The O(n) search might be worth trying in the first instance, since it's so simple and for short documents it may well perform well enough. If it's too slow, we can replace the linear search with a more sophisticated data structure.

Here is a sketch of a data structure that can efficiently convert elemIDs to indexes. Break up the text into chunks of approximate equal size. For each chunk, construct a Bloom filter containing the elemIDs in that chunk, and store the number of characters in that chunk. To get the index for an elemID:

function elemIDtoIndex(elemID) {
  let prevChunks = 0
  for (let chunk of chunks) {
    if (chunk.bloomFilter.matches(elemID)) {
      const chunkIndex = chunk.elemIDs.indexOf(elemID) // returns -1 if not found
      if (chunkIndex >= 0) {
        return prevChunks + chunkIndex
      }
    }
    prevChunks += chunk.numChars
  }
}

To insert a character into the document, insert it into the appropriate chunk; if the chunk size exceeds some limit, break it into two adjacent chunks of approximately equal size. Similarly after deleting a character, merge two adjacent chunks into one if they get smaller than some lower bound. This is still technically O(n), but it should be much faster, since we can skip over most of the chunks and only need to scan the elemIDs in one chunk (or occasionally a few chunks due to Bloom filter false positives). Essentially this gives us a two-level tree (chunks are the upper level, individual characters the lower level); it could be generalised into a multi-level B-tree, which would make it O(log n).

I don't think we actually need to implement that algorithm right now — I just wanted to outline it to show that indexes can efficiently be computed from elemIDs.

On your other points:

For serialization, I think I would store the elemID (not the index), since the elemID is a stable identifier while indexes may change when merging changes.
Yes, Counter is a good example to look at.

ept commented 3 years ago

@geoffreylitt A follow-up thought. The conversion from elemIDs to indexes could happen either in the frontend or in the backend, and I'm not sure which is better. If it happens in the backend, then the patches sent from backend to frontend will need to include index updates for any cursors whose indexes have changed. That would require a bit of housekeeping to keep track of all of the cursors, but it would avoid having to maintain extra data structures on the frontend. The backend already performs elemID-to-index conversion anyway, so instinctively it seems more logical for the backend to do this work on behalf of the cursors. But I'm not sure.

geoffreylitt commented 3 years ago

Thanks for the notes @ept! Sorry for my delayed reply as well, just returning from my own vacation :)

That makes sense to try a linear search first, and the Bloom filter / B-tree idea seems like a neat trick for optimizing. Depending on where the search is happening, maybe it could be fine to have a slow-ish search as long as local text edits are still low-latency, since remote edits have to deal with network latency anyway?

re: backend vs frontend: am I understanding correctly that the SkipList in the backend can already efficiently convert elem IDs to indexes? Also, I noticed that on #240 you wrote the comment below:

On the performance branch I've been working on I have actually removed the elemIds array from Automerge.Text, because I would like to keep the frontend as simple and lean as possible, and maintain this metadata in the backend instead. For users who are running everything on one thread, this is not a problem, but for users who are running frontend and backend on separate threads, it will no longer be possible for the Automerge.Text object to look up elemIds.

Is that still accurate?

If it's true that the backend can already efficiently do this conversion, and that you're moving away from elem IDs being available at all in the frontend in the future, perhaps those both point towards doing this work in the backend? Also, if the frontend-backend protocol is changing for the performance branch anyway, might it make sense to just add this cursor feature on that branch rather than make the change on main?

For serialization, I think I would store the elemID (not the index), since the elemID is a stable identifier while indexes may change when merging changes.

Ah, I was confused here--I had thought the JSON serialization was the main interface for how the user would normally read the value of the cursor. But now I see that it's just a way of transporting changes over the wire, and the cursor would be an object where the user could request either the index or the elemId, as in your code snippet above. So this makes sense.

ept commented 3 years ago

Depending on where the search is happening, maybe it could be fine to have a slow-ish search as long as local text edits are still low-latency, since remote edits have to deal with network latency anyway?

Yes, that's a good point. As long as the search doesn't lock up the UI thread for so long that it introduces noticeable lag, it should be ok.

am I understanding correctly that the SkipList in the backend can already efficiently convert elem IDs to indexes?

Correct; in fact, that's its main purpose. It's efficient in the sense that conversions are O(log n), although it's still quite slow in practice when performing lots of operations. On the performance branch I am in the process of removing the skip list, and replacing it with something like the Bloom/B-tree described earlier.

I have actually removed the elemIds array from Automerge.Text

We ended up reverting this because it created more trouble than it was worth. As of last week (#293), the Automerge.Text frontend once again knows the elemId for each character.

Also, if the frontend-backend protocol is changing for the performance branch anyway, might it make sense to just add this cursor feature on that branch rather than make the change on main?

Yes, I think it would be worth basing the work on the performance branch; we are hoping to ship a release based on this branch within the next few months. However, for a first simple implementation, you can use the method Automerge.Text.getElemId(index) to get the element ID for a particular index; this API is the same on the main and performance branches.

geoffreylitt commented 3 years ago

@ept here's an initial stab at the frontend-backend protocol changes needed to support cursor maintenance in the backend. No idea if this makes sense given my limited experience with this codebase, but curious what you think:

add a new type of change op for initializing a cursor. Since cursors are immutable, perhaps the creation op also communicates the objId of the Text that the cursor is pointing to, and the elemId of the character?: { action: 'makeCursor', obj: objectId, textObjId: objectId, cursorElemId: elemId }
Whenever the text is changed, the backend emits a patch to tell the frontend to update the cursor index: { obj: objectId, type: 'cursor', key: 'index', action: 'set', value: number }

However, for a first simple implementation, you can use the method Automerge.Text.getElemId(index) to get the element ID for a particular index; this API is the same on the main and performance branches.

I tried hacking on this a bit to see if I could do a simple implementation entirely in the frontend. One barrier was that I wasn't sure how to store a reference to the text object inside the cursor. It seems most obvious to store an object ID, but I couldn't find a way to resolve that object ID to the Text object itself from within the frontend.

In any case, it seems like you think the backend is the right place for the cursor maintenance to live anyway, so maybe there's not much of a point to trying to do an implementation in the frontend.

ept commented 3 years ago

Hi @geoffreylitt, sorry for the delay.

One barrier was that I wasn't sure how to store a reference to the text object inside the cursor. It seems most obvious to store an object ID, but I couldn't find a way to resolve that object ID to the Text object itself from within the frontend.

You can use Automerge.getObjectId(doc.text) to get the objectId of a text object, and Automerge.getObjectById(doc, objectId) to get the object with a given ID. Those are frontend functions.

For the frontend-backend protocol, rather than defining a new op type I think it might be better to define a new datatype. This is how Automerge.Counter objects are implemented. The difference is that a new object type (defined by a new 'make*' op type) is currently always a container for multiple child objects, whereas a datatype indicates how a particular value should be interpreted.

However, I'm a bit hesitant to make changes to the frontend-backend protocol right now because we're trying to get mammoth effort #300 to land, and this involves a revamp of that protocol. So my inclination would be to do cursors frontend-only for now if possible, and look into backend support later as an optimisation.

geoffreylitt commented 3 years ago

My notes from call with @ept, @pvh, @sliminality:

Reactions to initial PR

There is concern about the approach taken in #306 creating coupling between frontend and backend, given that there are at least some projects running them in different threads. That same general approach could be tweaked to require an async call from backend to frontend to resolve an elem ID to an index, but that didn't seem very ergonomic.

Cursor datatype

A more complete approach--as discussed at various points in this GH thread--is to add a Cursor datatype, support it in change requests + patches, have the backend maintain cursor indexes, and send patches to the frontend to update cursors when necessary. From the frontend perspective, cursors would essentially just be integers that automatically update when needed. Seemed like general consensus that this is the best path forward despite the extra work.

One minor question is how to initialize all the cursor indexes on doc load, but seemed like @ept had thoughts on a reasonable way to do this.

Zombie text

An important question that emerged is what to do with cursors when the text object is entirely replaced. It's dangerous to let cursors keep returning an index into an object that's no longer reachable. On the call we discussed storing the parent ID, but I don't think that would work because the parent could also be replaced; maybe we need a path stored all the way from the root, and need to re-check the path still points to the object every time? (presuming there's no other more efficient mechanism to check reachability)

A related proposal was to somehow store the cursors "together with" the text object itself so they all get replaced together, but it's not clear to me quite how this would work since users need to store other data along with the cursor. This also wouldn't really support storing cursors outside of Automerge, or having cursors pointing to text in other Automerge docs.

Other notes

OAFS is still not proven out, so might make sense to prototype some rich text integrations before getting too far with a specific cursors API
It would be really nice for Automerge to have example integrations with common rich text libraries
What about generalizing this to some kind of "pointer" to any part of an Automerge doc? Tabled for now.
On the performance branch, elemID includes objectID, which might be helpful

Next steps

@sliminality and I will figure out a way to usefully hack on this. I'd guess going further up the stack and prototyping rich text stuff on top of #306 to figure out a useful API?
@ept said he might try the Cursor datatype implementation at some point just to see if he can get it done quickly