automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License
14.75k stars 466 forks source link

More compact change and patch format for list/text insertions/deletions #311

Closed ept closed 3 years ago

ept commented 3 years ago

The current frontend-backend protocol is very verbose for insertions and deletions of sequences of list items (or characters in text). For example, creating the list [1, 2, 3] results in the following change:

{
  "actor": "", "seq": 1, "startOp": 1, "time": 0, "deps": [], "ops": [
    {
      "action": "makeList",
      "obj": "_root",
      "key": "list",
      "pred": []
    },
    {
      "action": "set",
      "obj": "1@48b7977de26e44b29f35a1dc75c66d3c",
      "elemId": "_head",
      "insert": true,
      "value": 1,
      "pred": []
    },
    {
      "action": "set",
      "obj": "1@48b7977de26e44b29f35a1dc75c66d3c",
      "elemId": "2@48b7977de26e44b29f35a1dc75c66d3c",
      "insert": true,
      "value": 2
      "pred": []
    },
    {
      "action": "set",
      "obj": "1@48b7977de26e44b29f35a1dc75c66d3c",
      "elemId": "3@48b7977de26e44b29f35a1dc75c66d3c",
      "insert": true,
      "value": 3
      "pred": []
    }
}

and the following patch:

{
  "objectId": "1@48b7977de26e44b29f35a1dc75c66d3c",
  "type": "list",
  "edits": [
    {"action": "insert", "index": 0, "elemId": "2@48b7977de26e44b29f35a1dc75c66d3c"},
    {"action": "insert", "index": 1, "elemId": "3@48b7977de26e44b29f35a1dc75c66d3c"},
    {"action": "insert", "index": 2, "elemId": "4@48b7977de26e44b29f35a1dc75c66d3c"}
  ],
  "props": {
    "0": {"2@48b7977de26e44b29f35a1dc75c66d3c": {"value": 1}},
    "1": {"3@48b7977de26e44b29f35a1dc75c66d3c": {"value": 2}},
    "2": {"4@48b7977de26e44b29f35a1dc75c66d3c": {"value": 3}}
  }
}

Creating an Automerge.Text object with the content 'a', 'b', 'c' looks very similar. If you copy and paste a few kilobytes of text, the JSON is huge.

This verbosity is not only an efficiency issue, but the patch format is also painful for applications to work with when using the observable API (#308). Note that the indexes of elements to insert are in edits, while the values associated with those indexes are in props. This seemed like a good idea at the time, but it's pretty annoying now…

I propose extending the format for changes and patches to group together consecutive insertions or deletions. This affects only the JSON format used in the frontend-backend protocol; the binary format already uses compression that performs this grouping. For example, the above change might be represented as:

{
  "actor": "", "seq": 1, "startOp": 1, "time": 0, "deps": [], "ops": [
    {
      "action": "makeList",
      "obj": "_root",
      "key": "list",
      "pred": []
    },
    {
      "action": "set",
      "obj": "1@48b7977de26e44b29f35a1dc75c66d3c",
      "elemId": "_head",
      "insert": true,
      "values": [1, 2, 3]
    }
}

For an insertion of multiple consecutive values into a list/text, an operation with a values array expands into multiple operations, one per element of the array. To keep things simple, I propose that this optimisation is only used when all of the values all primitive datatypes representable in JSON (string, boolean, number, null). For nested objects we use a make* operation like we do now, and for values not representable in JSON (e.g. Date or Counter objects) we use a set operation with a single value and a datatype tag, like we do now.

We can also optimise the representation of sequential deletions, assuming that the elemIds of the elements being deleted all have the same actor, and consecutive counter values. For example, to delete the three elements we inserted earlier:

    {
      "action": "del",
      "obj": "1@48b7977de26e44b29f35a1dc75c66d3c",
      "elemId": "2@48b7977de26e44b29f35a1dc75c66d3c",
      "multiOp": 3
    }

The multiOp property indicates that this deletion should be expanded into three deletions, deleting elements with IDs 2@48b7977de26e44b29f35a1dc75c66d3c, 3@48b7977de26e44b29f35a1dc75c66d3c, and 4@48b7977de26e44b29f35a1dc75c66d3c respectively. If we want to delete a range of elements whose elemIds contain more than one actor, or whose counters are not consecutive, we first break the deleted range into sub-ranges that are consecutive, and issue a deletion op for each sub-range.

In the opposite direction (backend to frontend), we can change the patch format to:

{
  "objectId": "1@48b7977de26e44b29f35a1dc75c66d3c",
  "type": "list",
  "edits": [
    {"action": "insert", "index": 0, "elemId": "2@48b7977de26e44b29f35a1dc75c66d3c", "values": [1, 2, 3]}
  ]
}

Again this works with primitive values; I'm less clear on how to represent nested objects, datatypes that need a type tag (e.g. Date), and conflicts (multiple values associated with the same list element). For consecutive deletions, the patch only needs to contain the start index and number of elements to delete.

If we're putting all the inserted values in the edits section of the patch, then what about assigning the value of an existing list element? At the moment assignments are done through the props section, but maybe they should go in the edits array as well.

For text insertions, a further thing we could do: rather than making the values property an array of individual characters, we could just put the raw string (all the characters concatenated) directly in the change/patch, and have the backend break it apart into individual characters. However, this does raise the question of how exactly we should handle complex unicode characters. Perhaps the simplest decent solution is to use Array.from(str) to break a string into Unicode code points, although this does mess up characters that are a combination of several code points (e.g. accents, or some emoji). A more thorough approach would be to split the string into grapheme clusters, but that's getting beyond the scope of this issue.

Any thoughts or concerns? Obviously this change will need to be coordinated with automerge-rs.

echarles commented 3 years ago

Thx @ept for this proposal. I recently had to parse the deep patch, and having a simple less nested structure will make life easier. I understand this would need to be coordinated with automerge-rs

An parallel discussion is about having a patch format compliant with something like delta format https://quilljs.com/docs/delta/ How difficult would it be to have an additional method to convert a automerge patch to a quill delta?

ept commented 3 years ago

Compatibility with Quill's delta format is an interesting one, and I will leave the details to @sliminality and @geoffreylitt, who are working on rich text support. I suspect that for ascii plain text, the conversion will be quite straightforward. Things get trickier once we go beyond ascii. It appears that Quill defines string lengths and indexes in terms of JavaScript's String.length property, which actually returns the number of UTF-16 code units. Things that you might think of as one character, such as emoji, take up a whole bunch of UTF-16 code units. For example, the rainbow flag emoji consists of 4 Unicode code points and 6 UTF-16 code units:

'🏳️‍🌈'.length // returns 6
Array.from('🏳️‍🌈').length // returns 4

So Quill would regard one such emoji as a sequence of six "characters", but that is a specific quirk due to its use of JavaScript. If you want compatibility with with Rust or Python, which are more likely to use UTF-8 strings (not UTF-16), you have to use a different way of breaking up a string into characters, and then we have the problem that Quill and the Rust/Python define the "length" of the same string differently. Thus, a patch that says "insert character 'a' at index 532" will need to take account of the fact that the first 532 characters of the document may contain emoji, and thus the insertion position will not be where we expect it to be.

This is a bit of a digression from the topic of the frontend-backend protocol, though, and it's not unique to Automerge — any system that wants to support both JS-based text editors and non-JS languages will run into this problem. Anyone know how other projects solve this?

ept commented 3 years ago

Okay, here's a concrete proposal for an updated patch format, expressed using TypeScript types since it's a convenient notation:

  // A patch is sent from the backend to the frontend, describing how a document
  // needs to be updated to reflect one or more changes that have been applied.
  interface Patch {
    actor?: string  // the actor that initiated the change (only on local changes)
    seq?: number    // sequence number of the change (only on local changes)
    clock: Clock    // vector clock (mapping from actor ID to max sequence number)
    deps: Hash[]    // hashes of the latest changes applied by the backend
    diffs: MapDiff  // updates to the root object
  }

  // Describes changes to a map (in which case propName represents a key in the
  // map) or a table object (in which case propName is the primary key of a row).
  interface MapDiff {
    objectId: OpId        // ID of object being updated
    type: 'map' | 'table' // type of object being updated
    // For each key/property that is changing, props contains one entry
    // (properties that are not changing are not listed). The nested object is
    // empty if the property is being deleted, contains one opId if it is set to
    // a single value, and contains multiple opIds if there is a conflict.
    props: {[propName: string]: {[opId: string]: MapDiff | ListDiff | ValueDiff}}
  }

  // Describes changes to a list or Automerge.Text object, in which each element
  // is identified by its index.
  interface ListDiff {
    objectId: OpId        // ID of object being updated
    type: 'list' | 'text' // type of objct being updated
    // This array contains edits in the order they should be applied.
    edits: (SingleInsertEdit | MultiInsertEdit | UpdateEdit | RemoveEdit)[]
  }

  // Describes the insertion of a single element into a list or text object.
  // The element can be a nested object.
  interface SingleInsertEdit {
    action: 'insert'
    index: number   // the list index at which to insert the new element
    elemId: OpId    // the unique element ID of the new list element
    value: MapDiff | ListDiff | ValueDiff // description of the value to insert
  }

  // Describes the insertion of a consecutive sequence of primitive values into
  // a list or text object. In the case of text, the values are strings (each
  // character as a separate string value). Each inserted value is given a
  // consecutive element ID: starting with `elemId` for the first value, the
  // subsequent values are given elemIds with the same actor ID and incrementing
  // counters. To insert non-primitive values, use SingleInsertEdit.
  interface MultiInsertEdit {
    action: 'multi-insert'
    index: number   // the list index at which to insert the first value
    elemId: OpId    // the unique ID of the first inserted element
    values: (number | boolean | string | null)[] // list of values to insert
  }

  // Describes the update of the value or nested object at a particular index
  // of a list or text object. In the case where there are multiple conflicted
  // values at the same list index, multiple UpdateEdits with the same index
  // (but different opIds) appear in the edits array of ListDiff.
  interface UpdateEdit {
    action: 'update'
    index: number   // the list index to update
    opId: OpId      // ID of the operation that performed this update
    value: MapDiff | ListDiff | ValueDiff // new value at this list index
  }

  // Describes the deletion of one or more consecutive elements from a list or
  // text object.
  interface RemoveEdit {
    action: 'remove'
    index: number   // index of the first list element to remove
    count: number   // number of list elements to remove
  }

  // Describes a primitive value, optionally tagged with a datatype that
  // indicates how the value should be interpreted.
  interface ValueDiff {
    value: number | boolean | string | null
    datatype?: DataType
  }

  type DataType =
    | 'counter'   // Integer treated as Automerge.Counter object
    | 'timestamp' // Integer treated as Date object (milliseconds since epoch)

The format for maps is unchanged. For lists/text, we no longer have a props object, and instead we put the values directly in the edits. For a list element that is updated in place (or when a nested object inside a list element is updated), we have a new 'update' edit type. Conflicts on list elements are represented by having multiple 'update' edits for the same index. For multiple insertions of primitive values in the same place, we have a new 'multi-insert' edit, which contains a simple array of values. For text, I am assuming that the splitting of an inserted string into an array of characters happens on the frontend; this avoids having to encode a particular policy for splitting strings (by Unicode code point? by grapheme cluster?) into the frontend-backend protocol.

Any thoughts?

alexjg commented 3 years ago

This looks great and should make it easier to implement some of the frontend changes efficiently.

Re. the string splitting point, does this mean that applications with frontends that make different decisions on how to split strings would end up with divergent values for those strings?

ept commented 3 years ago

the string splitting point, does this mean that applications with frontends that make different decisions on how to split strings would end up with divergent values for those strings?

No, what I meant was: something (either a frontend or the application) decides how a string should be split into characters when it is first fed into Automerge, and thereafter all the Automerge machinery preserves that split, however it was done. Thus, Automerge can avoid preferring one splitting policy over another, and leave it up to the application. Does that make sense?

One thing we probably do need to assume is that the splitting process keeps Unicode code points intact. This means that JavaScript's string.split('') cannot be used, because it breaks up non-BMP characters into two UTF-16 code units, and IIRC those individual code units do not have a valid UTF-8 encoding, and this will cause problems in the backend if it validates UTF-8. However, Array.from(string) would be okay for splitting the string, since it keeps code points intact. Grapheme cluster splitting would also be okay.

echarles commented 3 years ago

@ept I think overall any simplification is good to take and I can give my +1 on this proposal. However, to make things easier to understand, could you past the resulting patch of e.g. this change (I have not tested the code, but you get the idea).

const doc = Automerge.init({ text: new Automerge.Text('hello')})
Automerge.applyChanges(doc, d => {
  d.text.insert('2', ...'abc');
});
ept commented 3 years ago

The insertion of 'abc' at index 2 will result in this patch:

{
    "actor": "9886a0e72e674092992f10e674860cf1",
    "seq": 2,
    "maxOp": 9,
    "clock": {"9886a0e72e674092992f10e674860cf1": 2},
    "deps": [],
    "diffs": {
        "objectId": "_root",
        "type": "map",
        "props": {
            "text": {
                "1@9886a0e72e674092992f10e674860cf1": {
                    "objectId": "1@9886a0e72e674092992f10e674860cf1",
                    "type": "text",
                    "edits": [
                        {
                            "action": "multi-insert",
                            "index": 2,
                            "elemId": "7@9886a0e72e674092992f10e674860cf1",
                            "values": ["a", "b", "c"]
                        }
                    ]
                }
            }
        }
    }
}

We are retaining the nesting structure, where the updates to the text object appear nested inside rootObject.props.text['1@9886...'], because this structure simplifies frontends (Automerge 0.x has a flat patch structure without this nesting, and it is a pain to interpret correctly, which is why we moved to this nested structure for 1.x). However, the patch for the text object is much simplified, requiring only a single multi-insert action.

echarles commented 3 years ago

I guess the objectId,... fields make sense from within Automerge. As a consumer, I guess the following is enough to me, but this does not take into account multiple changes, on the same text field, or on multiple different fields. As I said, any simplification is good to have, still having the flexibility and power to cover more complex cases.

{
    "actor": "9886a0e72e674092992f10e674860cf1",
    "seq": 2,
    "maxOp": 9,
    "clock": {"9886a0e72e674092992f10e674860cf1": 2},
    "deps": [],
    "diffs": {
        "props": {
            "text": {
                "action": "multi-insert",
                "index":  2,
                "elemId": "7@9886a0e72e674092992f10e674860cf1",
                "values": ["a", "b", "c"]
              }
            }
        }
    }
}
echarles commented 3 years ago

We are retaining the nesting structure, where the updates to the text object appear nested inside rootObject.props.text['1@9886...'], because this structure simplifies frontends (Automerge 0.x has a flat patch structure without this nesting, and it is a pain to interpret correctly, which is why we moved to this nested structure for 1.x). However, the patch for the text object is much simplified, requiring only a single multi-insert action.

@ept I trust your previous experience with automerge to keep enough nesting and flexibility.

+1 to get this proposal implemented.

ept commented 3 years ago

Yeah, I realise the format looks a bit baroque. The patch format needs this structure in order to handle various edge cases, even if those edge cases are rare (for example, there may be more than one object under the "text" key if different users created conflicting text objects; and any one change may contain multiple insertions and deletions at several different places in a document). But one thing we could explore is to map the complicated patch structure into a simpler API for observers that is easier to consume. I have opened issue #314 for that, so that we can keep this issue focussed on the patch format for backend-frontend communication.

alexjg commented 3 years ago

I believe we need to add an additional type to the set of possible diff types (currently MapDiff, ListDiff, ValueDiff). Consider the following testcase (in the old patch format):

it('should apply updates inside list element conflicts', () => {
      const birds = uuid(), item1 = uuid(), item2 = uuid(), actor = uuid()
      const patch1 = {
        clock: {[actor]: 1},
        diffs: {objectId: '_root', type: 'map', props: {birds: {[actor]: {
          objectId: birds, type: 'list',
          edits: [{action: 'insert', index: 0}],
          props: {0: {
            actor1: {objectId: item1, type: 'map', props: {species: {actor1: {value: 'woodpecker'}}, numSeen: {actor1: {value: 1}}}},
            actor2: {objectId: item2, type: 'map', props: {species: {actor2: {value: 'lapwing'   }}, numSeen: {actor2: {value: 2}}}}
          }}
        }}}}
      }
      const patch2 = {
        clock: {[actor]: 2},
        diffs: {objectId: '_root', type: 'map', props: {birds: {[actor]: {
          objectId: birds, type: 'list', edits: [],
          props: {0: {
            actor1: {objectId: item1, type: 'map', props: {numSeen: {actor1: {value: 2}}}},
            actor2: {objectId: item2, type: 'map'}
          }}
        }}}}
      }
      const doc1 = Frontend.applyPatch(Frontend.init(), patch1)
      const doc2 = Frontend.applyPatch(doc1, patch2)
      assert.deepStrictEqual(doc1, {birds: [{species: 'lapwing', numSeen: 2}]})
      assert.deepStrictEqual(doc2, {birds: [{species: 'lapwing', numSeen: 2}]})
      assert.strictEqual(doc1.birds[0], doc2.birds[0])
      assert.deepStrictEqual(Frontend.getConflicts(doc1.birds, 0), {
        actor1: {species: 'woodpecker', numSeen: 1},
        actor2: {species: 'lapwing',    numSeen: 2}
      })
      assert.deepStrictEqual(Frontend.getConflicts(doc2.birds, 0), {
        actor1: {species: 'woodpecker', numSeen: 2},
        actor2: {species: 'lapwing',    numSeen: 2}
      })
    })

How should the update to the actor2 opid in patch2 be represented in the new patch format? I would suggest the addition of an UnchangeDiff which just has the objectId and type keys.

ept commented 3 years ago

Implemented by #328