Add real world concurrent editing traces

dmonad / crdt-benchmarks

A collection of CRDT benchmarks

Other

440 stars 29 forks source link

Add real world concurrent editing traces #20

Open josephg opened 1 year ago

josephg commented 1 year ago

I've found real-world editing traces to have very different performance profiles than simulated (random) editing traces. For diamond types, I've ended up writing a script to reverse engineer editing histories out of git repositories, which has given me some great & terrible editing traces.

(That last link is a graph of the history from Makefile from the git repository for Git itself).

These traces aren't perfect - I'm using diff to figure out what changes in each commit. But they're good enough for benchmarks, and I think a lot better than random traces.

Anyway, I'd love to share them and add them here but I'm not sure what format would suit! How can I contribute them? How do you think we could express concurrent traces in a way that we can apply to multiple algorithms?

dmonad commented 1 year ago

Hi @josephg,

Thanks for offering to share!

How do you currently store the editing trace?

For a linear editing trace (no concurrency), I suggest expressing it as an array of replacement operations (as I did in js-lib/b4-editing-trace.js).

For expressing concurrency, we need to extend the vocabulary. Expressing all kinds of conflicts is very complicated and also not necessary for benchmarking. For the sake of benchmarking conflicts, it would suffice to introduce a fork operation. This could look like this:

const ops = [
  [0, 0, 'Hello world'], // A replacement operation inserting the initial text
  { // fork the document into two different branches
    main: [[6, 1, 'W']], // Make "world" uppercase
    branchXYZ: [[11, 0, '!']] // Append "!"
  },src
  // We merge the forks into the main branch and discard the other branches
  [0, 0, 'this is applied to the previous main branch']
]

My idea would be to introduce a "fork operation" that is expressed as an object with named forks. The naming might be useful for future usage. For now, they could be random, or the branch names. We can only continue with a single fork, so there must always be a "main branch". If there is no main branch, we use the first named fork.

A fork accepts an array of operations. Once all operations are applied to the fork, the fork is merged with the main fork. To simulate real-world scenarios, it doesn't make sense to do a full merge (by exchanging state-vector, etc..). Instead, the generated updates from the fork should be applied to the main branch.

Once the forks are merged into the main branch, we continue with the main branch.

This notation also allows nesting forks.
The idea to name the forks comes from git and might help to explain the idea of the notation.
Naming the forks also gives us the ability to merge branches as an operation. In the future, we could introduce a ['merge', 'fork-A', 'fork-B'] operation that merges the operations from one branch into another to simulate more complicated scenarios.

ept commented 1 year ago

The old paper-editing trace we've all been using for benchmarking actually originally had a bit of concurrency in it as well – I only flattened out that concurrency in the edit-by-index version to make it easier to use. If we can agree on a format for representing concurrency, I can try pulling the concurrent trace out of the original files.

Moreover, a collaborator and I have also been writing a new paper with a VSCode tracker to generate a more realistic editing trace. That also has some concurrency, I think.

The fork operation you propose would work for simple branching patterns, but not more complex ones (e.g. one user first merges branch A and then branch B, while another user concurrently merges those two branches in the other order). But if we don't actually have complex merging patterns in our benchmark data, then that may not matter.

I would also suggest that we make a difference between characters that are typed one-by-one, and character sequences that are inserted all at once (e.g. as a result of a paste). Moreover, what do you think about including a wall-clock timestamp on each operation? Automerge stores such timestamps (with 1-second resolution) for history visualisation purposes; do Yjs/Diamond also have timestamps? The timestamps have no effect on the CRDT logic, but they affect the size of the final file.

josephg commented 1 year ago

Hey guys!

( @dmonad )

How do you currently store the editing trace?

Right now I'm immediately converting it into a diamond types file and then loading & replaying that; but thats an unnecessarily complex file format.

The format I'm using at the moment explicitly describes the graph. It looks sort of like this:

Causal graph:

Operation 0: Parent (ROOT), Agent: 'a'.
- Operations 1..1000 operations are linear (each has a single parent, which is the immediately preceding item.)
Operation 1000: Parents: [890], agent 'b'
- Operations 1001..2000 are linear
Operation 2000: Parents [999, 1999] (merge), agent 'a'
- Then operations 2001..2500 are linear

And so on. The file format for diamond types has no concept of peers or nodes. (Then I also store the operations themselves in the file in a big list).

But the logic for traversing this graph might be tricky with yjs or automerge. How would you implement a fork operation?

I agree with @ept 's comment about forks not necessarily being able to capture all the complex paths. But we could do something like this:

const ops = {
  'xxx': {
    parents: [], // ROOT
    ops: [...], // A list of replace operations. (See below)
  },
  'yyy': { parents: ['xxx'], ops: [...] },
  'zzz': { parents: ['xxx'], ops: [...] },
  'qqq': { parents: ['yyy', 'zzz'], ops: [...] },
}
const finalState = ['qqq']

That would express the graph structure, and I could generate that directly (and pretty easily) from the traces I have. How hard would it be to replay something like that in yjs?

EDIT: ... Or maybe a list would be better? The goal is to recreate / replay the state at the bottom of the file.

const ops = [
  { id: 'xxx', parents: [], ops: [...]},
  { id: 'yyy', parents: ['xxx'], ops: [...] },
  { id: 'zzz', { parents: ['xxx'], ops: [...] },
  { id: 'qqq', parents: ['yyy', 'zzz'], ops: [...] },
}

For a linear editing trace (no concurrency), I suggest expressing it as an array of replacement operations (as I did in js-lib/b4-editing-trace.js).

Yeah thats what I'm doing for my other linear traces in my crdt-benchmarks repository. There's some fluff like timestamps, but the heart of those files looks like this:

> require('./seph-blog1.json').txns.map(t => t.patches[0]).slice(5000)
[
  [ 2572, 0, 'n' ], [ 2573, 0, 'g' ], [ 2574, 0, ' ' ], [ 2566, 9, 'r' ],
  [ 2567, 0, 'e' ], [ 2568, 0, 'a' ], [ 2569, 0, 'd' ], [ 2570, 0, 'i' ],
  [ 2571, 0, 'n' ], [ 2572, 0, 'g' ], [ 2573, 0, ' ' ], [ 2574, 0, 'p' ],
  [ 2575, 0, 'a' ], [ 2576, 0, 'p' ], [ 2577, 0, 'e' ], [ 2578, 0, 'r' ],
  ...

(They're recorded from real editing sessions. They aren't all single character edits even though it looks like it from this sample).

( @ept )

The fork operation you propose would work for simple branching patterns, but not more complex ones (e.g. one user first merges branch A and then branch B, while another user concurrently merges those two branches in the other order). But if we don't actually have complex merging patterns in our benchmark data, then that may not matter.

The data I'm scraping from git certainly has cases like that. Here's a simple editing history from a single file in nodejs. I'd want to capture and replay this trace accurately.

I would also suggest that we make a difference between characters that are typed one-by-one, and character sequences that are inserted all at once (e.g. as a result of a paste).

For what use case?

I can't think of any time I would have the system treat a paste event or a typing event differently. Maybe there would be some cases where you'd want to have special optimizations for large inserts - but in that case, surely a heuristic approach based on the insert length would be what you'd want anyway. A single character paste event seems functionally indistinuishable from a typing event to me.

And obviously, large insert events can always be broken up into a series of single character inserts if thats what your system expects.

Moreover, what do you think about including a wall-clock timestamp on each operation?

I'd be open to this. I have timestamps in the data sets I've recorded myself, with 1 second granularity. But diamond types doesn't currently store timestamps at all. I just can't think of many uses for them, outside of "what did the document look like at (time)?". And for that I'd consider storing timestamps with like, 10 minute granularity or something. I think it makes sense to have a conservative stance wrt. the data we store about user's habits.

josephg commented 1 year ago

Looping back on this, I've made the first pass of an export tool. Love some thoughts.

$ dt export example.dt

{
  "startContent": "",
  "endContent": "yoooooooo\n",
  "txns": [
    {
      "id": 0,
      "parents": [],
      "numChildren": 1,
      "agent": "GdLDwvf0MIfN",
      "ops": [
        [ 0, 0, "hi there\n" ]
      ]
    },
    {
      "id": 1,
      "parents": [ 0 ],
      "numChildren": 1,
      "agent": "r2GLJfHRMNbx",
      "ops": [
        [ 0, 8, "" ],
        [ 0, 0, "yoooo" ]
      ]
    },
    {
      "id": 2,
      "parents": [ 1 ],
      "numChildren": 0,
      "agent": "gsWTmOk9JJOI",
      "ops": [
        [ 5, 0, "oooo" ]
      ]
    }
  ]
}

Does that make sense?

The parents field can, of course, have multiple elements. If the parents field is empty, that means the change is parented off the root.
The numChildren field is used to implement @dmonad 's suggestion of marking forks. If an item has more than 1 child, the state (obviously) needs to be forked for its children.
The id field is literally just the index of an item in the list. But importantly, those IDs are referenced in subsequent items' parents list.
Here I'm using Martin's standard of [pos, del_len, inserted_string] for the patch content itself.

I'm a bit worried that in complex examples, the operation order will be dependent on the algorithm. It might be best to just stick to examples where there aren't any concurrent inserts at the same place. (So the resulting document order is well defined).

I have a script to extract editing histories from any file in any git repository. Its not neat, but it works. If anyone has better sources of data, happy to use that instead.

Some traces:

https://home.seph.codes/public/example_trace.json (the trace above)
https://home.seph.codes/public/node_nodecc.json (the most edited file in nodejs: node.cc)
https://home.seph.codes/public/git_makefile.json (the most edited file in the git repo for git: git's Makefile)

I'll put these files (and some others??) on my crdt-benchmarks github repository. But yeah, I'd really appreciate a glance before I do, to make sure the format doesn't look too crazy.

josephg commented 1 year ago

Ping @alexjg

josephg commented 1 year ago

I've spent the last couple of hours writing a little script to run the benchmarks against automerge, directly in rust (to make it a fair comparison). I assume I'm using automerge correctly since I get the right output after merging all the changes.

I was worried - for some data sets, the resulting document order will be dependent on the CRDT algorithm so I checked if its ever the case that multiple inserts land at the same place in the document. The node_nodecc dataset is not sensitive to CRDT ordering, but git_makefile is. If we make some testing data for benchmarks, I might only publish "ordering-insensitive" editing traces to keep everything simple.

Here's my automerge generation & benchmarking script: https://github.com/josephg/automerge-converter

I'd love to do the same treatment for yjs/yrs too, and get some more traces from different git repositories so we have a representative sample of data.

@dmonad - do you have any thoughts about running benchmarks directly in rust for these algorithms? Hows yrs going?

streamich commented 1 year ago

Does that make sense?

@josephg it makes sense, but only for diamond-types. As diamond-types is a library designed to repay those sort of "causal tree" operations.

However, a list CRDT like RGA (Automerge) or YATA (Y.js) usually is only concerned with keeping the latest version up to date, so they have no ability to replay a trace like that, essentially, for RGA or YATA CRDT to replay a trace like that, they would need to re-implement diamond-types.

It looks to me for the trace to be shareable and reusable by different libraries the patches should be in a form to serve well all algorithms.

Also, we may want to ask a question: what are we benchmarking here? It looks to me that:

No concurrency traces effectively simulate the local operations, where CRDTs insert/delete using positions: (1) insertAt(position, text), and (2) deleteAt(position, length).
With this concurrent editing trace proposal we would simulate remote operations, where CRDTs insert/delete by some reference ID, say: (1) RGA - insert(leftRef, text); (2) YATA - insert(leftRef, rightRef, text); (3) "causal tree" (Diamond Types) - insert(...parentRefs, pos, text).

So, the concurrent trace, IMHO, stresses the remote insert/delete code path of CRDTs. But the remote operations are different in all CRDTs. So, maybe the trace patches should contain all possible metadata needed for all different CRDTs. When I say "all", obviously that is not feasible, but RGA is the most popular algorithm, so that could be included. Also, Y.js is the most popular library, so maybe YATA can be included.

Effectively, what operations need to contain is what the remote site/server would expect to receive:

RGA: (1) op ID; (1) ref ID; (2) text
YATA: (1) op ID; (2) left ref ID; (3) right ref ID; (4) text
CT: (1) op ID; (2) parent IDs; (3) position offset; (4) text

In all three approaches maybe the IDs can be normalized to a 2-tuple:

[agent: number | string, sequence: number]

However, I'm not sure how much they can be reused across approaches. As they all semantically differ. For example, "agent" is a number in Y.js but a string in Automerge and Diamond Types. Also, the "sequence" part of the ID is incremented differently in all three approaches. So, it looks like even the operation ID cannot be shared across the approaches.

Really the only part that can be shared is the "text" content of the operation. Which leads me to thinking: in that case is it even worth to have a common format. Maybe better to transform the trace into the native format of specific library, and see how that library performs on its native format.

But if there was a common format, maybe it could look like something below:

{
  "startContent": "",
  "endContent": "yoooooooo\n",
  "txns": [
    {
      type: "insert",
      content": "hi there\n",
      ct: {
        id: ["GdLDwvf0MIfN", 0],
        parents: [],
        position: 0,
      },
      rga: {
        id: [123, 1],
        ref: [123, 0],
      },
      yata: {
        id: [123, 1],
        leftRef: [123, 0],
        rightRef: [123, 0],
      },
    },
    {
      type: "delete",
      length: 8,
      ct: {
        id: ["r2GLJfHRMNbx", 1],
        parents: [
          ["GdLDwvf0MIfN", 0]
         ],
        position: 0,
      },
      rga: {
        id: [456, 2],
        ref: [123, 1],
      },
      yata: {
        id: [456, 0], // 0 here, as in YATA, unlike RGA, you don't need to take max() of all seen ops

        // Actually, Y.js does not even implement operations for deletes.
        // It acts as state-based-CRDT for deletes, so not sure what to put here.
        ref: [123, 1]
      },
    },
    // and so on ....
  ]
}

Another way to think about this concurrent trace is that: it is just like the non-concurrent trace, but with the exception that the position is replaced by some complex object with "refs", which is different for each algorithm, basically:

{
  "startContent": "",
  "endContent": "yoooooooo\n",
  "txns": [
    {
      type: "insert",
      content": "hi there\n",
      position: {   // <--- In non-concurrent trace this would be just a number
        ot: {
          // the correct position as from perspective of every agent in the trace
          "GdLDwvf0MIfN": { lastKnownOp: 0, position: 0 },
          "r2GLJfHRMNbx": { lastKnownOp: 0, position: 0 },
          // all agents ...
        },
        ct: {
           // refs and position...
        },
        rga: {
           // refs...
        },
        yata: {
           // refs...
        },
      },
    }

    // and so on ....
  ]
}

The above also incorporates OT approach: basically the correct OT position is pre-computed for every agent's perspective.

This also highlights the similarity that Diamond Types has with OT: (1) both approaches need to encode somehow the "last know op" or "parents"; and then (2) both approaches also provide the position "guess" (i.e. local position) and then recompute the "correct" position.

An then, the RGA could actually be thought as being equivalent to CT/OT, with the difference that in RGA there are no absolute positions in a string, i.e. strings don't start from 0, but are referenced relative to some character. So, RGA also provides a "guess" of the position, it is just always implicitly 0—immediately after the ref.

{
  "txns": [
    {
      type: "insert",
      content": "foo",
      position: {
        rga: {
          parent: [123, 1], // = ref

          // Implicit "guess" of RELATIVE position in relation to "parent"
          // position: 0,

        },
      }
    }
  }
}

To sum up, OT and CT do not assign a unique ID to every character, so position is absolute and has to be recomputed by looking up the history to the last known "parent" and the absolute local position needs to be transformed to become absolute remote position.

On the other hand, RGA and YATA assign a unique ID to every character, so position is relative to the "parent", where "parent" is a character itself. Hence, the local relative position "guess", which is always 0, needs to be adjusted to become the remote relative position with respect to the "parent" character.

So, if RGA/YATA would not store their characters as a linked list, but instead just stored the operations like OT/CT; for example RGA operation could look like: [ref, relativePositionFromRef]. Then on every insert into RGA it would need to look-up the history to the right "parents" and adjust the relativePositionFromRef of all of those operations, similar to what OT/CT is doing. So, in a way the assignment of unique IDs, plus, the trick of storing those IDs/characters in a linked list form allows RGA/YATA to avoid the operation transformation step, which OT/CT has to do.

Then recursively resolving all the RGA operations in the form [ref, relativePositionFromRef] until the beginning of history, such that relativePositionFromRef finally becomes an absolute position should somehow be similar to what Diamond Types is doing.

josephg commented 1 year ago

@streamich

However, a list CRDT like RGA (Automerge) or YATA (Y.js) usually is only concerned with keeping the latest version up to date, so they have no ability to replay a trace like that, essentially, for RGA or YATA CRDT to replay a trace like that, they would need to re-implement diamond-types.

Really the only part that can be shared is the "text" content of the operation. Which leads me to thinking: in that case is it even worth to have a common format. Maybe better to transform the trace into the native format of specific library, and see how that library performs on its native format.

Yeah looks like we totally agree here - thats absolutely the goal! I don't want other algorithms to need to implement diamond types - but as I said above, I've managed to write a simple ~100 line script to convert data from this format into automerge or yjs's native formats anyway. The goal isn't to benchmark that conversion process. But once the data is converted, we can use the converted trace to benchmark how well automerge / yjs / sharejs / whatever else can load, save and merge the operations it contains.

Re: What are we trying to benchmark, the idea is to compare how well the different collaborative editing systems would handle the same collaborative editing session.

Ideally I want the following metrics:

How fast can it process / interpret local operations
How fast can it process remote operations streamed from other peers
How much disk space overhead does the system need?
How much RAM does the system consume at runtime?

Comparing remote operations seems easy to compare to me. Obviously each system will do different work internally, but it should be pretty easy to arrange data like this into a series of "network messages" (patches) received by a passive peer that merges them all together. Measuring how fast a given system can ingest a series of remote messages and then output the resulting document state seems like a reasonable, useful and fair benchmark.

Comparing local operations seems like a trickier case to me, but we can probably take an editing trace and run a simulation of one of the peers. The simulation could run a carefully prepared trace which includes the following actions:

Receive "remote" operations when other agents make changes
Generate local changes when the local peer modified the document
Generate and serialize patches containing local changes, for consumption by the (simulated) remote peers.

Total size on disk and memory usage are self explanatory.

Given that we can convert any of these traces to any of those formats (yjs, rga, yata, etc) using a simple script, I don't think its worth adding the additional ID fields to these editing traces. - Since as you said, those IDs are quite implementation-specific and idiosyncratic, and we would need to pick ahead of time which algorithms to support. For an example of idiosyncracies, yjs's sequence numbers aren't consumed by deletes. Whereas automerge does consume sequence numbers in deletes. Yjs uses an integer for agent ID. Diamond types uses a string and automerge uses a fixed sized array of bytes. I'd much rather leave all of these details to the algorithms we're testing.

Re: agent IDs, I'm not sure that the format should include agent names. If we only share traces which merge the same regardless of algorithm (in practice, this means no concurrent inserts at the same location) then the agent ID stops being useful. But it would come in handy if we want to run simulations from the point of view of a single peer (since we can use the agent IDs to filter the trace).

streamich commented 1 year ago

Generating CRDT-specific traces out of Diamond Types trace sounds good, if indeed those scripts are simple. (Also, would be nice to have a generic RGA converter script, not specific to Automerge format, as RGA is the most popular algorithm.)

I'm not sure I understand why you think that local operations are trickier to compare. I thought the existing editing traces are already good enough to compare the local operations.

BTW, some CRDT implementations might not even have a dedicated local operation path. One way to think about local insert/delete operations is that they are just performance optimizations over the remote insert/delete.

For example, in my implementation I don't have a dedicated local delete operation, instead local delete immediately creates the "parents" and executes locally the remote delete. Also, I didn't have a local insert for a long time, only added it later as an optimization.

josephg commented 1 year ago

Generating CRDT-specific traces out of Diamond Types trace sounds good, if indeed those scripts are simple.

See for yourself!. Its ~100 lines of code, not including JSON parsing.

(Also, would be nice to have a generic RGA converter script, not specific to Automerge format, as RGA is the most popular algorithm.)

What has convinced you that RGA is the most popular algorithm? If you think a generic RGA converter is useful, you're welcome to write a tool to do this conversion yourself. The above algorithm can be used to generate the data for any sequence CRDT a positional editing trace, so long as that algorithm implements a fork method, a localOp(...) method and a merge(crdt1, crdt2) method. So it works with automerge, yjs, and the traditional style sequence CRDT I built in diamond types before moving to my new positional approach. Hopefully it can be made to work with your CRDT too!

Converting from RGA to other formats sounds harder to me than simply replaying a positional editing trace, but I haven't thought about it. Do you think that would be easier? How would you convert RGA to yjs's format, for example?

I'm not sure I understand why you think that local operations are trickier to compare. I thought the existing editing traces are already good enough to compare the local operations.

My partner and I recorded an editing trace the other night. I edited my little wiki to add 1s of latency, and then both of us typed up notes about a tv show we watched at the same time, into a shared document. The resulting document is 4000 words long, totalling just 21kb of text. But the causal graph (even run-length encoded) is this absolute unit of a graph:

(NOTE: this SVG is 4mb in size. You will need to zoom out and/or scroll right to see the graph): https://home.seph.codes/public/friendsforever.svg

This data set is very different from single user editing traces. I expect those differences will manifest in a lot of weird ways for different CRDTs. As an example, the optimizations I've made to the diamond types file format work much less well on this editing trace than on the single user traces I have. I suspect yjs will store this data set more compactly than I am managing to do.

Another reason to simulate an editing history using interleaved local and remote operations is that it seems more "honest". I can imagine some CRDT implementations simply buffering local changes and applying / merging them all at once. A test which mixes remote changes and local changes seems like it would even the playing field.

One way to think about local insert/delete operations is that they are just performance optimizations over the remote insert/delete.

I think about local insert / deletes as being different because the local peer first needs to convert the edit from a numeric position (from the editor) into the CRDT's internal data format. Eg, the operation Delete pos 1000 needs to be converted to Delete item with ID (xx,yy). Are there CRDTs for which that is not true?

I suspect this conversion may be fast / trivial for some CRDT implementations, and slow for others depending on the data structures used. As you said, some implementations will optimize this and some will not.

streamich commented 1 year ago

What has convinced you that RGA is the most popular algorithm?

It is the most cited and the best know/researched algorithm that provides good (O(log n)) performance for all operations.

How would you convert RGA to yjs's format?

I didn't mean to convert from RGA to Y.js; rather was just saying that a converter from positional trace to RGA would be really nice to have. Would be happy to write it, the only problem is I don't know Rust, so not sure how well that will go.

I think about local insert / deletes as being different because the local peer first needs to convert the edit from a numeric position

That is true, I was just saying that that conversion of position to ID can be done in 2 different ways:

One way is to have a dedicated insertAt(position, text) local method: it would insert the text first and then return the ID of "ref". It is an optimization, because insertAt can already be achieved by other operations that a list CRDT will provide.
Another way is to execute the local insert is to use the other methods that the list CRDT will need to implement anyways, those two are: (1) remote insert insert(ref, text); and (2) find operation findRefAtPos(position). Basically, the local insert then becomes insert(findRefAtPos(position), text). This approach is nice because there is only one code path for inserts (to test and optimize); but it is worse for performance, as "find" + "insert" will need to do double the work.

As an example, the optimizations I've made to the diamond types file format work much less well on this editing trace than on the single user traces I have.

Specifically which optimizations do you have in mind?

I suspect yjs will store this data set more compactly than I am managing to do.

Why?

Storage is one thing, I'm wondering how will Y.js handle performance-wise for a concurrent trace. Y.js uses a cache of 80 items to effectively search in its linked list, as far as I understand. I'm wondering if given some number of concurrent users and some syncrhonization latency, where many cache entries are consumed by each user, how would it impact that caching optimization.

Lets simplify, say each caching entry is used only by one user, and there are 81 concurrent users, but only 80 cache slots; how would it impact Y.js performance?

But then Y.js has a benchmarking scenario for random inserts in its bench suite, so I guess it is not going to be slower than in that benchmark. And it looks like random inserts and sequential ones are performed at about the same speed (from the benchmark results).

Another reason to simulate an editing history using interleaved local and remote operations is that it seems more "honest".

Agreed, I think it is not just more "honest", but more relevant. Because:

Benchmark of remote operations better shows the implementation would perform on the server, where you actually care about the performance.
It also shows how fast the implementation can "replay" all the operations, this is useful in "time travel" UI scenarios.
It also shows how well an implementation would perform if the storage was pure operation based, i.e. when the latest state is always replayed from scratch when the document is loaded.
1. Or even, if it is not purely operation based storage; if the snapshots of the state are created at some cadence, but the tip of changes is just stored as operations.

BTW, I really appreciate that you have collected those traces and created that vscode-tracker plugin.

josephg commented 1 year ago

It is the most cited and the best know/researched algorithm that provides good (O(log n)) performance for all operations.

RGA implemented as it’s written in the paper (as a tree) isn’t O(log n) because the tree is unbalanced. Inserts approximate O(n) time in practice.

Re: optimisations, the DT file format stores the distance travelled from each edit to the next edit. (Positive or negative). When two users are typing at the same time at different locations, this optimisation doesn’t work as well. DT and automerge also store the full causal graph - which is usually very small and simple. But in a trace like this, it’s quite large.

If you’re explicitly looking for unusually pathological traces which will make systems perform badly, traces where millions of edits concurrently type at the same location will also perform very badly with all of these systems. I’m not that interested in traces like that because it’s unrealistic that it shows up by chance. And we have bigger problems at the moment if users are malicious. (See Martin’s BFT paper).

Anyway, this is all a big sidebar for the issue here. Sounds like you’re more or less happy with the format. That’s good news.

josephg commented 1 year ago

Alright, final proposal after a lot of unrelated conversation:

For multi user editing traces, the dataset is a JSON file containing all the operations made by users. The dataset has the following fields:

startContent: The (string) contents of the document before users edited it. For all the data sets I publish, this will be the empty string.
endContent: The (string) contents of the document after all changes have been applied. This exists mostly as a checksum / validation that the trace has been replayed correctly.
txns: A list of "transaction" objects. Each object describes 1 or more operations, made by some peer, at some point in time. (Caveat: the last change may contain no operations)

txns have the following fields:

id: This is an integer, matching the position of this item in the array. Might take this out
parents: This describes other items in the txn list that this transaction happened after.
- If the list contains 2 or more items, the state named by those items is merged before the operations contained in this transaction are applied. The list of parents must always be minimal (its invalid to reference two items which aren't mutually concurrent).
- If the parents list is empty, the item comes after "root" (the start of time). In the data sets I publish, I think I'll try to make sure there will be exactly 1 item (idx 0) which has no parents.
- All of an item's parents must come earlier in the txn list. Ie, the order of txns conforms to the partial order of the implied causal graph.
patches: A list of operations, in the same format as the automerge-perf data set. Each item in this array is a triple of (position, num characters deleted, inserted string). The position names the unicode codepoint offset into the document. (I may publish ascii-only variants of editing traces as well to make this easier to interpret).
agent: This is an integer ID of the agent that made this change. All changes by a given agent ID must be fully ordered. (ie, its invalid for two changes with the same agent ID to be concurrent).
numChildren: The number of other (later) txns which contain this txn. I'm on the fence about whether to include this. Its useful when replaying the traces, but it can be trivially calculated from the data set.

Misc rules, some repeated from above:

To make the trace replayable by multiple CRDTs with different ordering rules, its invalid for 2 changes to happen at the same location in the document. This obviously makes these data sets unsuitable for testing, but I think its rare enough for edits to happen at the same place that it shouldn't really matter; so long as the behaviour when concurrent edits hit the same location isn't pathological.
The first transaction must have no parents. The first transaction must be the only transaction with no parents.
The last transaction must transitively merge / contain all other transactions in the data set.
All transactions except the last must have a non-empty list of operations.
The parents must all be mutually concurrent
All transactions from a single user agent must be fully ordered

Open questions:

Is it worth keeping id in there if its just the index in the array? I'm leaning towards no.
Is it worth keeping numChildren? ... I'm leaning toward yes because it means the simple conversion algorithms can run with 1 pass.
patches or ops? I'm leaning toward patches to match the other benchmark data sets in crdt-benchmarks
For consistency, should we also include timestamps in the data sets?

Example data set:

{
  "startContent": "",
  "endContent": "yoooooooo\n",
  "txns": [
    {
      "id": 0,
      "parents": [],
      "numChildren": 1,
      "agent": 0,
      "patches": [
        [ 0, 0, "hi there\n" ]
      ]
    },
    {
      "id": 1,
      "parents": [ 0 ],
      "numChildren": 1,
      "agent": 0,
      "patches": [
        [ 0, 8, "" ],
        [ 0, 0, "yoooo" ]
      ]
    },
    {
      "id": 2,
      "parents": [ 1 ],
      "numChildren": 0,
      "agent": 1,
      "patches": [
        [ 5, 0, "oooo" ]
      ]
    }
  ]
}

Unless there's any comments, I'll start publishing some data sets in this format.

streamich commented 1 year ago

Lets go! I think id will probably not going to be useful, but numChildren is very handy, so indeed the forks/indices can be allocated in one pass.

josephg commented 1 year ago

I agree. I'll:

Remove id
Keep numChildren
Call it patches for consistency
Remove startContent, since its always the empty string anyway.
And add a numAgents field in the top level JSON object, just in case thats useful.
Add a "kind": "concurrent" field to differentiate these traces from sequential editing traces

josephg commented 1 year ago

The first concurrent editing trace is published here, along with a spec: https://github.com/josephg/editing-traces/tree/master/concurrent_traces

I've also uploaded a linearized (flattened) copy of this editing trace into the sequential_traces folder in that repo.

The next step is integrating this stuff into yjs. And I'd love to make some more traces!

ept commented 1 year ago

Sorry for the delayed reply, I just caught up on this thread and realised I never answered this question:

I would also suggest that we make a difference between characters that are typed one-by-one, and character sequences that are inserted all at once (e.g. as a result of a paste).

For what use case? I can't think of any time I would have the system treat a paste event or a typing event differently.

Sorry, I was unclear. I don't care either whether a particular insertion op was the result of typing a character or performing a paste. But if some characters were typed one by one, I would like them to be listed as individual ops in the trace (not aggregated into a longer string), ideally with a wall-clock timestamp on every op (or patch, I don't mind what we call it). I wasn't sure whether this is what you're already planning anyway. There are several reasons for this request:

If we include a timestamp for every edit, a paste has lots of characters all inserted at the same timestamp, while a char-by-char editing session has a different timestamp for each char (assuming timestamps with 1 second resolution or finer). I'm interested in preserving timestamps for the purpose of visualising editing history, and Automerge captures timestamps for that reason (even though admittedly we haven't yet exposed the timing information in the API or any UI, but it's something we want to do). Of course some implementations might not want to record timestamps, or record them only at coarser resolution, and that's fine, but if we have timing information available it would be nice to include it in the traces.
If we want to measure the network bandwidth used over the course of a simulated real-time collaboration session, we need to know whether a bunch of characters were sent to collaborators as a bunch of individual character-insertion messages over time, or as a single message containing a pasted string. Bandwidth may not seem like the most important metric to measure since today's internet connections have plenty of it, but we've been looking at running collaborative editors over anonymity networks such as Nym to support high-risk users such as sensitive journalistic investigations, and these networks are very bandwidth-constrained and also sensitive to the exact timing of messages. Traces with timing information for each user operation would allow us to experiment with such anonymous collaboration use cases.
Finally, to measure the performance of replaying a backlog of remote ops that were sent in real time as the user performed those apps, threre may also be a performance difference between single-char ops and multi-char pastes.

Otherwise I'm fine with the format you propose. Thanks for doing this, and I will also aim to contribute some traces!

josephg commented 1 year ago

@ept Thanks for the feedback - that makes a lot of sense and I agree. I'll add a timestamp field to transaction objects in the spec, and a dummy timestamp to the dataset I've recorded already. Getting some more multi user traces would be great if you can think of a reasonable way to record them. And @streamich is in the process of preparing & contributing some traces too.

Unfortunately recording editing traces which accurately split each editing event using diamond types is a bit tricky because I actively merge consecutive edits together everywhere I can. I might need to rethink my recording strategy before I record anything else. (Or, maybe add a sidechannel to store edit groupings & timestamps for trace recording and reconstruct that information later.)

I'd also like to use my git extractor tool to get some traces from git logs. But unfortunately, it turns out to be extremely common for traces extracted from git to contain concurrent inserts at the same location in the file. (And I'm trying to avoid that here so we don't get problems due to algorithms making different ordering choices). Unless we get lucky, my git extracting tool will probably not generate a lot of useful traces for cross-system benchmarking.