automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/

What should we benchmark? #312

Open alexjg opened 3 years ago

alexjg commented 3 years ago

Benchmarks

I've been working on the performance of the Rust automerge implementation and as a result have been thinking about how we want to benchmark ourselves. I want to build a benchmarking suite for automerge so that we can be sure we are not introducing performance regressions as we iterate on the algorithms and APIs we are building. I'm raising this issue to get some consensus on what we should measure.

Everything here is a strawman, please criticise strongly.

Categories of metrics

What does being performant mean for an Automerge implementation? @ept, @orion and I jumped on a call and we tentatively identified three top-level categories of metrics we might care about.

The key difference between these categories is that user edits must be much faster than backend changes. It's probably acceptable if changes from other peers take a few hundred milliseconds to appear in the UI, but local changes, such as inserting text or making a freehand drawing, need to be imperceptibly fast.

For any metric there are two cases we care about: the single-threaded and the multithreaded case. The multithreaded case is in general where we expect to see better performance, but the single-threaded case is still important because we expect the typical developer experience to be to start with the single-threaded API (it is, after all, much simpler to avoid threads entirely) and then move to the multithreaded one if performance is not acceptable.

Scenarios

Based on these categories I've tried to come up with some scenarios which I think capture the kinds of things we want to be fast at. Quite a few are inspired by the crdt-benchmarks work done by Kevin Jahns of y.js. For each scenario we should measure three things:

  1. The time taken to apply each change to the frontend and generate a corresponding local change to send to the backend (if applicable).
  2. The time taken to apply each resulting patch to the frontend
  3. The total time from starting the scenario to the final patch being applied

Separating things this way allows us to measure the end-to-end latency, as well as to ensure that we never block the UI thread.
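To make these three measurement points concrete, here is a minimal sketch against the split frontend/backend JS API. It assumes the `Frontend` and `Backend` modules that automerge exports; exact signatures may vary between versions.

```js
const Automerge = require('automerge')
const { performance } = require('perf_hooks')
const { Frontend, Backend } = Automerge

let front = Frontend.init()
let back = Backend.init()

const t0 = performance.now()
// (1) apply the edit to the frontend, producing a local change request
const [front2, change] = Frontend.change(front, doc => { doc.count = 1 })
const t1 = performance.now()

// the backend converts the change into a patch (normally off the UI thread)
const [back2, patch] = Backend.applyLocalChange(back, change)
const t2 = performance.now()

// (2) apply the resulting patch back to the frontend
const front3 = Frontend.applyPatch(front2, patch)
const t3 = performance.now()

console.log(`(1) frontend change:   ${(t1 - t0).toFixed(3)} ms`)
console.log(`(2) patch application: ${(t3 - t2).toFixed(3)} ms`)
console.log(`(3) end to end:        ${(t3 - t0).toFixed(3)} ms`)
```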

I also think each scenario should be measured for both top-level and deeply nested keys, on the basis that traversing the document could easily end up being an expensive operation if we are not careful, so we should make sure we're testing that it doesn't.
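To illustrate the nested case, a benchmark might build the document K levels deep and run the same edit at the bottom. This is only a sketch: `K`, the field names, and the timing harness are arbitrary choices, not part of the proposal.

```js
const Automerge = require('automerge')
const { performance } = require('perf_hooks')

// Build a document nested K levels deep (K is arbitrary here)
const K = 20
let doc = Automerge.change(Automerge.init(), d => {
  let cur = d
  for (let i = 0; i < K; i++) {
    cur.child = {}
    cur = cur.child
  }
  cur.text = new Automerge.Text()
})

// Time a single-character insertion at the bottom of the tree, to expose
// any per-edit cost of traversing to a deeply nested key
const t0 = performance.now()
doc = Automerge.change(doc, d => {
  let cur = d
  for (let i = 0; i < K; i++) cur = cur.child
  cur.text.insertAt(0, 'x')
})
console.log(`insert at depth ${K}: ${(performance.now() - t0).toFixed(3)} ms`)
```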

| Scenario | Proxied user behaviour |
| --- | --- |
| Inserting N characters 1 at a time | User typing text into a document |
| Inserting N characters all at once | User pasting text into a document |
| Deleting N consecutive characters all at once | User selects and deletes some text |
| Inserting N characters at random points in a text object | User typing text |
| Inserting or deleting at a random point in a text object N times | User typing text |
| Appending a map to a list of maps N times | User adding a complex value to a document in a continuous fashion, such as when adding the (x,y) coordinates of a free hand drawn path |
| Inserting a map at random points in a list | User interacting with an application that manipulates complex data in a list, such as a todo list |
| Inserting or deleting at a random point in a list N times | User modifying data in a complex application such as a todo list |
| Inserting a timestamp N times | User generating timestamped data in a continuous fashion |
| Loading a compressed document with N operations in it | User loading a saved document |
| Loading a sequence of N changes, each with 1 operation in it | User loading a series of uncompressed, saved changes |
| Loading a sequence of N conflicting changes to the same property, each with M operations in it | Receiving changes from peers over the network |
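As a concrete illustration, the single-character typing scenario and the compressed-load scenario might look roughly like this with the plain JS API. This is a sketch: `N`, the field names, and the timing harness are all arbitrary.

```js
const Automerge = require('automerge')
const { performance } = require('perf_hooks')

const N = 10000

// "Inserting N characters 1 at a time": one change per keystroke
let doc = Automerge.from({ text: new Automerge.Text() })
let t0 = performance.now()
for (let i = 0; i < N; i++) {
  doc = Automerge.change(doc, d => d.text.insertAt(i, 'a'))
}
console.log(`${N} single-character inserts: ${(performance.now() - t0).toFixed(1)} ms`)

// "Loading a compressed document with N operations in it"
const saved = Automerge.save(doc)
t0 = performance.now()
const loaded = Automerge.load(saved)
console.log(`load of saved document: ${(performance.now() - t0).toFixed(1)} ms`)
```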

I would love to hear other peoples ideas about what we should be measuring!

ept commented 3 years ago

This looks great, thanks for putting it together! One more suggestion for a scenario would be "deleting N consecutive characters all at once". This is an interesting case because it requires sending the IDs of all the deleted characters.

Besides the CPU time taken, it might also be interesting to record the size of the message sent over the network for each user operation. However, this is unlikely to change much once we've frozen the data format.
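For instance (a sketch, assuming the binary change format in which `getLastLocalChange` returns a `Uint8Array`; with the older JSON change format one would measure the serialised length instead):

```js
const Automerge = require('automerge')

let doc = Automerge.from({ text: new Automerge.Text() })
doc = Automerge.change(doc, d => d.text.insertAt(0, 'h', 'i'))

// The size of the change as it would be sent over the network
const change = Automerge.getLastLocalChange(doc)
console.log(`last change on the wire: ${change.byteLength} bytes`)
```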

> adding the (x,y) coordinates of a free hand drawn path

For such an application, rather than representing it as a list of maps with {x, y} coordinates, I would suggest representing it as a flat list of numbers, alternating between x and y coordinates. That will probably be more compact and faster.
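Concretely, the two representations would look something like this (field names are illustrative):

```js
const Automerge = require('automerge')

let doc = Automerge.from({ pointMaps: [], flatCoords: [] })

// list of maps: one { x, y } object per sampled point
doc = Automerge.change(doc, d => d.pointMaps.push({ x: 10, y: 20 }))

// flat list of numbers alternating x and y, as suggested above
doc = Automerge.change(doc, d => d.flatCoords.push(10, 20))
```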

alexjg commented 3 years ago

Good shout on the delete benchmark, I'll add that.

Re. the (x,y) coordinates: I can see that representing them as a flat list would be more efficient. Part of the reason I put that benchmark in there was that I wanted to capture the idea of complex continuous actions of some kind (such as drawing, or recording sound). In general we can't predict what data people might need to store, so I figured a map was a good general-purpose benchmark. I can see an argument that we should not be optimising for this (probably quite niche) use case and should just advise users to use a flat-list approach; would that be your thinking?

I suppose my other thought about a flat list is that it doesn't really support concurrent modifications of the path very nicely. Although I guess inserting and removing pairs should still commute.
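A quick sketch of what I mean, assuming the top-level JS API (field names illustrative; the adjacency claim is my understanding of the list CRDT, where the second element of a pair is inserted directly after the first):

```js
const Automerge = require('automerge')

// Two peers concurrently append coordinate pairs to the same flat list
let a = Automerge.from({ coords: [] })
let b = Automerge.merge(Automerge.init(), a)

a = Automerge.change(a, d => d.coords.push(1, 2))
b = Automerge.change(b, d => d.coords.push(3, 4))

const merged = Automerge.merge(a, b)
// Both pairs survive the merge, and each x should stay adjacent to its y
console.log(merged.coords) // e.g. [1, 2, 3, 4] or [3, 4, 1, 2]
```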

ept commented 3 years ago

Fair point — happy to have a list of maps on the benchmark. Even if we advise against it, people will probably do it anyway, so we should try to make it fast 😀