automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

Provide built-in general-purpose compression #317

Closed by ept 3 years ago

ept commented 3 years ago

Although our compressed binary format (#253) already uses some compression techniques such as RLE, I have found that applying an additional layer of gzip compression may yield roughly another factor of 2 of compression on the text-editing benchmark (for the whole-document encoding). I suspect that for many applications it would make sense to apply such an additional layer of compression.
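
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using pako as an off-the-shelf gzip implementation. It assumes the binary-format preview, where Automerge.save returns a Uint8Array, and the toy document is only a stand-in for the text-editing benchmark:

```ts
import * as Automerge from "automerge";
import * as pako from "pako";

// Toy document standing in for the text-editing benchmark.
const doc = Automerge.from({ notes: "all work and no play ".repeat(200) });

// Size of the binary whole-document encoding, with and without gzip on top.
const encoded = Automerge.save(doc);   // Uint8Array in the binary-format preview
const gzipped = pako.gzip(encoded);
console.log(`raw: ${encoded.byteLength} bytes, gzipped: ${gzipped.byteLength} bytes`);
console.log(`extra compression: ${(encoded.byteLength / gzipped.byteLength).toFixed(2)}x`);
```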

The question is: should this compression be built into Automerge itself, or should it be handled in a separate I/O layer (when reading/writing documents on disk, or when transferring documents over a network)? There are precedents for both:

  * Image file formats such as PNG have compression built into the format itself, so every encoder and decoder handles it transparently.
  * HTML/CSS files are stored uncompressed, and compression is left to the I/O layer (for example, gzip at the HTTP transport level).

I am beginning to think that maybe the image file model would be a better approach for Automerge, since the binary encoding already requires a dedicated decoder anyway, and having all the compression built in would simplify the job for users who want to integrate Automerge with various storage and networking libraries. The downside is that we would have to bundle a compression codec with Automerge, increasing the size of the library for any users who don't need it. But I suspect it would be worth turning such compression on by default.

If we do this, we will have to decide on a compression algorithm to use: DEFLATE, LZ4, Snappy, bzip2, Zstandard, or something else? There is a bit of a speed-size trade-off here. Whatever algorithm we use, we should pick one that has good implementations in both JS and Rust. Adopting a widely-used algorithm would make it easier for people who want to write their own tooling around the Automerge data format.

Any thoughts?

savaki commented 3 years ago

I like it. If we follow Parquet's lead, then being able to choose the compression on a per-column basis could be really useful. Columns like value could benefit greatly from this, whereas columns like insert might not see much of a gain.

My vote would be for LZ4, for its fast decompression speed.

HerbCaudill commented 3 years ago

I am beginning to think that maybe the image file model would be a better approach for Automerge, since the binary encoding already requires a dedicated decoder anyway

Agree - the HTML/CSS analogy might have held when the contents were somewhat human-readable, but not with the new encoding.

nornagon commented 3 years ago

Might there be applications where the encoded size isn't something the developer wants to optimize for? Or situations where the developer wants to control when the compression happens?

e.g. it might make sense to hold a bunch of changes in memory uncompressed, and then at some later point bulk-compress them for transfer.

or you might only ever be interacting with data on a local machine or network, where data size isn't a major concern but CPU cycles are.

or perhaps you want to optimize for bundle size, because initial page-load time is more important than subsequent transfer size for your application.

My vote would be to have this as a separate middleware package, and not built into automerge itself.

If you are going to build it in, it would probably be a good idea to pick a compression algorithm that will eventually be part of the web platform available to JS, e.g. through https://wicg.github.io/compression/, which at the time of writing supports gzip and deflate.
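
For illustration, compressing and decompressing a byte array through that draft API would look roughly like this. This is a sketch only: it needs a runtime that already ships CompressionStream/DecompressionStream, and the function names are made up.

```ts
// Compress a byte array with the draft Compression Streams API.
// "deflate" here is the zlib format; "gzip" is also supported.
async function deflateBytes(input: Uint8Array): Promise<Uint8Array> {
  const compressed = new Blob([input]).stream()
    .pipeThrough(new CompressionStream("deflate"));
  return new Uint8Array(await new Response(compressed).arrayBuffer());
}

// The inverse, using DecompressionStream with the same format name.
async function inflateBytes(input: Uint8Array): Promise<Uint8Array> {
  const decompressed = new Blob([input]).stream()
    .pipeThrough(new DecompressionStream("deflate"));
  return new Uint8Array(await new Response(decompressed).arrayBuffer());
}
```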

ept commented 3 years ago

I did some experiments using pako for DEFLATE compression: adding it as a dependency increases the minified and gzipped JS bundle size by 13kB. Assuming a typical Automerge document gets a factor-of-2 compression, this means you don't need a very big document before the compression algorithm pays for itself in terms of data size. I hate adding dependencies, but this one seems like a clear net gain. In terms of speed, I expect that the cost of compression and decompression will remain a negligible fraction of the overall cost of loading or saving an Automerge document. (It's certainly negligible right now, since Automerge is so slow, and if we get Automerge itself to be so fast that DEFLATE is a significant fraction of the CPU time, that will be very good news indeed.) And Rust/wasm is very well suited to this sort of byte array manipulation, so compression should be even faster there.
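
For comparison, this is roughly the boilerplate that every application would otherwise have to write around save/load if compression stayed outside the library. The wrapper names are invented for illustration, and it again assumes the binary-format preview API; a built-in version would simply do this inside save and load:

```ts
import * as Automerge from "automerge";
import * as pako from "pako";

// Hypothetical wrappers, for illustration only: what DEFLATE-by-default
// would do transparently inside Automerge.save() and Automerge.load().
function saveCompressed<T>(doc: Automerge.Doc<T>): Uint8Array {
  return pako.deflate(Automerge.save(doc));
}

function loadCompressed<T>(data: Uint8Array): Automerge.Doc<T> {
  const decompressed = pako.inflate(data) as ReturnType<typeof Automerge.save>;
  return Automerge.load<T>(decompressed);
}
```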

For that reason I am tending towards enabling compression by default. Middleware or more config options just make the API more complicated, and I haven't seen compelling evidence that such configurability is really needed. On the other hand, I do regularly hear concerns about the size of Automerge documents, and compression will reduce that.

In terms of algorithm, I am tending towards DEFLATE, since it is very widely used (e.g. in gzip and in PNG image files), it's middle of the road in terms of trade-offs (more compact than LZ4, faster than bzip2), and it's widely supported (e.g. in the draft compression streams API that @nornagon pointed out). I don't think we should provide a choice of different compression algorithms, just like PNG doesn't give you a choice either — one algorithm is enough.

This leaves the question of what exactly to compress: one column at a time, or an entire container (= a single change, or an entire document)? There are reasons for wanting both:

If we want to allow both compressed containers and compressed columns, this suggests we need the following changes to the binary data format:

  1. We currently have two container types (individual change and whole document); we define a third container type that is an individual change but DEFLATE-compressed. (We don't need a container type for compressed whole document, since whole documents do compression at the column level instead.)
  2. We change the definition of a columnId to include one bit indicating whether or not the column data is DEFLATE-compressed. Currently the low 3 bits of the columnId indicate the column type, and the higher bits identify the column group. I suggest we shift the column group up by 1 bit and use the additional bit as a compression flag, as sketched below.
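
Concretely, encoding and decoding such a columnId might look like this (the constant and function names are illustrative, not the actual codec code):

```ts
// Proposed columnId layout:
//   bits 0-2  column type (unchanged)
//   bit  3    DEFLATE flag (new)
//   bits 4+   column group (shifted up by one bit)
const TYPE_BITS = 3;
const TYPE_MASK = (1 << TYPE_BITS) - 1; // 0b0111
const DEFLATE_FLAG = 1 << TYPE_BITS;    // 0b1000

function encodeColumnId(group: number, type: number, compressed: boolean): number {
  return (group << (TYPE_BITS + 1)) | (compressed ? DEFLATE_FLAG : 0) | type;
}

function decodeColumnId(columnId: number): { group: number; type: number; compressed: boolean } {
  return {
    type: columnId & TYPE_MASK,
    compressed: (columnId & DEFLATE_FLAG) !== 0,
    group: columnId >>> (TYPE_BITS + 1),
  };
}
```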

Since this is a breaking change to the data format, I would like to get it in before the 1.0-preview release, so that it gets lumped in with all the other breaking changes we have already made, avoiding the need for further breaking changes in the future.

nornagon commented 3 years ago

Thanks for the experiments @ept, I'm convinced by your arguments :)

ept commented 3 years ago

Implemented in #330.