automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

Provide built-in general-purpose compression #317

Closed by ept 3 years ago

ept commented 3 years ago

Although our compressed binary format (#253) already uses some compression techniques such as RLE, I have found that applying an additional layer of gzip compression may yield roughly another factor of 2 of compression on the text-editing benchmark (for the whole-document encoding). I suspect that for many applications it would make sense to apply such an additional layer of compression.
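
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using pako as an off-the-shelf gzip implementation. It assumes the binary-format preview, where Automerge.save returns a Uint8Array, and the toy document is only a stand-in for the text-editing benchmark:

```ts
import * as Automerge from "automerge";
import * as pako from "pako";

// Toy document standing in for the text-editing benchmark.
const doc = Automerge.from({ notes: "all work and no play ".repeat(200) });

// Size of the binary whole-document encoding, with and without gzip on top.
const encoded = Automerge.save(doc);   // Uint8Array in the binary-format preview
const gzipped = pako.gzip(encoded);
console.log(`raw: ${encoded.byteLength} bytes, gzipped: ${gzipped.byteLength} bytes`);
console.log(`extra compression: ${(encoded.byteLength / gzipped.byteLength).toFixed(2)}x`);
```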

The question is: should this compression be built into Automerge itself, or should it be handled in a separate I/O layer (when reading/writing documents on disk, or when transferring documents over a network)? There are precedents for both:

  * Image file formats such as PNG have compression built into the format itself, so every encoder and decoder handles it transparently.
  * HTML/CSS files are stored uncompressed, and compression is left to the I/O layer (for example, gzip at the HTTP transport level).

I am beginning to think that maybe the image file model would be a better approach for Automerge, since the binary encoding already requires a dedicated decoder anyway, and having all the compression built in would simplify the job for users who want to integrate Automerge with various storage and networking libraries. The downside is that we would have to bundle a compression codec with Automerge, increasing the size of the library for any users who don't need it. But I suspect it would be worth turning such compression on by default.

If we do this, we will have to decide on a compression algorithm to use: DEFLATE, LZ4, Snappy, bzip2, Zstandard, or something else? There is a bit of a speed-size trade-off here. Whatever algorithm we use, we should pick one that has good implementations in both JS and Rust. Adopting a widely-used algorithm would make it easier for people who want to write their own tooling around the Automerge data format.

Any thoughts?

savaki commented 3 years ago

I like it. If we follow Parquet's lead, then being able to choose the compression on a per-column basis could be really useful. Columns like value could benefit greatly from this, whereas columns like insert might not see much of a gain.

My vote would be for LZ4, for its fast decompression speed.

HerbCaudill commented 3 years ago

I am beginning to think that maybe the image file model would be a better approach for Automerge, since the binary encoding already requires a dedicated decoder anyway

Agree - the HTML/CSS analogy might have held when the contents were somewhat human-readable, but not with the new encoding.

nornagon commented 3 years ago

Might there be applications where the encoded size isn't something the developer wants to optimize for? Or situations where the developer wants to control when the compression happens?

e.g. it might make sense to hold a bunch of changes in memory uncompressed, and then at some later point bulk-compress them for transfer.

or you might only ever be interacting with data on a local machine or network, where data size isn't a major concern but CPU cycles are.

or perhaps you want to optimize for bundle size, because initial page-load time is more important than subsequent transfer size for your application.

My vote would be to have this as a separate middleware package, and not built into automerge itself.

If you are going to build it in, it would probably be a good idea to pick a compression algorithm that will eventually be part of the web platform available to JS, e.g. through https://wicg.github.io/compression/, which at the time of writing supports gzip and deflate.
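
For illustration, compressing and decompressing a byte array through that draft API would look roughly like this. This is a sketch only: it needs a runtime that already ships CompressionStream/DecompressionStream, and the function names are made up.

```ts
// Compress a byte array with the draft Compression Streams API.
// "deflate" here is the zlib format; "gzip" is also supported.
async function deflateBytes(input: Uint8Array): Promise<Uint8Array> {
  const compressed = new Blob([input]).stream()
    .pipeThrough(new CompressionStream("deflate"));
  return new Uint8Array(await new Response(compressed).arrayBuffer());
}

// The inverse, using DecompressionStream with the same format name.
async function inflateBytes(input: Uint8Array): Promise<Uint8Array> {
  const decompressed = new Blob([input]).stream()
    .pipeThrough(new DecompressionStream("deflate"));
  return new Uint8Array(await new Response(decompressed).arrayBuffer());
}
```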

ept commented 3 years ago

I did some experiments using pako for DEFLATE compression: adding it as a dependency increases the minified and gzipped JS bundle size by 13kB. Assuming a typical Automerge document gets a factor-of-2 compression, this means you don't need a very big document before the compression algorithm pays for itself in terms of data size. I hate adding dependencies, but this one seems like a clear net gain. In terms of speed, I expect that the cost of compression and decompression will remain a negligible fraction of the overall cost of loading or saving an Automerge document. (It's certainly negligible right now, since Automerge is so slow, and if we get Automerge itself to be so fast that DEFLATE is a significant fraction of the CPU time, that will be very good news indeed.) And Rust/wasm is very well suited to this sort of byte array manipulation, so compression should be even faster there.
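
For comparison, this is roughly the boilerplate that every application would otherwise have to write around save/load if compression stayed outside the library. The wrapper names are invented for illustration, and it again assumes the binary-format preview API; a built-in version would simply do this inside save and load:

```ts
import * as Automerge from "automerge";
import * as pako from "pako";

// Hypothetical wrappers, for illustration only: what DEFLATE-by-default
// would do transparently inside Automerge.save() and Automerge.load().
function saveCompressed<T>(doc: Automerge.Doc<T>): Uint8Array {
  return pako.deflate(Automerge.save(doc));
}

function loadCompressed<T>(data: Uint8Array): Automerge.Doc<T> {
  const decompressed = pako.inflate(data) as ReturnType<typeof Automerge.save>;
  return Automerge.load<T>(decompressed);
}
```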

For that reason I am tending towards enabling compression by default. Middleware or more config options just make the API more complicated, and I haven't seen compelling evidence that such configurability is really needed. On the other hand, I do regularly hear concerns about the size of Automerge documents, and compression will reduce that.

In terms of algorithm, I am tending towards DEFLATE, since it is very widely used (e.g. in gzip and in PNG image files), it's middle of the road in terms of trade-offs (more compact than LZ4, faster than bzip2), and it's widely supported (e.g. in the draft compression streams API that @nornagon pointed out). I don't think we should provide a choice of different compression algorithms, just like PNG doesn't give you a choice either — one algorithm is enough.

This leaves the question of what exactly to compress: one column at a time, or an entire container (= a single change, or an entire document)? There are reasons for wanting both:

If we want to allow both compressed containers and compressed columns, this suggests we need the following changes to the binary data format:

  1. We currently have two container types (individual change and whole document); we define a third container type that is an individual change but DEFLATE-compressed. (We don't need a container type for compressed whole document, since whole documents do compression at the column level instead.)
  2. We change the definition of a columnId to include one bit indicating whether or not the column data is DEFLATE-compressed. Currently the low 3 bits of the columnId indicate the column type, and the higher bits identify the column group. I suggest we shift the column group up by 1 bit and use the additional bit as a compression flag, as sketched below.
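
Concretely, encoding and decoding such a columnId might look like this (the constant and function names are illustrative, not the actual codec code):

```ts
// Proposed columnId layout:
//   bits 0-2  column type (unchanged)
//   bit  3    DEFLATE flag (new)
//   bits 4+   column group (shifted up by one bit)
const TYPE_BITS = 3;
const TYPE_MASK = (1 << TYPE_BITS) - 1; // 0b0111
const DEFLATE_FLAG = 1 << TYPE_BITS;    // 0b1000

function encodeColumnId(group: number, type: number, compressed: boolean): number {
  return (group << (TYPE_BITS + 1)) | (compressed ? DEFLATE_FLAG : 0) | type;
}

function decodeColumnId(columnId: number): { group: number; type: number; compressed: boolean } {
  return {
    type: columnId & TYPE_MASK,
    compressed: (columnId & DEFLATE_FLAG) !== 0,
    group: columnId >>> (TYPE_BITS + 1),
  };
}
```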

Since this is a breaking change to the data format, I would like to get it in before the 1.0-preview release, so that it gets lumped in with all the other breaking changes we have already made, avoiding the need for further breaking changes in the future.

nornagon commented 3 years ago

Thanks for the experiments @ept, I'm convinced by your arguments :)

ept commented 3 years ago

Implemented in #330.