Should we store protobufs gzipped?

gmaclennan commented 3 weeks ago

I realized recently when working on mapbox vector tiles (which as encoded as protobufs) that gzip compression can save quite a bit of space. In the "wild" protobufs are mainly used in network transports, which are normally gzipped (or otherwise compressed) anyway.

If we gzip compress Mapeo records at rest (e.g. on disk) we could save both disk space and reduce network traffic. This would be a breaking change, so it should ideally happen before MVP launch. It would add overhead reading and writing data, but gzip compression tends to be very fast (it would only be a bottle-neck for reading, since write performance is not an issue - syncing would not require gzipping, because the data will be arriving already gzipped).

Is this worth trying to do before MVP?

EvanHahn commented 2 weeks ago

I think there are two ways we could gzip data:

Compress before writing to Hypercore. Roughly:
```
const block = encode(doc)
-core.append(block)
+core.append(gzip(block))
```
This theoretically reduces the size on disk and network. However, in my testing with mock data, this actually makes individual blocks larger! This is because gzip (1) adds 18 bytes of overhead (2) only shrinks data on average, not always. Most of our blocks are small and hard to compress, and the overhead doesn't result in savings.
Compress on disk. Roughly:
```
-new Hypercore(() => new RandomAccessFile('/tmp/data'))
+new Hypercore(() => new RandomAccessGzip('/tmp/data.gz'))
```
This theoretically reduces the size on disk, but not on the network. In my testing, this seems like it could make a big difference when storing documents. However, I suspect blobs will take up the vast majority of space, and they're probably already compressed.

We would need to write RandomAccessGzip (or equivalent), which is nontrivial because gzip isn't designed for easy random access.

I'm not 100% confident in my response here, but I don't think this is worth doing for the MVP.

gmaclennan commented 2 weeks ago

That's really helpful, thanks Evan. As you say, I think our data structure is quite different from vector tile data (which is where I was seeing significant size reduction from gzipping protobufs), which have lots of repeated text data. I think based on this data we can safely close this issue and consider this investigated and discarded as something that we might need to do. It's good to know that it's not something that we need.

digidem / mapeo-core-next

Should we store protobufs gzipped? #701