Open jakirkham opened 2 years ago
cc @martindurant
cramjam has worked out very nicely for fastparquet. The install is smaller than the separate libraries, has no further deps, a consistent API and (often) faster than the standard ones. It supports the buffer protocol and can even decompress_into if you have a buffer pre-allocated. I have yet to apply it to fsspec, where compression is far less important.
Yeah the decompressing into a buffer feature can be useful. Not sure if we need that in Distributed serialization currently, but that could change. Though I think with that we need to track the decompressed buffer size separately right? Or does cramjam add that to the compressed result somehow (perhaps as a binary header)?
In the parquet case, we know the decompressed size - otherwise you have to send it separately. I think maybe one or two of the codecs contain the size in a header.
Yeah we could certainly add logic to track buffer sizes pre-compression. It's not hard. There may be some value in doing that.
In PR ( https://github.com/dask/distributed/pull/6259 ), it was suggested that we might benefit from using one library that collects several compressors. Some libraries like this include cramjam, compress, numcodecs, python-libarchive-c, etc.. This would allow us to have one dependency to use for our compressors.