dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Use compressors collection library #6265

Open jakirkham opened 2 years ago

jakirkham commented 2 years ago

In PR ( https://github.com/dask/distributed/pull/6259 ), it was suggested that we might benefit from using one library that collects several compressors. Candidate libraries include cramjam, compress, numcodecs, python-libarchive-c, etc. This would let us rely on a single dependency for all of our compressors.
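For illustration, a hypothetical sketch of what "one dependency for all compressors" could look like, using a simple dict registry rather than Distributed's actual internals; the names and shape here are illustrative only:

```python
import cramjam

# Hypothetical registry: one dependency (cramjam) backing several codecs.
# This is not Distributed's actual compression registry, just a sketch.
compressors = {
    "snappy": (cramjam.snappy.compress, cramjam.snappy.decompress),
    "lz4":    (cramjam.lz4.compress,    cramjam.lz4.decompress),
    "zstd":   (cramjam.zstd.compress,   cramjam.zstd.decompress),
    "gzip":   (cramjam.gzip.compress,   cramjam.gzip.decompress),
}

compress, decompress = compressors["lz4"]
frame = compress(b"hello" * 100)          # returns a cramjam.Buffer
assert bytes(decompress(bytes(frame))) == b"hello" * 100
```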

jakirkham commented 2 years ago

cc @martindurant

martindurant commented 2 years ago

cramjam has worked out very nicely for fastparquet. The install is smaller than the separate libraries combined, it has no further dependencies, it presents a consistent API, and it is (often) faster than the standard implementations. It supports the buffer protocol and can even decompress_into a pre-allocated buffer. I have yet to apply it to fsspec, where compression is far less important.
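A minimal sketch of the decompress_into pattern described above, assuming cramjam's per-codec modules; the codec choice and payload here are arbitrary:

```python
import cramjam

data = b"some bytes to compress" * 1000

# Same call shape across codecs (cramjam.snappy, cramjam.lz4, cramjam.zstd, ...).
compressed = cramjam.snappy.compress(data)  # returns a cramjam.Buffer

# Decompress into a pre-allocated buffer when the output size is known.
out = bytearray(len(data))
n_written = cramjam.snappy.decompress_into(bytes(compressed), out)
assert out[:n_written] == data
```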

jakirkham commented 2 years ago

Yeah, decompressing into a pre-allocated buffer can be useful. I'm not sure we need that in Distributed's serialization currently, but that could change. Though with that approach we would need to track the decompressed buffer size separately, right? Or does cramjam add that to the compressed result somehow (perhaps as a binary header)?

martindurant commented 2 years ago

In the parquet case we know the decompressed size; otherwise you have to send it separately. I think maybe one or two of the codecs include the size in a header.

jakirkham commented 2 years ago

Yeah, we could certainly add logic to track buffer sizes before compression. It's not hard, and there may be some value in doing that.
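A hypothetical sketch of that idea: prefix each compressed frame with its uncompressed length so the receiver can pre-allocate and use decompress_into. The 8-byte little-endian header is an arbitrary choice for illustration, not anything Distributed does today:

```python
import struct
import cramjam

HEADER = struct.Struct("<Q")  # 8-byte little-endian uncompressed length

def compress_with_size(data: bytes) -> bytes:
    # Record the uncompressed length before compressing.
    return HEADER.pack(len(data)) + bytes(cramjam.lz4.compress(data))

def decompress_with_size(frame: bytes) -> bytearray:
    # Read the recorded length, pre-allocate, then decompress in place.
    (size,) = HEADER.unpack_from(frame)
    out = bytearray(size)
    cramjam.lz4.decompress_into(frame[HEADER.size:], out)
    return out

payload = b"x" * 10_000
assert decompress_with_size(compress_with_size(payload)) == payload
```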