dcwatson / deflate

Python extension wrapper for libdeflate.
MIT License

Hi from python-isal #10

Closed rhpvorderman closed 2 years ago

rhpvorderman commented 2 years ago

Hi, I work on python-isal, which wraps ISA-L. It also aims to accelerate compression/decompression and it supports streaming features.

Unfortunately ISA-L only works well on x86-64 (Intel, AMD) so it is much more limited than deflate in that respect.

Given that you probably work on this library because of some compression/decompression needs, I wanted to let you know about python-isal. I also wanted to say hi, as another coder working on Python bindings for a deflate-compatible compression library.

rhpvorderman commented 2 years ago

One possible improvement I see: when decompressing, a void * buffer is created and later copied into a bytes object with PyBytes_FromStringAndSize. But you can also do:

PyObject * return_value = PyBytes_FromStringAndSize(NULL, decompressed_size);
if (return_value == NULL) {
    return NULL;  // allocation failed, exception is already set
}
void * decompressed_data = (void *)PyBytes_AS_STRING(return_value);
// Decompression set up here
libdeflate_gzip_decompress(
        decompressor, data.buf, data.len, decompressed_data, size, &decompressed_size);
// error-handling code here (Py_DECREF(return_value) before returning NULL)
return return_value;

This way you only allocate an output buffer once, for the bytes object. No copying required.

dcwatson commented 2 years ago

Thanks for the suggestion! It wasn't quite that simple, since decompressed_size isn't known before decompression, but there is a _PyBytes_Resize function. Almost certainly better than a copy.

I wrote this to use in https://github.com/imsweb/pzip, which compresses (and encrypts) in chunks, so I had no need for a streaming interface -- libdeflate is very well suited for this case.

rhpvorderman commented 2 years ago

> It wasn't quite that simple, since decompressed_size isn't known before decompression

Well, it should be equal to the ISIZE field in the gzip trailer (the uncompressed size modulo 2^32); otherwise the gzip is corrupt. So you can already initialize the buffer with the correct size. And the nice thing is that _PyBytes_Resize returns early when the size is already correct, so no resizing happens in the normal case.
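To illustrate the ISIZE idea, here is a minimal Python sketch using only the standard library (not the deflate or python-isal code). The helper name `gzip_isize` is hypothetical, and the sketch assumes a single-member gzip stream whose uncompressed size is below 4 GiB, since ISIZE wraps modulo 2**32.

```python
import gzip
import struct


def gzip_isize(blob: bytes) -> int:
    """Return ISIZE: the last 4 bytes of a gzip member, little-endian."""
    if len(blob) < 18:  # a 10-byte header plus an 8-byte trailer is a lower bound
        raise ValueError("too short to be a gzip member")
    (isize,) = struct.unpack("<I", blob[-4:])
    return isize


payload = b"hello deflate " * 1000
compressed = gzip.compress(payload)

# ISIZE holds the uncompressed length modulo 2**32, so a wrapper could
# size its output buffer up front for a single-member stream.
expected_size = gzip_isize(compressed)
assert expected_size == len(payload) % 2**32
assert len(gzip.decompress(compressed)) == expected_size
```

A C extension could read the same four bytes from the input buffer before calling PyBytes_FromStringAndSize, and still treat a mismatch after decompression as an error.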

> I wrote this to use in https://github.com/imsweb/pzip, which compresses (and encrypts) in chunks, so I had no need for a streaming interface -- libdeflate is very well suited for this case.

Ah, very useful. Chunked compression is also used by a bioinformatics format called BAM. It uses the blocked gzip format (BGZF), which is basically a series of independently compressed gzip blocks. The length of the compressed block is saved in the first EXTRA field, while the length of the decompressed result is saved in ISIZE. This is very useful, as you know the exact sizes of the buffers.
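As a rough sketch of how both sizes can be read from a block, the Python below builds one BGZF-style gzip member by hand and then parses it. It assumes the usual BGZF layout as I understand it: an extra subfield with id "BC" whose 16-bit value BSIZE is the total block size minus 1. The helper names `make_bgzf_block` and `bgzf_block_sizes` are hypothetical and only the standard library is used.

```python
import gzip
import struct
import zlib


def make_bgzf_block(data: bytes) -> bytes:
    """Build one BGZF-style gzip member (demo helper, small inputs only)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = compressor.compress(data) + compressor.flush()
    total = 18 + len(cdata) + 8  # 18-byte header + payload + 8-byte trailer
    header = struct.pack(
        "<4BI2BH2BHH",
        31, 139, 8, 4,  # gzip magic, CM=deflate, FLG=FEXTRA
        0,              # MTIME
        0, 255,         # XFL, OS=unknown
        6,              # XLEN: one 6-byte subfield follows
        66, 67,         # subfield id "BC"
        2,              # SLEN
        total - 1,      # BSIZE = total block size minus 1
    )
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + cdata + trailer


def bgzf_block_sizes(buf: bytes) -> tuple:
    """Return (compressed block length, uncompressed length) for the first block."""
    xlen = struct.unpack_from("<H", buf, 10)[0]
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<2BH", buf, pos)
        if (si1, si2, slen) == (66, 67, 2):
            block_len = struct.unpack_from("<H", buf, pos + 4)[0] + 1
            isize = struct.unpack_from("<I", buf, block_len - 4)[0]
            return block_len, isize
        pos += 4 + slen
    raise ValueError("no BC subfield found")


chunk = b"ACGT" * 4000
block = make_bgzf_block(chunk)

# Both sizes are known before any decompression happens, even when
# more blocks follow in the same buffer.
block_len, isize = bgzf_block_sizes(block + b"bytes of the next block")
assert block_len == len(block)
assert isize == len(chunk)
assert gzip.decompress(block) == chunk
```

Because each block is a complete gzip member, a one-shot decompressor can handle blocks independently with exactly sized input and output buffers.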

rhpvorderman commented 2 years ago

I just got a notification (I co-maintain the conda-feedstock for libdeflate): https://github.com/ebiggers/libdeflate/releases/tag/v1.9. FYI.