dcwatson / deflate

Python extension wrapper for libdeflate.
MIT License
25 stars 6 forks

Decompression failed #54

Closed timmytwoteeth closed 1 month ago

timmytwoteeth commented 1 month ago

Hello,

I am receiving this error:

    file = deflate.gzip_decompress(file.read())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
deflate.DeflateError: Decompression failed

Reproduce:

pip install deflate

Here is the code:

    with open(file_path, "rb") as file:
        file = deflate.gzip_decompress(file.read())
    file = BytesIO(file)

This is Python 3.12 on Ubuntu 22.04.

File is .gz.

Thank you.

dcwatson commented 1 month ago

Does it work with Python's built-in gzip.decompress function? My first thought is the GZ file is not actually GZip, or corrupt somehow, but hard to know for sure.

timmytwoteeth commented 1 month ago

> Does it work with Python's built-in gzip.decompress function? My first thought is the GZ file is not actually GZip, or corrupt somehow, but hard to know for sure.

Hello,

Thank you for the response.

The file is a warc.gz file.

If it is helpful for diagnosis, the file being tested can be obtained here: wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz

henryiii commented 1 month ago

FYI, I was curious, and yes, it does work with gzip.decompress. It takes a long time (the zipped file is 888 MB), but it does work, whereas the failure with deflate is pretty much instant.

$ docker run --rm -it quay.io/pypa/musllinux_1_2_x86_64
# wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
# python3.12 -m pip install deflate
# python3.12
>>> import deflate
>>> import gzip
>>> file_path = "CC-MAIN-20180420081400-20180420101400-00000.warc.gz"
>>> with open(file_path, "rb") as file:
...     file = deflate.gzip_decompress(file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
deflate.DeflateError: Decompression failed
>>> with open(file_path, "rb") as file:
...     file = gzip.decompress(file.read())
...

(Technically, I gave up waiting on that, but gzip.open(file_path).read() did return eventually when I tried it originally).

dcwatson commented 1 month ago

The built-in archive utility on macOS also chokes on this file. But plain old gunzip works, and it expands to a 3.7 GB file (fairly quickly). libdeflate is probably failing fast due to one of the checks in https://github.com/ebiggers/libdeflate/blob/master/lib/gzip_decompress.c but I'm not sure I want to go digging into which specifically, as it's probably out of scope for this library.

In any case, libdeflate is definitely the wrong tool for files of this size - see https://github.com/ebiggers/libdeflate?tab=readme-ov-file#api for comments from the author.
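For multi-gigabyte .gz files like this one, streaming decompression with the stdlib keeps memory bounded instead of materializing the whole output at once. A minimal sketch (the function name and chunk size are my own, not part of either library):

```python
import gzip

def iter_gzip_chunks(path, chunk_size=1 << 20):
    """Yield decompressed chunks of a .gz file without loading it all at once."""
    with gzip.open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            yield chunk
```

As a bonus, `gzip.GzipFile` reads concatenated multi-member gzip files (warc.gz files typically store one gzip member per record), which a whole-buffer decompressor may reject.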

timmytwoteeth commented 1 month ago

> FYI, I was curious, and yes, it does work with gzip.decompress. It takes a long time (the zipped file is 888 MB), but it does work, whereas the failure with deflate is pretty much instant.
>
>     $ docker run --rm -it quay.io/pypa/musllinux_1_2_x86_64
>     # wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
>     # python3.12 -m pip install deflate
>     # python3.12
>     >>> import deflate
>     >>> import gzip
>     >>> file_path = "CC-MAIN-20180420081400-20180420101400-00000.warc.gz"
>     >>> with open(file_path, "rb") as file:
>     ...     file = deflate.gzip_decompress(file.read())
>     ...
>     Traceback (most recent call last):
>       File "<stdin>", line 2, in <module>
>     deflate.DeflateError: Decompression failed
>     >>> with open(file_path, "rb") as file:
>     ...     file = gzip.decompress(file.read())
>     ...
>
> (Technically, I gave up waiting on that, but gzip.open(file_path).read() did return eventually when I tried it originally).

Hi @henryiii,

The file will work with gzip.decompress. I tested with a variety of warc.gz files. Each time it was incredibly slow.

Doing something very standard like:

    with open("my warc.gz", "rb") as file:
        byte_stream = io.BytesIO(file.read())

works much faster for decompressing/reading/streaming/opening.

> The built-in archive utility on macOS also chokes on this file. But plain old gunzip works, and it expands to a 3.7 GB file (fairly quickly). libdeflate is probably failing fast due to one of the checks in https://github.com/ebiggers/libdeflate/blob/master/lib/gzip_decompress.c but I'm not sure I want to go digging into which specifically, as it's probably out of scope for this library.
>
> In any case, libdeflate is definitely the wrong tool for files of this size - see https://github.com/ebiggers/libdeflate?tab=readme-ov-file#api for comments from the author.

Hi @dcwatson,

Thank you for the additional information. Given that this is performance-sensitive at scale, every additional second of decompression adds up, so having something that decompresses faster on average is quite meaningful. It sounds like deflate may not be the best fit for this case, given the size of the .gz files.

I have been benchmarking a few other decompression libraries.

@dcwatson @henryiii Do either of you have thoughts on libraries worth testing?

Thank you.

dcwatson commented 1 month ago

I'd probably look at zlib-ng - there are Python bindings here.

Not an endorsement, as I've never used either, but it seems like zlib-ng is significantly faster than vanilla zlib and does streaming decompression.
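For reference, the stdlib `zlib` module already supports incremental gzip decompression via `decompressobj`, and the zlib-ng bindings advertise a zlib-compatible drop-in API, so a sketch against stdlib `zlib` should carry over (the drop-in swap is an assumption, and the function name is mine). `wbits=MAX_WBITS | 16` tells zlib to expect a gzip wrapper, and the `eof`/`unused_data` handling restarts at each member boundary, which matters for warc.gz files built from many concatenated gzip members:

```python
import zlib

def gunzip_stream(chunks):
    """Incrementally decompress a (possibly multi-member) gzip byte stream."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16 => gzip wrapper
    for chunk in chunks:
        while chunk:
            yield d.decompress(chunk)
            if d.eof:
                # End of one gzip member; any leftover bytes start the next.
                chunk = d.unused_data
                d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
            else:
                chunk = b""
    yield d.flush()
```

With the python-zlib-ng bindings, swapping the import to `from zlib_ng import zlib_ng as zlib` should be the only change needed, per their stated zlib compatibility (untested assumption).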

timmytwoteeth commented 1 month ago

> I'd probably look at zlib-ng - there are Python bindings here.
>
> Not an endorsement, as I've never used either, but it seems like zlib-ng is significantly faster than vanilla zlib and does streaming decompression.

Thank you for the information.

I will add this to the benchmark as well.

henryiii commented 1 month ago

I don't actually know anything about decompression[^1], I just helped with the packaging. :)

[^1]: I've technically done a little research in algorithms a long time ago, but not familiar with libraries

timmytwoteeth commented 1 month ago

> I don't actually know anything about decompression[^1], I just helped with the packaging. :)
>
> [^1]: I've technically done a little research in algorithms a long time ago, but not familiar with libraries

Pybind is an awesome library so we definitely appreciate all your work.