Closed timmytwoteeth closed 1 month ago
Does it work with Python's built-in `gzip.decompress` function? My first thought is that the GZ file is not actually gzip, or is corrupt somehow, but it's hard to know for sure.
Hello,
Thank you for the response.
The file is a `warc.gz` file.
In case it is helpful for diagnosis, the file being tested can be obtained here: wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
FYI, I was curious, and yes, it does work with `gzip.decompress`. It takes a long time (the compressed file is 888 MB), but it does work, while the failure with deflate is pretty much instant.
```
$ docker run --rm -it quay.io/pypa/musllinux_1_2_x86_64
# wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
# python3.12 -m pip install deflate
# python3.12
>>> import deflate
>>> import gzip
>>> file_path = "CC-MAIN-20180420081400-20180420101400-00000.warc.gz"
>>> with open(file_path, "rb") as file:
...     file = deflate.gzip_decompress(file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
deflate.DeflateError: Decompression failed
>>> with open(file_path, "rb") as file:
...     file = gzip.decompress(file.read())
...
```
(Technically, I gave up waiting on that, but `gzip.open(file_path).read()` did return eventually when I tried it originally.)
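One guess at why libdeflate bails instantly while `gzip` succeeds (an assumption on my part, not verified against libdeflate's checks): `warc.gz` files are normally written as many concatenated gzip members, one per record, and a decoder expecting a single member can error out at the first member boundary. Python's `gzip` module reads concatenated members transparently. A minimal standard-library sketch of the difference between whole-stream and member-by-member decoding:

```python
import gzip
import zlib

# Two independent gzip members concatenated, mimicking a (tiny) warc.gz layout.
data = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")

# gzip.decompress handles concatenated members transparently.
assert gzip.decompress(data) == b"record one\nrecord two\n"

# Member-by-member decoding with zlib: wbits=31 selects gzip framing,
# and unused_data holds whatever follows the current member.
records = []
remaining = data
while remaining:
    d = zlib.decompressobj(wbits=31)
    records.append(d.decompress(remaining))
    remaining = d.unused_data

print(records)  # [b'record one\n', b'record two\n']
```

With the real file, the same `unused_data` loop would yield one decompressed record per gzip member.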
The built-in archive utility on macOS also chokes on this file. But plain old `gunzip` works, and it expands to a 3.7 GB file (fairly quickly). `libdeflate` is probably failing fast due to one of the checks in https://github.com/ebiggers/libdeflate/blob/master/lib/gzip_decompress.c but I'm not sure I want to go digging into which specifically, as it's probably out of scope for this library.

In any case, `libdeflate` is definitely the wrong tool for files of this size - see https://github.com/ebiggers/libdeflate?tab=readme-ov-file#api for comments from the author.
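For a file this size it may also be worth streaming the decompression in fixed-size chunks rather than materializing the whole 3.7 GB in one bytes object. A rough sketch using only the standard `zlib` module (the chunked reader and its multi-member handling are my own sketch, not part of this library):

```python
import zlib

def iter_decompressed(fileobj, chunk_size=1 << 20):
    """Yield decompressed chunks from a (possibly multi-member) gzip stream."""
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
    while True:
        # Leftover bytes after a finished member come first; otherwise read more.
        compressed = d.unused_data or fileobj.read(chunk_size)
        if not compressed:
            break
        if d.eof:  # previous member finished; start decoding the next one
            d = zlib.decompressobj(wbits=31)
        out = d.decompress(compressed)
        if out:
            yield out

# Usage (path from this thread):
# with open("CC-MAIN-20180420081400-20180420101400-00000.warc.gz", "rb") as f:
#     for chunk in iter_decompressed(f):
#         ...  # process at most ~1 MiB of compressed input at a time
```

This keeps peak memory bounded by the chunk size instead of the decompressed size.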
Hi @henryiii,
The file will work with `gzip.decompress`. I tested with a variety of `warc.gz` files. Each time it was incredibly slow.
Doing something very standard like:

```python
import io

with open("my warc.gz", "rb") as file:
    byte_stream = io.BytesIO(file.read())
```

works much faster to decompress/read/stream/open.
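To actually get decompressed bytes out of such a buffer, rather than just holding the raw compressed data, the `BytesIO` can be wrapped in `gzip.GzipFile` and read in chunks. A small self-contained sketch (the in-memory sample stands in for reading a real `warc.gz` file):

```python
import gzip
import io

# In-memory stand-in for open("my warc.gz", "rb").read(); substitute real file bytes.
compressed = gzip.compress(b"WARC/1.0\r\n" * 1000)

byte_stream = io.BytesIO(compressed)
total = 0
with gzip.GzipFile(fileobj=byte_stream) as gz:
    while chunk := gz.read(1 << 20):  # read up to 1 MiB of decompressed data at a time
        total += len(chunk)

print(total)  # → 10000 (total decompressed bytes)
```

Reading in chunks this way avoids ever holding the full decompressed output in memory at once.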
Hi @dcwatson,
Thank you for the additional information. Given that this is performance-sensitive at scale, every additional second of decompression adds up, so anything that decompresses faster on average is quite meaningful. It looks like deflate may not be the best fit for this case, then, given the size of the `.gz` files.
I have been benchmarking a few other decompression libraries.
@dcwatson @henryiii Do either of you have thoughts on libraries worth testing?
Thank you.
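In case it helps with the benchmarking, a minimal harness along these lines keeps comparisons reproducible and checks correctness alongside timing (a sketch of my own; the candidates dict and synthetic payload are placeholders, to be extended with whichever libraries are installed):

```python
import gzip
import time
import zlib

# Synthetic stand-in for a real warc.gz payload.
data = gzip.compress(b"sample record\n" * 100_000)
expected = gzip.decompress(data)

candidates = {
    "gzip.decompress": gzip.decompress,
    "zlib (wbits=31)": lambda b: zlib.decompressobj(wbits=31).decompress(b),
    # extend with other decompressors, e.g. deflate.gzip_decompress where it applies
}

for name, fn in candidates.items():
    start = time.perf_counter()
    for _ in range(5):
        out = fn(data)
    elapsed = (time.perf_counter() - start) / 5
    assert out == expected  # correctness check before trusting the timing
    print(f"{name}: {elapsed * 1000:.2f} ms")
```

Running each candidate several times and taking the average smooths out warm-up noise, and the correctness assertion guards against comparing a fast-but-wrong path.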
I don't actually know anything about decompression[^1], I just helped with the packaging. :)
[^1]: I've technically done a little research in algorithms a long time ago, but not familiar with libraries
Hello,
I am receiving this error:
Reproduce:
Here is the code:
The Python version is 3.12, on Ubuntu 22.04.
The file is a `.gz` file.
Thank you.