bguise987 / pigz-python

The goal of this project is to create a pure Python implementation of the pigz project for parallelizing gzipping.
MIT License

Random failures with ~800M file #34

Open bsergean opened 3 years ago

bsergean commented 3 years ago

Hi there,

We're using your package to build Debian (.deb) packages (much faster than dpkg-dev), but unfortunately we see random failures that I cannot explain. The file that is produced cannot be decompressed by zlib/dpkg.

dpkg-deb (subprocess): decompressing archive member: internal gzip read error: '<fd:4>: invalid block type'

Since the code is multi-threaded, are there places where we might need a lock? Do we need a thread-safe queue or something similar?

bguise987 commented 3 years ago

Hi @bsergean thanks for reporting this issue! It's always exciting to hear about how others are using your work, and as a long time Debian user, this use case makes me happy :).

You may be on the right track there re: locks or thread safe queues.

I'm not super familiar with the process of building a .deb package. For the error message you posted, is the ending part '<fd:4>: invalid block type' direct output from gzip that gets piped to dpkg-deb?

If not, have you tried directly unzipping the file with gzip to see what that output is?

As well, would it be at all possible to send me some example data showing the compression source and the resulting improperly constructed gzip file?

Thanks!

bsergean commented 3 years ago

Hi Ben,

Unfortunately I cannot share the compression source, nor the output. But what I should try to do is find a public repro case. Maybe a Debian .iso (like a 700M file) would reproduce it, and you could try setting the thread count to something large like 32 or 64.

I will try to make the deb code available; it's really not that much. I was not aware of this before, but the Debian package file format is very simple; you can look it up on Wikipedia. It's an ar file, essentially containing 3 tar.gz files.
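For illustration, listing the members of an ar archive takes only a few lines of Python. This is a minimal sketch based on the common ar layout (an 8-byte magic string followed by 60-byte member headers); list_ar_members is a hypothetical helper, not part of mk_deb or pigz-python, and it ignores extended (long) file names. In a typical .deb the first member is a small debian-binary version file, followed by the control and data tarballs.

```python
AR_MAGIC = b"!<arch>\n"   # global header at the start of every ar archive
HEADER_SIZE = 60          # fixed size of each member header


def list_ar_members(blob):
    """Return [(name, size), ...] for the members of an ar archive held in memory.

    Hypothetical helper for illustration only; long-name extensions are not handled.
    """
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")

    members = []
    offset = len(AR_MAGIC)
    while offset + HEADER_SIZE <= len(blob):
        header = blob[offset:offset + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header at offset %d" % offset)
        # Name: 16 bytes, space padded; GNU ar may append a trailing "/".
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        # Size: 10 ASCII decimal digits, space padded.
        size = int(header[48:58].decode("ascii").strip())
        members.append((name, size))
        # Member data is padded to an even number of bytes.
        offset += HEADER_SIZE + size + (size % 2)
    return members
```

Calling it on the bytes of a package file (for example the result of reading a .deb in binary mode) should return one entry per archive member.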

bsergean commented 3 years ago

FYI this is how I use your package -> https://github.com/bsergean/mk_deb

Unfortunately I still have not found a repro (I have not tried too hard).

bsergean commented 3 years ago

I think I found a repro. I have a unit test that creates files with random content and then tries to decompress them, and things randomly fail.

Steps to reproduce:

git clone https://github.com/bsergean/mk_deb.git
cd mk_deb
python3 -mvenv venv
source venv/bin/activate
pip install pigz_python
python3 test_mk_deb.py

If I run

python3 test_mk_deb.py TestDebCreation.testCompressFile

many times, eventually I get:

$ python3 test_mk_deb.py TestDebCreation.testCompressFile                     
2021-06-29 10:50:18,096 INFO Compressed file size: 100000 bytes
2021-06-29 10:50:18,606 INFO Compressed file size: 100000 bytes
E
======================================================================
ERROR: testCompressFile (__main__.TestDebCreation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/benjaminsergeant/src/foss/mk_deb/test_mk_deb.py", line 109, in testCompressFile
    compressFile(tempFile, 8, False, workers, blocksize)
  File "/Users/benjaminsergeant/src/foss/mk_deb/mk_deb.py", line 33, in compressFile
    size = len(f.read())
  File "/usr/local/Cellar/python@3.9/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 300, in read
    return self._buffer.read(size)
  File "/usr/local/Cellar/python@3.9/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 495, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid distance code

----------------------------------------------------------------------
Ran 1 test in 1.549s

FAILED (errors=1)

bsergean commented 3 years ago

https://github.com/bsergean/mk_deb/blob/master/mk_deb.py#L15

The compress function does this at the end of compressFile, after compression (it reads the .gz file back, so the logged size is actually the decompressed length):

    with gzip.open(path + ".gz") as f:
        size = len(f.read())
        logging.info(f"Compressed file size: {size} bytes")

And the unit test does:

        workers = 16
        blocksize = 128

        for i in range(10):
            tempFile = os.path.join(self.tempDir, "input_file")
            with open(tempFile, "wb") as f:
                f.write(os.urandom(100 * 1000))  # 100K

            compressFile(tempFile, 8, False, workers, blocksize)

bsergean commented 3 years ago

I changed the unittest to keep the temp files.

Input goes in /tmp/input_file and output goes to /tmp/input_file.gz.

FAILED (errors=1)
(venv) mk_deb$ file /tmp/input_file.gz 
/tmp/input_file.gz: gzip compressed data, was "input_file", last modified: Tue Jun 29 17:58:12 2021, from Unix, original size modulo 2^32 100000
(venv) mk_deb$ gunzip -c /tmp/input_file.gz > /dev/null 
gunzip: /tmp/input_file.gz: unexpected end of file
gunzip: /tmp/input_file.gz: uncompress failed

(venv) mk_deb$ wc -c /tmp/input_file.gz
  100069 /tmp/input_file.gz
(venv) mk_deb$ wc -c /tmp/input_file   
  100000 /tmp/input_file

bsergean commented 3 years ago

I copied/vendored the pigz python file in my repo, I'm gonna make some experiments (trying to remove the lock etc...).

First attempt here; I don't know when I'll have time to fix this: https://github.com/bsergean/mk_deb/commit/ff755b40753a3e7207814a328e769b2bd580c6a6

bguise987 commented 3 years ago

Hi @bsergean just wanted to check in here, I didn't forget about this, just have been quite busy in my work and personal lives over the last few weeks. I should be able to take a closer look at this soon!

bsergean commented 3 years ago

No worries at all. I have a workaround in place which works great: basically I re-read the compressed gz file, which is rather fast, and recompress with the gzip module if there's a problem.
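A workaround along those lines could look roughly like this. It is only a sketch using the standard-library gzip module; gzip_is_readable and ensure_valid_gzip are hypothetical names, not functions from mk_deb or pigz-python.

```python
import gzip
import shutil
import zlib


def gzip_is_readable(gz_path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(gz_path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is an OSError; truncated streams raise EOFError;
        # corrupt deflate data raises zlib.error.
        return False
    return True


def ensure_valid_gzip(source_path, gz_path):
    """Keep gz_path if it is readable, otherwise rebuild it with the gzip module.

    Returns True if the existing file was fine, False if it was recompressed.
    """
    if gzip_is_readable(gz_path):
        return True
    with open(source_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return False
```

The fallback path is single-threaded, so it only costs time in the (hopefully rare) case where the parallel output is bad.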

And it’s the summer, so enjoy life and don’t worry about this :)

bguise987 commented 1 year ago

Hi @bsergean just wanted to check in here as I've recently started to look into this again. I've enjoyed a couple summers since we last interacted 😄

I've written some code which compresses a source file, moves it, decompresses it again, and compares the blake2b hash of the source and result. Doing this I've been able to recreate some errors with a PDF file and a movie file.
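A round-trip check of that kind might be sketched as follows. This is not the actual test code; round_trip_matches, blake2b_of and stdlib_compress are hypothetical names, the "move" step is left out, and the compressor is passed in as a callable so that any implementation can be plugged in.

```python
import gzip
import hashlib
import os
import shutil
import tempfile
import zlib


def blake2b_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs do not need to fit in memory."""
    digest = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_matches(source_path, compress):
    """Compress, decompress, and compare hashes.

    `compress` is any callable taking (source_path, gz_path).
    Returns False if decompression fails or the hashes differ.
    """
    with tempfile.TemporaryDirectory() as workdir:
        gz_path = os.path.join(workdir, "copy.gz")
        restored_path = os.path.join(workdir, "restored")
        compress(source_path, gz_path)
        try:
            with gzip.open(gz_path, "rb") as src, open(restored_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
        except (OSError, EOFError, zlib.error):
            return False
        return blake2b_of(source_path) == blake2b_of(restored_path)


def stdlib_compress(source_path, gz_path):
    """Reference compressor built on the standard-library gzip module."""
    with open(source_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Running the check in a loop against the compressor under test is one way to surface an intermittent failure.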

Interestingly, when I decompress these I'm seeing differing output from zlib.

One thing I've noted in common is that if I get a bad file, I'm seeing the pattern 00 00 00 FF FF written prior to the gzip trailer.
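One way to look for that pattern programmatically is sketched below. The gzip trailer is the last 8 bytes of the file (CRC32 and uncompressed size, both little-endian), and 00 00 FF FF is the marker zlib emits for an empty stored block. The sketch checks only those four bytes, since the byte in front of them can also carry bits of the preceding block. inspect_gzip_tail and build_gzip are hypothetical helpers, and the sketch assumes a single-member gzip file.

```python
import struct
import zlib

SYNC_MARKER = b"\x00\x00\xff\xff"  # empty stored block emitted by a sync flush
TRAILER_SIZE = 8                   # CRC32 + ISIZE, each a little-endian uint32


def inspect_gzip_tail(blob):
    """Report the trailer fields and whether a sync marker sits right before them."""
    if len(blob) < 10 + TRAILER_SIZE:
        raise ValueError("too short to be a gzip file")
    body, trailer = blob[:-TRAILER_SIZE], blob[-TRAILER_SIZE:]
    crc32, isize = struct.unpack("<II", trailer)
    return {
        "sync_marker_before_trailer": body.endswith(SYNC_MARKER),
        "crc32": crc32,
        "isize": isize,
    }


def build_gzip(payload, final_flush):
    """Wrap a raw deflate stream in a minimal gzip header and trailer (demo only)."""
    header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
    comp = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = comp.compress(payload) + comp.flush(final_flush)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


if __name__ == "__main__":
    sample = b"pigz " * 2000
    print(inspect_gzip_tail(build_gzip(sample, zlib.Z_SYNC_FLUSH)))
    print(inspect_gzip_tail(build_gzip(sample, zlib.Z_FINISH)))
```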

bsergean commented 1 year ago

Thanks for the update. I am not actively using this module anymore, but it could be helpful again in the future.

I was actually using it to build Debian packages.

bsergean commented 1 year ago

With C++, there is something called TSAN (thread sanitizer) which helps debug race conditions. Maybe this could be helpful here.

bguise987 commented 1 year ago

Just wanted to provide another update here: the reason this is happening is that the last chunk of the file needs to be flushed from the compression object using zlib.Z_FINISH, but instead it is being flushed with zlib.Z_SYNC_FLUSH.

As a result, the output breaks the gzip standard: the deflate stream is never marked as finished, so the file appears to have more content when in reality it has reached the end.
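The difference between the two flush modes can be seen with the standard-library zlib module alone. This is a minimal sketch on a raw deflate stream, not the pigz-python code itself; raw_deflate and stream_is_terminated are hypothetical helper names.

```python
import zlib

data = b"example payload " * 1000


def raw_deflate(payload, final_flush):
    """Compress payload as a raw deflate stream, ending with the given flush mode."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    return comp.compress(payload) + comp.flush(final_flush)


def stream_is_terminated(stream):
    """Return True if the decompressor saw an end-of-stream marker."""
    decomp = zlib.decompressobj(-zlib.MAX_WBITS)
    decomp.decompress(stream)
    return decomp.eof


synced = raw_deflate(data, zlib.Z_SYNC_FLUSH)
finished = raw_deflate(data, zlib.Z_FINISH)

# A sync flush emits all pending data and ends on an empty stored block,
# but it never sets the "final block" bit.
print(synced.endswith(b"\x00\x00\xff\xff"))  # True
print(stream_is_terminated(synced))          # False
# Z_FINISH closes the stream properly.
print(stream_is_terminated(finished))        # True
```

All of the data is recoverable from the sync-flushed stream, which may explain why the failure shows up only when a reader goes on to interpret the following bytes (here, the gzip trailer) as more deflate data.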

I'll keep looking into this.