lrunzip multi-thread failures

ckolivas / lrzip

Long Range Zip

http://lrzip.kolivas.org

GNU General Public License v2.0

615 stars 76 forks source link

lrunzip multi-thread failures #122

Closed GitBillW closed 3 years ago

GitBillW commented 5 years ago

The debian-10 build of lrunzip-0.631 (built 2018/05/29) often fails when running in multi-threaded mode. This looks like a regression introduced some time after debian-9 (lrzip built 2016/11/14). Most of these problems don't happen with my own lrunzip (built from a clone of https://github.com/ckolivas/lrzip.git), so I guess I need to report this as a debian-10 problem.

However, I have one simple and repeatable test case that is still broken in the latest lrzip code (as of today, 2019/07/25). With an input file consisting of 4096 binary zeroes, the output of lrzip looks like this:

00000000: 4c52 5a49 0006 0010 0000 0000 0000 0000 LRZI............ 00000010: 5d00 0000 0101 0000 0201 0010 0300 0000 ]............... 00000020: 002f 0003 0000 0000 0e00 061a 0000 1040 ./.............@ 00000030: 0000 006f fdff ffa3 b7ff 473e 4815 7239 ...o......G>H.r9 00000040: 6151 b892 28e6 a384 9e29 4603 0a00 0a00 aQ..(....)F..... 00000050: 0000 0000 1000 0000 0000 0000 0300 0000 ................ 00000060: 0000 0062 0f0b 67a9 1f7f 7415 1bc5 be74 ...b..g...t....t 00000070: 5b71 10 [q.

If I try to lrunzip that data with multiple threads I get no output, and this -vv trace:

Decompressing... Reading chunk_bytes at 24 Expected size: 4096 Chunk byte width: 2 Reading eof flag at 25 EOF: 1 Reading expected chunksize at 26 Chunk size: 4096 Reading stream 0 header at 29 Reading stream 1 header at 36 Reading ucomp header at 76 Fill_buffer stream 0 c_len 10 u_len 10 last_head 0 Starting thread 0 to decompress 10 bytes from stream 0 Thread 0 decompressed 10 bytes from stream 0 Taking decompressed data from thread 0 Reading ucomp header at 43 Fill_buffer stream 1 c_len 26 u_len 4096 last_head 64 Starting thread 1 to decompress 26 bytes from stream 1 Reading ucomp header at 93 ---> Invalid data compressed len 0 uncompressed 0 last_head 0 <--- (fatal_return in stream.c)

Adding a -p1 option makes lrunzip work correctly. Strangely, if I comment out the test in stream.c (around line 1636) that is responsible for the fatal_return, then lrunzip works correctly even in multi-thread mode ... but obviously that isn't a proper fix.

I get similar failures with any 4096 byte file, BTW - I've just shown a zero-filled file because the lrzip output is obviously very small. Still fails in the same way if I populate the input file from /dev/urandom.

pete4abw commented 5 years ago

While this (or any) error should not occur, I seem to recall something about minimum dictionary sizes in lzma of 4096 bytes. Anything decompressed size under that may cause problems. Perhaps a proper fix would be to stop the thread from being decompressed based on data size. Or, if compressed size is smaller than X or the uncompressed size is smaller than Y force single thread. Lots of testing for threshold amounts will be needed although quite honestly, small files have no business being compressed with lzma or lrzip.

GitBillW commented 5 years ago

Needless to say I don't make a practice of running lrzip on 4KB files (let alone ones with zero content)! This was just the simplest repeatable test case I discovered when investigating lrunzip failures on the debian-10 build of the lrzip package.

What I was really bothered about were failures seen when trying to lrunzip some of my backups - pretty scary, until I found that it worked OK if forced into single-thread mode. More recent lrzip builds (later than the debian-10 version) also work OK, even in multi-threaded mode.

I'm not familiar with the lrzip code, so I don't know if my 4KB file failure is a unique pathological case, or an instance of a more general problem. That's why I thought it worth posting the bug report here.

pete4abw commented 5 years ago

This is a post decompression failure when the decompressed data is expanded into an unrzipped stream. There is a case when the read_len is > 0 but all the uncompressed variables show 0 which makes it look like an error when it is not. I included a test for read_len in the line you identified (thank you!). I tested small files and big ones using different compressions. But it needs more testing. Here the patch and a patch file. Keep us posted.

--- /home/peter/projects/git/lrzip/stream.c     2019-08-26 15:39:08.344885614 -0500
+++ ./stream.c  2019-10-11 07:16:53.753965748 -0500
@@ -1565,6 +1565,7 @@
 static int fill_buffer(rzip_control *control, struct stream_info *sinfo, struct stream *s, int streamno)
 {
        i64 u_len, c_len, last_head, padded_len, header_length, max_len;
+       int read_len;
        uchar enc_head[25 + SALT_LEN], blocksalt[SALT_LEN];
        stream_thread_struct *st;
        uchar c_type, *s_buf;
@@ -1604,7 +1605,6 @@
                last_head = last_head32;
                header_length = 13;
        } else {
-               int read_len;

                print_maxverbose("Reading ucomp header at %lld\n", get_readseek(control, sinfo->fd));
                if ((control->major_version == 0 && control->minor_version < 6) || ENCRYPT)
@@ -1633,7 +1633,8 @@
        last_head = le64toh(last_head);
        /* Check for invalid data and that the last_head is actually moving
         * forward correctly. */
-       if (unlikely(c_len < 1 || u_len < 1 || last_head < 0 || (last_head && last_head <= s->last_head))) {
+       if (unlikely(read_len < 1 && 
+               (c_len < 1 || u_len < 1 || last_head < 0 || (last_head && last_head <= s->last_head)))) {
                fatal_return(("Invalid data compressed len %lld uncompressed %lld last_head %lld\n",
                             c_len, u_len, last_head), -1);
        }

smallfile.patch.gz

GitBillW commented 5 years ago

I tried your patch and it appeared to be working OK ... however, I have found that if I run lrunzip multiple times, with exactly the same input, then it sometimes seg faults. Unfortunately I can't be more specific than "sometimes", as I can't see any pattern to the failures. My test data is as described in my first comment above: ie the result of running lrzip on a 4KB file populated from /dev/zero.

Valgrind tells me the lrunzip seg fault is at stream.c:1519, ie this line:

if (uci->c_type != CTYPE_NONE) {

I inserted a debug line to see what's going on:

uci = &ucthread[i];
fprintf(stderr, "i=%d uci=%lu\n", i, (unsigned long)uci); fflush(stderr);

When it works, the debug output looks like this:

i=0 uci=94774476441008
i=1 uci=94774476441056
i=2 uci=94774476441104

and when it seg faults, it's like this:

i=0 uci=94864767026608
i=1 uci=94864767026656
i=2 uci=96

The big numbers differ between runs, but when it seg faults the last uci value is always 96. For what it's worth, even when it seg faults, the uncompressed output is complete and correct, although in the seg faulting case the output file timestamp doesn't get set to match the timestamp of the original input file.

pete4abw commented 5 years ago

EDIT: Issue 123 (https://github.com/ckolivas/lrzip/issues/123) is related to this issue. See comment there. It appears that the stream.c fix may be working. But the memory problem persists.

Good work, thank you. Your error seems to be related to the thread problem you uncovered. My fix may not be a fix afterall.

More study is needed. For the time being, I suggest using lrzip normally and not with files one would never compress. The spurious error is very unusual.

On Fri, 11 Oct 2019 09:46:37 -0700 GitBillW notifications@github.com wrote:

I tried your patch and it appeared to be working OK ... however, I have found that if I run lrunzip multiple times, with exactly the same input, then it sometimes seg faults. Unfortunately I can't be more specific than "sometimes", as I can't see any pattern to the failures. My test data is as described in my first comment above: ie the result of running lrzip on a 4KB file populated from /dev/zero.

Valgrind tells me the lrunzip seg fault is at stream.c:1519, ie this line:
if (uci->c_type != CTYPE_NONE) {
I inserted a debug line to see what's going on:
uci = &ucthread[i];
fprintf(stderr, "i=%d uci=%lu\n", i, (unsigned long)uci);
fflush(stderr);

When it works, the debug output looks like this:
i=0 uci=94774476441008
i=1 uci=94774476441056
i=2 uci=94774476441104
and when it seg faults, it's like this:
i=0 uci=94864767026608
i=1 uci=94864767026656
i=2 uci=96
The big numbers differ between runs, but when it seg faults the last uci value is always 96. For what it's worth, even when it seg faults, the uncompressed output is complete and correct, although in the seg faulting case the output file timestamp doesn't get set to match the timestamp of the original input file.

-- Peter Hyman (609)598-0262 (612)440-7383 (Pete)

jlittlenz commented 4 years ago

I suspect I have hit this crash on Ubuntu 19.10; the comment "This is a post decompression failure when the decompressed data is expanded into an unrzipped stream" seems to apply from this -vv output:

....
Taking decompressed data from thread 4
Reading ucomp header at 86550598                        4:0%
Fill_buffer stream 1 c_len 2579624 u_len 95372668 last_head 103922820
Starting thread 4 to decompress 2579624 bytes from stream 1
Taking decompressed data from thread 5
Reading ucomp header at 103922827                               5:0%
Fill_buffer stream 1 c_len 1068205 u_len 36379022 last_head 0
Starting thread 5 to decompress 1068205 bytes from stream 1
Taking decompressed data from thread 6
Taking decompressed data from thread 7                                  6:0%
Taking decompressed data from thread 8
Taking decompressed data from thread 9
Taking decompressed data from thread 1
Taking decompressed data from thread 2
zsh: segmentation fault (core dumped)

This happens every time with lrzip -d; lrzip -d -p1 is ok, but very slow.

I built from git here and there's no problem.

The file is about 2 GiB of source code which lrzip compresses to 100 MiB; I'd have thought that was "normal" use of lrzip.

I am confused by the versioning, or lack of such. APT says the version is 0.631+git180528-1 but "180528" isn't in the git log. Maybe the debian packagers have cherry-picked fixes.

I'm really commenting here to let someone know it seems to be fixed, and in the off chance it might help poor sods who can't read their backups.

pete4abw commented 4 years ago

I am confused by the versioning, or lack of such. APT says the version is 0.631+git180528-1 but "180528" isn't in the git log. Maybe the debian packagers have cherry-picked fixes.

John, this debian package was made on 2018-05-28. Really old, and this bug was addressed as you discovered. CF: Debian Changelog.

jlittlenz commented 4 years ago

John, this debian package was made on 2018-05-28. Really old, and this bug was addressed as you discovered. CF: Debian Changelog.

Thank you. i'd looked at the debian change log, but just found it confusing. I have raised a bug on the Ubuntu package suggesting getting the fix from upstream; I don't know if that's the way things are done.

pete4abw commented 4 years ago

@jlittlenz , I emailed the dev directly, encouraging him to pull a new git. @ckolivas , maybe it's time for a little version bump? Cheers. :-)