extractcode crash with invalid gzip archive

sutula commented 3 years ago

Description

In normal use of extractcode, I ran across a number of cases where extractcode crashes. So far, I've investigated one of these cases.

The target archive is here. Within the Ruby Rake zipfile is what appears to be a compressed documentation file, rake-0.9.2.2/doc/rake.1.gz which could be directly downloaded here if you wish.

Manually trying to unzip the offending rake.1.gz file, I get:

gunzip rake.1.gz gzip: rake.1.gz: unexpected end of file

echo $? 1

Running extractcode on the entire v0.9.2.2.zip produces the output in the attached file, basically a crash when extractcode attempts to copy the extracted output into the destination tree. I'm guessing this is a case where a failure of the archive unpack utility is not detected.

What should be done in cases like this? As a user, I'd like to know that portions of the extract failed, in case I want to investigate further. But in the mean time, I'd like to see extract proceed, continuing to try to extract other portions of the the archive. The end goal is license analysis of the package, and it's helpful to have license results for the non-corrupt portions of the package, even if one particular file or subdirectory is not able to be scanned.

My test system is: Debian Linux, python 3.7.3, scancode-toolkit 30.1.0, installed from download of released Linux tar archive

pombredanne commented 3 years ago

@sutula Thank you for the report.

There are two issues:

there is a crash and that's a bug
I wonder why so many compressed man pages have trailing garbage as this is something I often see

armijnhemel commented 3 years ago

It seems that there is an extra newline (0x0a) at the end of the file. After removing that gunzip doesn't report an error.

armijnhemel commented 3 years ago

It seems that it wasn't intentional: https://github.com/ruby/rake/commit/ea834c6afde1bdb25edc4ba82fc0ae0c4c4df2ce

but that doesn't mean that scancode shouldn't be able to handle it. It might be better to switch to some sort of streaming unpacking of gzip files.

sutula commented 3 years ago

While the discussion continues on how to best handle these cases, is there a suggestion for how to work around the problem, such as a small patch that could allow unpack to continue with the non-corrupt content?

pombredanne commented 3 years ago

@armijnhemel thank you ++ for this awesome research!

@sutula

While the discussion continues on how to best handle these cases, is there a suggestion for how to work around the problem, such as a small patch that could allow unpack to continue with the non-corrupt content?

Yes, it should not crash at all.

Note that for the extraction of a plain gzip files we have on hand:

https://github.com/nexB/extractcode/blob/main/src/extractcode/libarchive2.py which is streaming, but probably not in the way @armijnhemel mentioned
https://github.com/nexB/extractcode/blob/main/src/extractcode/sevenzip.py which is not streaming but works fine with trailing garbage
the standard library https://github.com/nexB/extractcode/blob/main/src/extractcode/uncompress.py#L11

And this is what is used today and it had provisions for trailing garbage in https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/uncompress.py#L101

So this is a bug and a regression.

pombredanne commented 3 years ago

@sutula I tried this setup:

$ wget https://github.com/nexB/scancode-toolkit/releases/download/v30.1.0/scancode-toolkit-30.1.0_py37-linux.tar.xz 
$ pyenv install 3.7.3
$ tar -xf scancode-toolkit-30.1.0_py37-linux.tar.xz 
$ cd scancode-toolkit-30.1.0/
$ pyenv local 3.7.3
$ ./configure
$ wget https://github.com/ruby/rake/archive/refs/tags/v0.9.2.2.zip
$ ./extractcode --verbose rake-0.9.2.2.zip 
Extracting archives...
Extracting: rake-0.9.2.2.zip
Extracting: rake-0.9.2.2.zip
Extracting: rake.1.gz
Extracting: rake.1.gz
ERROR extracting: /home/pombreda/sc301/scancode-toolkit-30.1.0/rake-0.9.2.2.zip-extract/rake-0.9.2.2/doc/rake.1.gz: Not a gzipped file (b'\n')
Extracting done.

I am not able to reproduce the crash. So I may be missing something in my setup? That said I note something peculiar in your output log:

Traceback (most recent call last):
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/bin/extractcode", line 8, in <module>
    sys.exit(extractcode())
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/commoncode/cliutils.py", line 75, in main
    **extra,
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/extractcode/cli.py", line 301, in extractcode
    for xev in extraction_events:
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/commoncode/cliutils.py", line 170, in generator
    yield from self.iter
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/extractcode/api.py", line 53, in extract_archives
    ignore_pattern=ignore_pattern,
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/extractcode/extract.py", line 144, in extract
    fileutils.copytree(target, source)
  File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/commoncode/fileutils.py", line 391, in copytree
    names = os.listdir(src)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/trash/rake-0.9.2.2.zip-extract/rake-0.9.2.2/doc/rake.1.gz-extract'

The last line is about files in '/tmp/trash/rake-0.9.2.2.zip-extract/rake-0.9.2.2/doc/rake.1.gz-extract ... could this be because some files may have been moved or trashed while extracting?

sutula commented 3 years ago

I apologize for not providing the entire extract command. The difference between your result and mine seems to be that I use the "--replace-originals" option. So some code path when --replace-originals is specified isn't checking for an unpack failure before proceeding to "replace the originals" with extracted content, resulting in the crash I see.

Thank-you very much, though, for trying this. I asked for a workaround, and not specifying "--replace-originals" seems ideal. It's a little more work for the calling script, but allows analysis work to move forward until this is fixed in a released version.

As a fix is created, can I recommend that such "failures to unpack" result in a different diagnostic from extractcode? Even running extractcode as you did, the return value is "1" indicating a failure. While there is an unpack failure, I wonder whether users would benefit from knowing the difference from some absolute failure (can't open input file, command line parameter error, etc.) and the case we're discussing, where most of the archive is recursively unpacked but some unpacks had errors. For example, the scripts I write to unpack and analyze source code would benefit from a return value of "2" if at least some of the archive got unpacked. But perhaps this is splitting hairs...just consider this a suggestion.

Procedurally, shall we split this issue into two?

Extract --replace-originals crashes when archives are corrupt
Try to be more tolerant of archives where the only issue is an extra newline tacked onto the end.

pombredanne commented 3 years ago

@sutula Thanks! it makes more sense to me now and I can reproduce alright. There is indeed a class of extraction issues which are more in the realm of a warning like "we tried everything we could to extract this file and we cannot do much more, this is either a corrupted or truncated archive or not really an archive, look into it" ... (well may be not as verbose, but that's the gist of it). So there are really three things here:

Bug: Extract --replace-originals crashes when archives are corrupt
Try to be more tolerant of archives where the only issue is an extra newline tacked onto the end.
Distinguish between errors that are crashing extraction entirely and "cannot extract" warnings as proposed above

We can keep this in a single issue or not. We may want to move this to extractcode too, your call.

pombredanne commented 3 years ago

Now like is often the case, a bug or issue may only be a symptom for a deeper issue... here I wonder if the underlying issue is not "--replace-originals" ... why did you use this? Space saving? Rather I would assume that's because you did not want to scan things twice: both the archives and their content is one scan too many. So may be there should be something else to consider: since we use a reasonably unique convention of always adding a -extract suffix to extractcode-extracted directories, may there should be a feature of "no-double-scan-of-extracted-archives" where when the conditions of having both a foo.tgz-extract directory and a foo.tgz archive, only foo.tgz-extract is scanned ? Just some food for thoughts!

pombredanne commented 3 years ago

@sutula see https://github.com/nexB/extractcode/pull/33

sutula commented 3 years ago

Thanks for the quick patch. I believe it solves the extractcode --replace-originals issue. See patch discussion.

Other suggestions and potential problems were raised in discussion of this issue. I propose to split the other items out into separate issues, leaving this issue only for the originally-intended extractcode crash. Unless there is objection, I'll submit these new issues by early next week.

pombredanne commented 3 years ago

@sutula that's a perfect plan. :+1:

sutula commented 3 years ago

Issues #2740 and #2741 have been submitted. I believe this covers all the extra discussion in this issue. I believe this issue can now be closed given the commit that fixes the original issue.

aboutcode-org / scancode-toolkit

extractcode crash with invalid gzip archive #2723