Open sutula opened 3 years ago
@sutula Thank you for the report.
There are two issues:
It seems that there is an extra newline (0x0a
) at the end of the file. After removing that gunzip
doesn't report an error.
It seems that it wasn't intentional: https://github.com/ruby/rake/commit/ea834c6afde1bdb25edc4ba82fc0ae0c4c4df2ce
but that doesn't mean that scancode shouldn't be able to handle it. It might be better to switch to some sort of streaming unpacking of gzip files.
While the discussion continues on how to best handle these cases, is there a suggestion for how to work around the problem, such as a small patch that could allow unpack to continue with the non-corrupt content?
@armijnhemel thank you ++ for this awesome research!
@sutula
While the discussion continues on how to best handle these cases, is there a suggestion for how to work around the problem, such as a small patch that could allow unpack to continue with the non-corrupt content?
Yes, it should not crash at all.
Note that for the extraction of a plain gzip files we have on hand:
And this is what is used today and it had provisions for trailing garbage in https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/uncompress.py#L101
So this is a bug and a regression.
@sutula I tried this setup:
$ wget https://github.com/nexB/scancode-toolkit/releases/download/v30.1.0/scancode-toolkit-30.1.0_py37-linux.tar.xz
$ pyenv install 3.7.3
$ tar -xf scancode-toolkit-30.1.0_py37-linux.tar.xz
$ cd scancode-toolkit-30.1.0/
$ pyenv local 3.7.3
$ ./configure
$ wget https://github.com/ruby/rake/archive/refs/tags/v0.9.2.2.zip
$ ./extractcode --verbose rake-0.9.2.2.zip
Extracting archives...
Extracting: rake-0.9.2.2.zip
Extracting: rake-0.9.2.2.zip
Extracting: rake.1.gz
Extracting: rake.1.gz
ERROR extracting: /home/pombreda/sc301/scancode-toolkit-30.1.0/rake-0.9.2.2.zip-extract/rake-0.9.2.2/doc/rake.1.gz: Not a gzipped file (b'\n')
Extracting done.
I am not able to reproduce the crash. So I may be missing something in my setup? That said I note something peculiar in your output log:
Traceback (most recent call last):
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/bin/extractcode", line 8, in <module>
sys.exit(extractcode())
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/commoncode/cliutils.py", line 75, in main
**extra,
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/extractcode/cli.py", line 301, in extractcode
for xev in extraction_events:
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/commoncode/cliutils.py", line 170, in generator
yield from self.iter
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/extractcode/api.py", line 53, in extract_archives
ignore_pattern=ignore_pattern,
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/extractcode/extract.py", line 144, in extract
fileutils.copytree(target, source)
File "/home/sutula/Scancode/scancode-toolkit-30.1.0/lib/python3.7/site-packages/commoncode/fileutils.py", line 391, in copytree
names = os.listdir(src)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/trash/rake-0.9.2.2.zip-extract/rake-0.9.2.2/doc/rake.1.gz-extract'
The last line is about files in '/tmp/trash/rake-0.9.2.2.zip-extract/rake-0.9.2.2/doc/rake.1.gz-extract
... could this be because some files may have been moved or trashed while extracting?
I apologize for not providing the entire extract command. The difference between your result and mine seems to be that I use the "--replace-originals" option. So some code path when --replace-originals is specified isn't checking for an unpack failure before proceeding to "replace the originals" with extracted content, resulting in the crash I see.
Thank-you very much, though, for trying this. I asked for a workaround, and not specifying "--replace-originals" seems ideal. It's a little more work for the calling script, but allows analysis work to move forward until this is fixed in a released version.
As a fix is created, can I recommend that such "failures to unpack" result in a different diagnostic from extractcode? Even running extractcode as you did, the return value is "1" indicating a failure. While there is an unpack failure, I wonder whether users would benefit from knowing the difference from some absolute failure (can't open input file, command line parameter error, etc.) and the case we're discussing, where most of the archive is recursively unpacked but some unpacks had errors. For example, the scripts I write to unpack and analyze source code would benefit from a return value of "2" if at least some of the archive got unpacked. But perhaps this is splitting hairs...just consider this a suggestion.
Procedurally, shall we split this issue into two?
@sutula Thanks! it makes more sense to me now and I can reproduce alright. There is indeed a class of extraction issues which are more in the realm of a warning like "we tried everything we could to extract this file and we cannot do much more, this is either a corrupted or truncated archive or not really an archive, look into it" ... (well may be not as verbose, but that's the gist of it). So there are really three things here:
We can keep this in a single issue or not. We may want to move this to extractcode too, your call.
Now like is often the case, a bug or issue may only be a symptom for a deeper issue... here I wonder if the underlying issue is not "--replace-originals" ... why did you use this? Space saving? Rather I would assume that's because you did not want to scan things twice: both the archives and their content is one scan too many. So may be there should be something else to consider: since we use a reasonably unique convention of always adding a -extract
suffix to extractcode
-extracted directories, may there should be a feature of "no-double-scan-of-extracted-archives" where when the conditions of having both a foo.tgz-extract
directory and a foo.tgz
archive, only foo.tgz-extract
is scanned ?
Just some food for thoughts!
@sutula see https://github.com/nexB/extractcode/pull/33
Thanks for the quick patch. I believe it solves the extractcode --replace-originals issue. See patch discussion.
Other suggestions and potential problems were raised in discussion of this issue. I propose to split the other items out into separate issues, leaving this issue only for the originally-intended extractcode crash. Unless there is objection, I'll submit these new issues by early next week.
@sutula that's a perfect plan. :+1:
Issues #2740 and #2741 have been submitted. I believe this covers all the extra discussion in this issue. I believe this issue can now be closed given the commit that fixes the original issue.
Description
In normal use of extractcode, I ran across a number of cases where extractcode crashes. So far, I've investigated one of these cases.
The target archive is here. Within the Ruby Rake zipfile is what appears to be a compressed documentation file, rake-0.9.2.2/doc/rake.1.gz which could be directly downloaded here if you wish.
Manually trying to unzip the offending rake.1.gz file, I get:
gunzip rake.1.gz gzip: rake.1.gz: unexpected end of file
echo $? 1
Running extractcode on the entire v0.9.2.2.zip produces the output in the attached file, basically a crash when extractcode attempts to copy the extracted output into the destination tree. I'm guessing this is a case where a failure of the archive unpack utility is not detected.
What should be done in cases like this? As a user, I'd like to know that portions of the extract failed, in case I want to investigate further. But in the mean time, I'd like to see extract proceed, continuing to try to extract other portions of the the archive. The end goal is license analysis of the package, and it's helpful to have license results for the non-corrupt portions of the package, even if one particular file or subdirectory is not able to be scanned.
My test system is: Debian Linux, python 3.7.3, scancode-toolkit 30.1.0, installed from download of released Linux tar archive