Open FrankHeimes opened 3 years ago
After running the script again, I received more info, this time pointing to the file that caused the error:
TypeError: cannot unpack non-iterable Detection object
Path: x64/bin/Qt5WebEngineCored.dll
That file happens to be the largest one, at 548 MB. Could its size be the culprit?
And this time, scancode actually finished and produced a summary and output for the other files.
@FrankHeimes Thank you for the report. That's a sizeable DLL indeed, and the likely cause of the trouble. The difficulty here is finding a balance between skipping such a file entirely, and possibly missing some important information, and finding a way to collect some scan data (possibly DLL metadata and basic file info) but not other data (such as license and copyright details).
Another approach could be to split such a large file into arbitrary chunks (say 5 to 10 MB), run the usual scans more efficiently on these fragments, and add a special check for scannable data and results near the chunk boundaries, which would require restitching and rescanning some chunk regions.
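To make the chunking idea concrete, here is a minimal Python sketch. It is not ScanCode code: the function name `iter_overlapping_chunks` and its parameters are hypothetical. Instead of restitching after the fact, it makes consecutive chunks overlap, so that any match shorter than the overlap is seen whole in at least one chunk.

```python
import io


def iter_overlapping_chunks(f, chunk_size, overlap):
    """Yield (offset, data) chunks of a seekable binary file.

    Consecutive chunks share `overlap` bytes, so a match no longer than
    `overlap + 1` bytes is always contained in at least one chunk.
    Hypothetical helper for illustration only.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    f.seek(0, io.SEEK_END)
    size = f.tell()
    step = chunk_size - overlap
    offset = 0
    while offset < size:
        f.seek(offset)
        yield offset, f.read(chunk_size)
        if offset + chunk_size >= size:
            break  # this chunk already reached the end of the file
        offset += step
```

The offsets let a caller map a hit in a chunk back to its position in the original file. Results found twice in an overlap region would still need de-duplication.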
Yet another one could be a command line option to skip files above a certain size entirely.
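Until such an option exists, the same effect could be approximated outside the scanner. The sketch below is only an assumption about how one might pre-filter a tree. `files_within_size_limit` is a hypothetical helper, not part of ScanCode.

```python
import os


def files_within_size_limit(root, max_bytes):
    """Yield (path, size) for every file under root of at most max_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable entry or broken symlink: skip it
            if size <= max_bytes:
                yield path, size
```

Files above the limit could then be listed separately, so that the report at least mentions what was skipped.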
What would be your take there?
@pombredanne IMHO, binary files warrant type-specific scanners, because they usually have a specific structure. So copyright data can't just appear at random locations in those files. And if it does, then it is random data! For example, the sequence (C) followed by some arbitrary printable characters notoriously triggers false positive matches when using trivial scanners. Taking the structure of a file into account, it may be possible to just seek past 99.9% of the contents of a file and examine only the relevant parts. This way it doesn't matter if the file is 4kB or 4GB in size.
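As an illustration of seeking past most of a file, here is a minimal sketch that reads only the headers of a PE file (the format of DLLs) to find where each section lives on disk. It uses the published PE/COFF header layout. The function name `pe_section_ranges` is hypothetical, and this is not how ScanCode handles DLLs.

```python
import io
import struct


def pe_section_ranges(f):
    """Return {section_name: (file_offset, size)} from the PE headers only."""
    f.seek(0)
    if f.read(2) != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    f.seek(0x3C)  # e_lfanew: file offset of the PE signature
    (pe_offset,) = struct.unpack("<I", f.read(4))
    f.seek(pe_offset)
    if f.read(4) != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # COFF file header: 20 bytes
    coff = struct.unpack("<HHIIIHH", f.read(20))
    number_of_sections, optional_header_size = coff[1], coff[5]
    f.seek(optional_header_size, io.SEEK_CUR)  # skip the optional header
    sections = {}
    for _ in range(number_of_sections):
        # Section table entry: 40 bytes
        entry = struct.unpack("<8sIIIIIIHHI", f.read(40))
        name = entry[0].rstrip(b"\0").decode("ascii", "replace")
        sections[name] = (entry[4], entry[3])  # PointerToRawData, SizeOfRawData
    return sections
```

A type-specific scanner could then seek straight to a section such as `.rsrc` and read only that range, regardless of the total file size.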
Last night, I ran scancode on the boost sources. As a result, it reported the consolidate error on these files:
boost/typeof/vector150.hpp
boost/typeof/vector200.hpp
These files are just 1.3MB and 2.2MB in size and appear to have "innocent" content. However, some individual lines are as long as 5KB.
@FrankHeimes you are nailing it! The thing is that each format may need its own specific handling. In general, there is not much one can squeeze out of compressed data... but as it happens, I once found GPL references in the paths from the central directory of a compressed, unextracted Zip file.
And I routinely find proper license and copyright notices in ELFs and DLLs.
I guess one approach is to at least find a way to ignore most compressed files.
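The Zip case can be reproduced with the Python standard library: `zipfile.ZipFile.namelist()` reads member paths from the central directory without decompressing any member. The sketch below is illustrative only. The `LICENSE_HINTS` tuple and the function name are assumptions, not ScanCode rules.

```python
import io
import zipfile

# Hypothetical, deliberately short list of substrings that hint at licensing.
LICENSE_HINTS = ("license", "copying", "gpl", "notice")


def license_like_paths(zip_source):
    """Return archive member paths whose names hint at license material."""
    with zipfile.ZipFile(zip_source) as archive:
        # namelist() is built from the central directory;
        # no member data is decompressed here.
        return [
            name for name in archive.namelist()
            if any(hint in name.lower() for hint in LICENSE_HINTS)
        ]
```

This only looks at paths, so it is cheap even for very large archives, but it says nothing about the contents of the members.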
The culprit for these boost headers is copyright detection on such large files. The process is roughly explained here: https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/cluecode/copyrights.py#L59
The process consists of several steps, described at the link above.
The issue is that candidate detection is line-based, and a very long line means a very long time to lex and parse, possibly only to find nothing.
One solution would be to break very long lines into chunks, a strategy already adopted for license detection and seen in action here: https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/textcode/analysis.py#L138
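A minimal sketch of what breaking long lines could look like follows. It is not the code at the link above: `split_long_lines` and its `max_len` default are hypothetical. Each piece keeps the number of the line it came from, so reported start and end lines stay meaningful.

```python
def split_long_lines(lines, max_len=1000):
    """Yield (line_number, text) pairs with every text at most max_len long.

    Long lines are cut at the last space at or before max_len when there
    is one, and at exactly max_len otherwise. Hypothetical helper.
    """
    for number, line in enumerate(lines, start=1):
        while len(line) > max_len:
            cut = line.rfind(" ", 0, max_len + 1)
            if cut <= 0:
                cut = max_len  # no usable space: hard break
            yield number, line[:cut]
            line = line[cut:].lstrip(" ")
        yield number, line
```

Cutting at whitespace keeps words intact, but a statement that straddles a cut would still be split, so a real implementation would need to handle that case.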
In the short term, adding --timeout 1000 so that the scan does not time out on such files would help.
On my laptop (Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz, 32GB RAM) I got vector200.hpp to scan alright with a timeout of 300 seconds:
{
"headers": [
{
"tool_name": "scancode-toolkit",
"tool_version": "30.1.0",
"options": {
"input": [
"vector200.hpp"
],
"--copyright": true,
"--json-pp": "-",
"--license": true,
"--license-text": true,
"--license-text-diagnostics": true,
"--timeout": "300.0"
},
"notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
"start_timestamp": "2021-10-08T141949.792133",
"end_timestamp": "2021-10-08T142345.461400",
"output_format_version": "1.0.0",
"duration": 235.66927790641785,
"message": null,
"errors": [],
"extra_data": {
"spdx_license_list_version": "3.14",
"files_count": 1
}
}
],
"files": [
{
"path": "vector200.hpp",
"type": "file",
"licenses": [
{
"key": "boost-1.0",
"score": 59.38,
"name": "Boost Software License 1.0",
"short_name": "Boost 1.0",
"category": "Permissive",
"is_exception": false,
"is_unknown": false,
"owner": "Boost",
"homepage_url": "http://www.boost.org/users/license.html",
"text_url": "http://www.boost.org/LICENSE_1_0.txt",
"reference_url": "https://scancode-licensedb.aboutcode.org/boost-1.0",
"scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/boost-1.0.LICENSE",
"scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/boost-1.0.yml",
"spdx_license_key": "BSL-1.0",
"spdx_url": "https://spdx.org/licenses/BSL-1.0",
"start_line": 5,
"end_line": 6,
"matched_rule": {
"identifier": "boost-1.0_21.RULE",
"license_expression": "boost-1.0",
"licenses": [
"boost-1.0"
],
"referenced_filenames": [
"LICENSE_1_0.txt"
],
"is_license_text": false,
"is_license_notice": true,
"is_license_reference": false,
"is_license_tag": false,
"is_license_intro": false,
"has_unknown": false,
"matcher": "3-seq",
"rule_length": 32,
"matched_length": 19,
"match_coverage": 59.38,
"rule_relevance": 100
},
"matched_text": "Use modification and distribution are subject to the boost Software License,\n// Version 1.0. (See [http]://[www].[boost].[org]/LICENSE_1_0.txt)."
}
],
"license_expressions": [
"boost-1.0"
],
"percentage_of_license_text": 0.01,
"copyrights": [
{
"value": "Copyright (c) 2005 Arkadiy Vertleyb",
"start_line": 2,
"end_line": 2
},
{
"value": "Copyright (c) 2005 Peder Holt",
"start_line": 3,
"end_line": 3
}
],
"holders": [
{
"value": "Arkadiy Vertleyb",
"start_line": 2,
"end_line": 2
},
{
"value": "Peder Holt",
"start_line": 3,
"end_line": 3
}
],
"authors": [],
"scan_errors": []
}
]
}
Scanning done.
Summary: licenses, copyrights with 1 process(es)
Errors count: 0
Scan Speed: 0.00 files/sec.
Initial counts: 1 resource(s): 1 file(s) and 0 directorie(s)
Final counts: 1 resource(s): 1 file(s) and 0 directorie(s)
Timings:
scan_start: 2021-10-08T141949.792133
scan_end: 2021-10-08T142345.461400
setup_scan:licenses: 1.37s
setup: 1.37s
scan: 234.30s
total: 235.67s
Description
After scanning the Qt code base, scancode failed to run the final steps:
How To Reproduce
C:\My\scancode-toolkit-30.0.0
"C:\My\scancode-toolkit-30.0.0\Scripts\scancode" -clpieu --license-score 60 --license-text --license-text-diagnostics --only-findings --strip-root --classify --json D:\SourceCodeLicenses.json --summary --generated --consolidate -n 30 --ignore-author "\.rc$|::|\(User Name\) CString|AppDomainManager|Read\(\)\. DataRecord|^the [A-Z][A-Za-z ]+$|Fred Flintstone\(FST\)|Microsoft Visual|Cortana" --ignore-copyright-holder "Microsoft|BCGSoft|Cortana|Basler.*(Basler|Vision)|Allied Vision|Stemmer" --ignore */BCG/* --ignore */ConfigurationManagement/Certificate.pfx/* --ignore */Salut/* --ignore */doc/* --ignore */tutorials/* --ignore *.acf --ignore *.appxmanifest --ignore *.aux --ignore *.bin --ignore *.bmp --ignore *.config --ignore *.cur --ignore *.dat --ignore *.db --ignore *.def --ignore *.hlsl --ignore *.hlsli --ignore *.ico --ignore *.ifc --ignore *.ilk --ignore *.ipch --ignore *.ism --ignore *.jpg --ignore *.lib --ignore *.manifest --ignore *.mc --ignore *.metagen --ignore *.mp4 --ignore *.nls --ignore *.obj --ignore *.pch --ignore *.pchast --ignore *.pdb --ignore *.pfx --ignore *.png --ignore *.pri --ignore *.resfiles --ignore *.resources --ignore *.rh --ignore *.rsp --ignore *.ruleset --ignore *.snk --ignore *.svd --ignore *.tlb --ignore *.tlh --ignore *.tli --ignore *.tlog --ignore *.ver --ignore *.winmd --ignore *.xbf --ignore *.xdc --ignore *.xsd D:\ExtractedQtPackage\Qt
Note that scanning other packages (e.g. MKL from Intel) with the same command succeeds.
System configuration
AMD Ryzen 9 3950X (16 core, 32 thread), 32GB RAM, M.2 SSD 970 EVO Plus 1TB, Windows 10 Enterprise LTSC (1809), Python 3.9, Scancode 30.0.0, downloaded and extracted to C:\My