ERROR: failed to run post-scan plugin: consolidate: and False positive on very long lines

Description

After scanning the Qt code base, scancode failed to run the final steps:

PS C:\My\scancode-toolkit-30.0.0> d:\MyScript.ps1
Setup plugins...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls, generated with 30 process(es)...
[--------------------] 0

The following message repeated many times:
c:\my\scancode-toolkit-30.0.0\src\cluecode\copyrights.py:3382: FutureWarning: Possible set difference at position 3
  remove_tags = re.compile(

[####################] 29462
ERROR: failed to run post-scan plugin: consolidate:
Traceback (most recent call last):
  File "c:\my\scancode-toolkit-30.0.0\src\scancode\cli.py", line 1057, in run_codebase_plugins
    plugin.process_codebase(codebase, **kwargs)
  File "c:\my\scancode-toolkit-30.0.0\src\summarycode\plugin_consolidate.py", line 159, in process_codebase
    consolidations.extend(get_consolidated_packages(codebase))
  File "c:\my\scancode-toolkit-30.0.0\src\summarycode\plugin_consolidate.py", line 239, in get_consolidated_packages
    for _, holder, _, _ in CopyrightDetector().detect(numbered_lines,
TypeError: cannot unpack non-iterable Detection object

C:\My\scancode-toolkit-30.0.0\lib\site-packages\fingerprints\cleanup.py:54: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
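
The traceback above suggests that the caller unpacks each detection as a 4-tuple while the detector yields objects. Here is a minimal sketch of that mismatch. The `Detection` class and the toy `detect` function are hypothetical stand-ins for illustration, not scancode's actual code:

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for a detection result (not scancode's real class)."""
    kind: str
    value: str
    start_line: int
    end_line: int


def detect(numbered_lines):
    """Toy detector: yield one Detection object per line mentioning 'Copyright'."""
    for lineno, line in numbered_lines:
        if "Copyright" in line:
            yield Detection("holder", line.strip(), lineno, lineno)


lines = [(1, "Copyright (c) 2005 Peder Holt")]

# Caller written for 4-tuples: unpacking a plain object raises TypeError.
error_message = None
try:
    for _, holder, _, _ in detect(lines):
        pass
except TypeError as exc:
    error_message = str(exc)

# Caller adapted to objects: read attributes instead of unpacking.
holders = [d.value for d in detect(lines) if d.kind == "holder"]
```

If this reading is right, the failure depends on the consolidate step reaching that code path for a given file, not on the file's size as such.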

How To Reproduce

Download and extract Qt from https://www.qt.io/download
Install Python 3.9
Download and extract scancode 30.0.0 to C:\My\scancode-toolkit-30.0.0
Run scancode once to set it up
run "C:\My\scancode-toolkit-30.0.0\Scripts\scancode" -clpieu --license-score 60 --license-text --license-text-diagnostics --only-findings --strip-root --classify --json D:\SourceCodeLicenses.json --summary --generated --consolidate -n 30 --ignore-author "\.rc$|::|$User Name$ CString|AppDomainManager|Read\. DataRecord|^the [A-Z][A-Za-z ]+$|Fred Flintstone$FST$|Microsoft Visual|Cortana" --ignore-copyright-holder "Microsoft|BCGSoft|Cortana|Basler.*(Basler|Vision)|Allied Vision|Stemmer" --ignore */BCG/* --ignore */ConfigurationManagement/Certificate.pfx/* --ignore */Salut/* --ignore */doc/* --ignore */tutorials/* --ignore *.acf --ignore *.appxmanifest --ignore *.aux --ignore *.bin --ignore *.bmp --ignore *.config --ignore *.cur --ignore *.dat --ignore *.db --ignore *.def --ignore *.hlsl --ignore *.hlsli --ignore *.ico --ignore *.ifc --ignore *.ilk --ignore *.ipch --ignore *.ism --ignore *.jpg --ignore *.lib --ignore *.manifest --ignore *.mc --ignore *.metagen --ignore *.mp4 --ignore *.nls --ignore *.obj --ignore *.pch --ignore *.pchast --ignore *.pdb --ignore *.pfx --ignore *.png --ignore *.pri --ignore *.resfiles --ignore *.resources --ignore *.rh --ignore *.rsp --ignore *.ruleset --ignore *.snk --ignore *.svd --ignore *.tlb --ignore *.tlh --ignore *.tli --ignore *.tlog --ignore *.ver --ignore *.winmd --ignore *.xbf --ignore *.xdc --ignore *.xsd D:\ExtractedQtPackage\Qt

Note that scanning other packages (e.g. MKL from Intel) using the same command succeeds.

System configuration

AMD Ryzen 9 3950X (16 cores, 32 threads), 32GB RAM, M.2 SSD 970 EVO Plus 1TB, Windows 10 Enterprise LTSC (1809), Python 3.9, Scancode 30.0.0, downloaded and extracted to C:\My

After running the script again, I received more info, this time pointing to the file that caused the error:

TypeError: cannot unpack non-iterable Detection object
Path: x64/bin/Qt5WebEngineCored.dll

That file happens to be the largest one, at 548 MB. Could its size be the culprit?

And this time, scancode actually ended and produced a summary and output for the other files.

@FrankHeimes Thank you for the report. That's a sizeable DLL indeed, and it is the likely cause of the trouble. The difficulty in this case is finding the right balance between skipping such a file entirely, and possibly missing some important information, and finding a way to collect some scan data (possibly DLL metadata and basic file info) but not other data (such as license and copyright details).

Another approach could be to split such a large file into arbitrary chunks (say 5 to 10MB), run scans as usual and more efficiently on these fragments, and add a special check for any scannable data and results near the chunk boundaries that would require restitching and rescanning some chunk regions.
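
As a rough sketch of the chunking idea, one way to handle boundaries is to let each fragment start with the tail of the previous one, so text straddling a cut appears whole in at least one fragment. The function name, chunk size and overlap below are illustrative assumptions, not anything scancode provides:

```python
import io


def iter_chunks(stream, chunk_size=8 * 1024 * 1024, overlap=4096):
    """Yield (offset, data) fragments from a binary stream.

    Every fragment after the first begins with the last `overlap` bytes
    of the previous fragment, so a notice cut by a boundary is still
    seen in one piece (as long as it is shorter than `overlap`).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    offset = 0
    tail = b""
    while True:
        block = stream.read(chunk_size - len(tail))
        if not block:
            break
        data = tail + block
        yield offset, data
        tail = data[-overlap:] if overlap else b""
        # The next fragment starts where the carried-over tail begins.
        offset += len(data) - len(tail)


payload = b"0123456789"
chunks = list(iter_chunks(io.BytesIO(payload), chunk_size=4, overlap=2))
```

Results found in the overlapping region would show up twice, so a real implementation would need to deduplicate them by absolute offset.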

Yet another one could be to have a command line option to skip files above a certain size entirely.
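
A size cutoff could look roughly like the sketch below. It is a standalone pre-filter written for illustration, under the assumption that oversized files should still be listed rather than dropped silently. It is not an existing scancode option, and `split_by_size` is a made-up name:

```python
import os


def split_by_size(root, max_bytes):
    """Walk `root` and return (kept, skipped) lists of file paths.

    Files larger than `max_bytes` go to `skipped` so they can still be
    listed in a report instead of disappearing silently.
    """
    kept, skipped = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            target = kept if os.path.getsize(path) <= max_bytes else skipped
            target.append(path)
    return sorted(kept), sorted(skipped)
```

Keeping the skipped list addresses part of the concern about missing information: basic file info can still be reported for those files even if license and copyright detection is not run.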

What would be your take there?

@pombredanne IMHO, binary files warrant type specific scanners, because they usually have a specific structure. So copyright data can't just appear at random locations in those files. And if it does, then it is random data! For example, the (C) character followed by some arbitrary printable characters notoriously triggers false positive matches when using trivial scanners. Taking the structure of a file into account, it may be possible to just seek beyond 99.9% of the contents of a file to examine the relevant parts. This way it doesn't matter if the file is 4kB or 4GB in size.

Last night, I ran scancode on the boost sources. As a result, it reported the consolidate error on these files:

boost/typeof/vector150.hpp
boost/typeof/vector200.hpp

These files are just 1.3MB and 2.2MB in size and appear to have "innocent" content. However, some individual lines are as long as 5KB.

@FrankHeimes you are nailing it! The thing is that each format may need its own specific handling. In general there is not much one can squeeze out of compressed data, though as it happens I once found GPL references in the paths of a compressed and unextracted Zip central directory.

And I routinely find proper license and copyright in ELF and DLLs.

I guess one approach is to at least find a way to ignore most compressed files.
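
To illustrate the Zip central directory remark: Python's standard `zipfile` module lists member paths from the central directory without decompressing any member, so a format-aware check can look at paths only. The hint list and function name are assumptions made for this sketch:

```python
import io
import zipfile

# Illustrative hints only; a real scanner would use a far richer set.
LICENSE_HINTS = ("gpl", "license", "copying", "copyright")


def license_like_member_paths(zip_source):
    """Return member paths whose name contains a license-like hint.

    Only the archive's central directory is read; no member data is
    decompressed.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [
            name
            for name in archive.namelist()
            if any(hint in name.lower() for hint in LICENSE_HINTS)
        ]


# Build a small archive in memory to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("pkg/COPYING.GPL", "license text placeholder")
    archive.writestr("pkg/src/main.c", "int main(void) { return 0; }")
buffer.seek(0)
found = license_like_member_paths(buffer)
```

The same idea, reading the structural metadata and skipping the compressed payload, is what makes the file size mostly irrelevant for this kind of check.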

The culprit is the copyright detection on these large files. The process for this is roughly explained here: https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/cluecode/copyrights.py#L59

The process consists of:

Prepare and clean up the text.
Identify regions of text that may contain copyrights (using hints). These are called "candidates".
Tag the text (i.e. lex it) with parts-of-speech (POS) tags to identify the various parts of copyright statements such as dates, companies, names ("named entities"), etc. This is done using pygmars, which contains a lexer derived from the NLTK POS tagger.
Feed the tagged text to a parsing grammar describing actual copyright statements (also using pygmars) and obtain a parse tree.
Walk the parse tree and yield copyright statements, holders and authors with start and end lines from the parse tree, with some extra post-detection cleanups.
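
The steps above can be sketched in a few lines of plain Python. This is a deliberately tiny toy that uses regular expressions in place of the pygmars lexer and grammar. Every name, pattern and tag in it is an assumption made for illustration and none of it is taken from `cluecode/copyrights.py`:

```python
import re

HINTS = ("copyright", "(c)", "©")

# Toy "lexer" rules: the first matching pattern gives the token its tag.
TOKEN_TAGS = [
    (re.compile(r"^(copyright|\(c\)|©)$", re.IGNORECASE), "COPY"),
    (re.compile(r"^(19|20)\d\d,?$"), "YEAR"),
    (re.compile(r"^[A-Z][\w.\-]*,?$"), "NAME"),
]


def prepare(line):
    """Step 1: strip whitespace and leading comment markers."""
    return line.strip().lstrip("/*#").strip()


def candidates(numbered_lines):
    """Step 2: keep only lines containing a copyright hint."""
    for lineno, line in numbered_lines:
        if any(hint in line.lower() for hint in HINTS):
            yield lineno, line


def tag(line):
    """Step 3: tag each whitespace-separated token."""
    tagged = []
    for token in line.split():
        label = next((t for p, t in TOKEN_TAGS if p.match(token)), "JUNK")
        tagged.append((token, label))
    return tagged


def parse(tagged):
    """Step 4: toy grammar COPY+ YEAR* NAME+ -> (statement, holder) or None."""
    labels = [label for _, label in tagged]
    if "COPY" not in labels:
        return None
    start = i = labels.index("COPY")
    while i < len(labels) and labels[i] == "COPY":
        i += 1
    while i < len(labels) and labels[i] == "YEAR":
        i += 1
    name_start = i
    while i < len(labels) and labels[i] == "NAME":
        i += 1
    if i == name_start:
        return None
    tokens = [token for token, _ in tagged]
    return " ".join(tokens[start:i]), " ".join(tokens[name_start:i])


def detect(text):
    """Step 5: yield (statement, holder, line number) for each match."""
    numbered = [(n, prepare(l)) for n, l in enumerate(text.splitlines(), 1)]
    for lineno, line in candidates(numbered):
        result = parse(tag(line))
        if result:
            yield result[0], result[1], lineno


sample = "// Copyright (c) 2005 Arkadiy Vertleyb\n// Copyright (c) 2005 Peder Holt\nint f(c);\n"
detections = list(detect(sample))
```

Note that `int f(c);` passes the hint check and is tagged and parsed before being rejected. That is the cost that grows with line length: every candidate line goes through lexing and parsing even when it contains no copyright statement.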

The issue is that candidate detection is based on lines, so a very long line means a very long time to lex and parse, possibly to find nothing.

One solution would be to break very long lines into chunks, a strategy already adopted for license detection and seen in action here: https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/textcode/analysis.py#L138
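
A minimal sketch of line breaking, assuming a split at the last space before a length limit with a hard cut as fallback. This is not the code at the linked `analysis.py` location, and the function names and the default limit are made up for illustration:

```python
def break_long_line(line, max_len=2000):
    """Split `line` into pieces of at most `max_len` characters.

    Prefer cutting at the last space within the limit; fall back to a
    hard cut when the leading `max_len` characters contain no usable space.
    """
    pieces = []
    while len(line) > max_len:
        cut = line.rfind(" ", 0, max_len + 1)
        if cut <= 0:
            cut = max_len  # no usable whitespace: hard split
        pieces.append(line[:cut])
        line = line[cut:].lstrip(" ")
    if line:
        pieces.append(line)
    return pieces


def numbered_pieces(lines, max_len=2000):
    """Yield (original line number, piece) so results keep correct line numbers."""
    for lineno, line in enumerate(lines, start=1):
        for piece in break_long_line(line, max_len):
            yield lineno, piece


example = list(numbered_pieces(["aaa bbb ccc", "abcdefghij"], max_len=7))
```

Each piece keeps the number of the line it came from, so reported start and end lines stay valid. A statement that straddles a cut could still be missed, which is the same boundary problem as with chunking large files.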

In the short term, adding --timeout 1000 so that the scan does not time out on such files would help.

On my laptop (Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz, 32GB RAM) I got vector200.hpp to scan fine with a timeout of 300 seconds:

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "30.1.0",
      "options": {
        "input": [
          "vector200.hpp"
        ],
        "--copyright": true,
        "--json-pp": "-",
        "--license": true,
        "--license-text": true,
        "--license-text-diagnostics": true,
        "--timeout": "300.0"
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2021-10-08T141949.792133",
      "end_timestamp": "2021-10-08T142345.461400",
      "output_format_version": "1.0.0",
      "duration": 235.66927790641785,
      "message": null,
      "errors": [],
      "extra_data": {
        "spdx_license_list_version": "3.14",
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "vector200.hpp",
      "type": "file",
      "licenses": [
        {
          "key": "boost-1.0",
          "score": 59.38,
          "name": "Boost Software License 1.0",
          "short_name": "Boost 1.0",
          "category": "Permissive",
          "is_exception": false,
          "is_unknown": false,
          "owner": "Boost",
          "homepage_url": "http://www.boost.org/users/license.html",
          "text_url": "http://www.boost.org/LICENSE_1_0.txt",
          "reference_url": "https://scancode-licensedb.aboutcode.org/boost-1.0",
          "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/boost-1.0.LICENSE",
          "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/boost-1.0.yml",
          "spdx_license_key": "BSL-1.0",
          "spdx_url": "https://spdx.org/licenses/BSL-1.0",
          "start_line": 5,
          "end_line": 6,
          "matched_rule": {
            "identifier": "boost-1.0_21.RULE",
            "license_expression": "boost-1.0",
            "licenses": [
              "boost-1.0"
            ],
            "referenced_filenames": [
              "LICENSE_1_0.txt"
            ],
            "is_license_text": false,
            "is_license_notice": true,
            "is_license_reference": false,
            "is_license_tag": false,
            "is_license_intro": false,
            "has_unknown": false,
            "matcher": "3-seq",
            "rule_length": 32,
            "matched_length": 19,
            "match_coverage": 59.38,
            "rule_relevance": 100
          },
          "matched_text": "Use modification and distribution are subject to the boost Software License,\n// Version 1.0. (See [http]://[www].[boost].[org]/LICENSE_1_0.txt)."
        }
      ],
      "license_expressions": [
        "boost-1.0"
      ],
      "percentage_of_license_text": 0.01,
      "copyrights": [
        {
          "value": "Copyright (c) 2005 Arkadiy Vertleyb",
          "start_line": 2,
          "end_line": 2
        },
        {
          "value": "Copyright (c) 2005 Peder Holt",
          "start_line": 3,
          "end_line": 3
        }
      ],
      "holders": [
        {
          "value": "Arkadiy Vertleyb",
          "start_line": 2,
          "end_line": 2
        },
        {
          "value": "Peder Holt",
          "start_line": 3,
          "end_line": 3
        }
      ],
      "authors": [],
      "scan_errors": []
    }
  ]
}
Scanning done.
Summary:        licenses, copyrights with 1 process(es)
Errors count:   0
Scan Speed:     0.00 files/sec. 
Initial counts: 1 resource(s): 1 file(s) and 0 directorie(s) 
Final counts:   1 resource(s): 1 file(s) and 0 directorie(s) 
Timings:
  scan_start: 2021-10-08T141949.792133
  scan_end:   2021-10-08T142345.461400
  setup_scan:licenses: 1.37s
  setup: 1.37s
  scan: 234.30s
  total: 235.67s
