aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/
2.11k stars 546 forks

--only-findings doesn't filter out uninteresting files in --package mode #3700

Open ben-c8y opened 7 months ago

ben-c8y commented 7 months ago

Description

When using --only-findings together with --package, the only-findings filtering doesn't take effect since the for_packages attribute is always set. This rather defeats the purpose and leads to output files that are 100s of times bigger than really needed.

I suggest that https://github.com/nexB/scancode-toolkit/blob/develop/src/scancode/plugin_only_findings.py needs an exception to ignore for_packages

How To Reproduce

For example:

    scancode --package --copyright --license --license-text --only-findings --classify --summary --tallies --license-clarity-score --json-pp=scancode-package-and-only-findings.json PATH

where PATH could be extracted from something like https://repo1.maven.org/maven2/log4j/log4j/1.2.17/log4j-1.2.17.jar

The resulting JSON is about 10,000 lines (instead of the roughly 1,000 you'd get without the --package option) due to uninformative items like:

    {
      "path": "log4j-1.2.17.jar-extract/org/apache/log4j/Appender.class",
      "type": "file",
      "package_data": [],
      "for_packages": [
        "pkg:maven/log4j/log4j@1.2.17?uuid=03c476c1-0273-4156-ba39-639b19b337c5"
      ],
      "is_legal": false,
      "is_manifest": false,
      "is_readme": false,
      "is_top_level": false,
      "is_key_file": false,
      "detected_license_expression": null,
      "detected_license_expression_spdx": null,
      "license_detections": [],
      "license_clues": [],
      "percentage_of_license_text": 0,
      "copyrights": [],
      "holders": [],
      "authors": [],
      "scan_errors": []
    },


pombredanne commented 7 months ago

@ben-sag Thanks for the report. The files are reported as being part of the package because they are part of it, and we treat this as a finding (actually a rather important one IMHO).

What would you expect? If you have another context where log4j was part of a larger scan, only its own files would be reported as belonging to log4j.

See attached scan (in YAML) (2493 lines): log4j-1.2.17.yaml.txt

ben-c8y commented 7 months ago

Thanks. I think for the use case where you are happy for every single file to be listed (e.g. to find out what package each is part of) you simply wouldn't specify --only-findings. People who specify this only-findings option are doing so because they want to trim down the set of files (it's a factor of 10 difference in file size!) and the for_packages logic defeats this entirely. It all depends on what definition we pick for what is a "finding" and I'd argue the useful choice here is the licencey/copyright-y stuff but not basic things like for_packages which are often present on a huge number of files that you don't care about.
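To make the proposed definition of a "finding" concrete, here is a minimal Python sketch that post-processes file entries shaped like the JSON example above. It is not the code in plugin_only_findings.py; the names `IGNORED_KEYS`, `has_findings` and `only_findings` are hypothetical, and the choice of which keys to ignore is an assumption made for illustration.

```python
# Keys that, on their own, say nothing about licenses or copyrights.
# This set is an assumption for illustration, not ScanCode's own definition.
IGNORED_KEYS = frozenset({"path", "type", "for_packages"})


def has_findings(entry, ignored_keys=IGNORED_KEYS):
    """Return True if any non-ignored attribute of a file entry is non-empty."""
    return any(value for key, value in entry.items() if key not in ignored_keys)


def only_findings(files, ignored_keys=IGNORED_KEYS):
    """Keep only the file entries that carry at least one finding."""
    return [entry for entry in files if has_findings(entry, ignored_keys)]


# An entry like the Appender.class one above: only for_packages is populated.
uninteresting = {
    "path": "log4j-1.2.17.jar-extract/org/apache/log4j/Appender.class",
    "type": "file",
    "package_data": [],
    "for_packages": ["pkg:maven/log4j/log4j@1.2.17"],
    "is_legal": False,
    "license_detections": [],
    "copyrights": [],
    "scan_errors": [],
}

# The same entry with a (made-up) copyright detection attached.
interesting = dict(uninteresting, copyrights=[{"copyright": "example"}])

kept = only_findings([uninteresting, interesting])
print(len(kept))
```

With this definition the first entry is dropped and the second is kept. If a scan also includes attributes that are always populated (file size or dates from an info-style scan, say), those keys would need to be added to the ignored set, which is part of why the definition of a "finding" is a judgment call.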

pombredanne commented 7 months ago

@ben-sag fair enough. Personally I seldom use --only-findings, but I can see how it can help in some cases and I am fine to exclude package files that have no other clues. Can I interest you in a PR?

Note that a likely better option is to use ScanCode.io if you still want a thorough scan (including code matching against the PurlDB) but would prefer streamlined reporting. And/or import your scans into DejaCode.