Open ben-c8y opened 7 months ago
@ben-sag Thanks for the report. The files are reported as being part of the package because they are part of it, and we treat this as a finding (actually a rather important one IMHO).
What would you expect? If you have another context where log4j was part of a larger scan, only its own files would be reported as belonging to log4j.
See attached scan (in YAML) (2493 lines): log4j-1.2.17.yaml.txt
Thanks. I think for the use case where you are happy for every single file to be listed (e.g. to find out what package each is part of) you simply wouldn't specify --only-findings. People who specify this only-findings option are doing so because they want to trim down the set of files (it's a factor of 10 difference in file size!) and the for_packages logic defeats this entirely. It all depends on what definition we pick for what is a "finding" and I'd argue the useful choice here is the licencey/copyright-y stuff but not basic things like for_packages which are often present on a huge number of files that you don't care about.
@ben-sag Fair enough. Personally I seldom use --only-findings, but I can see how it can help in some cases, and I am fine with excluding package files that have no other clues. Can I interest you in a PR?
Note that a likely better option is to use ScanCode.io in the case where you still want a thorough scan (including code matching to the PurlDB) but prefer streamlined reporting. And/or import your scans in DejaCode.
Description
When using --only-findings together with --package, the only-findings filtering doesn't take effect, since the for_packages attribute is always set. This rather defeats the purpose and leads to output files that are around 10 times bigger than really needed.
I suggest that https://github.com/nexB/scancode-toolkit/blob/develop/src/scancode/plugin_only_findings.py needs an exception to ignore for_packages
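To make the suggestion concrete, here is a minimal sketch of the filtering rule I have in mind, written as a standalone post-processing step over the "files" list of a JSON scan. It is not the actual plugin_only_findings.py code. The function names and the list of keys treated as findings are my own assumptions for illustration.

```python
# Sketch only: this is NOT the real plugin_only_findings.py logic.
# The key names below are assumptions about what a JSON scan contains;
# adjust them to match the options used for the scan.
ASSUMED_FINDING_KEYS = (
    "license_detections",
    "copyrights",
    "holders",
    "authors",
    "package_data",
    "scan_errors",
)


def has_findings(file_entry, finding_keys=ASSUMED_FINDING_KEYS):
    """Return True if any finding key holds a non-empty value.

    "for_packages" is deliberately absent from finding_keys, so a file
    that merely belongs to a package does not count as a finding.
    """
    return any(file_entry.get(key) for key in finding_keys)


def filter_only_findings(files, finding_keys=ASSUMED_FINDING_KEYS):
    """Keep only the file entries that have at least one finding."""
    return [entry for entry in files if has_findings(entry, finding_keys)]
```

With this rule, a file whose only populated attribute is for_packages is dropped, while a file that also carries a copyright or license detection is kept.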
How To Reproduce
e.g.
scancode --package --copyright --license --license-text --only-findings --classify --summary --tallies --license-clarity-score --json-pp=scancode-package-and-only-findings.json PATH
where PATH could be extracted from something like
https://repo1.maven.org/maven2/log4j/log4j/1.2.17/log4j-1.2.17.jar
The resulting JSON is about 10,000 lines (instead of the 1,000 you'd get without the --package option) due to uninformative items like:
System configuration