aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Option to only report first occurrence of a license of a kind #464

Closed sschuberth closed 5 years ago

sschuberth commented 7 years ago

For license provenance purposes it is often enough to prove that at least one file under a specific license is present. To make the result set smaller and easier to review, how about adding an option to only keep the first occurrence of each license?

For example, if file A is GPL, and file B is also GPL, only report file A.
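A minimal sketch of that simple case, assuming each file carries exactly one license and that results arrive in scan order. The `(path, license)` pair shape and the `first_occurrences` name are hypothetical, chosen for illustration, and are not ScanCode's output format or API:

```python
def first_occurrences(files):
    """Keep only the first file seen for each license.

    `files` is an ordered list of (path, license) pairs. This shape is a
    simplification for illustration, not ScanCode's actual output format.
    """
    seen = set()
    kept = []
    for path, license_key in files:
        if license_key not in seen:
            seen.add(license_key)
            kept.append((path, license_key))
    return kept


scan = [("A", "GPL"), ("B", "GPL"), ("C", "MIT")]
print(first_occurrences(scan))  # [('A', 'GPL'), ('C', 'MIT')]
```

File B is dropped because GPL was already seen in file A.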

In other cases, I'm not yet entirely sure how the feature should work: If file A has both BSD and Apache-2.0, file B has only BSD, and file C has only Apache-2.0, would you only report file A as that's sufficient to cover the list of different licenses, or should you report B and C to clarify the licenses might appear in different files, or report all three files in this case?
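One way to make the "report only file A" reading concrete is a greedy cover: keep picking the file that adds the most not-yet-covered licenses. This is only a sketch of one of the three options raised above, not a decision on which is right. The `greedy_cover` name and the path-to-license-set mapping are assumptions for illustration:

```python
def greedy_cover(files):
    """Pick files until every license is covered, largest gain first.

    `files` maps a path to the set of licenses found in it (a hypothetical,
    simplified shape). Greedy selection is a heuristic and is not guaranteed
    to find the smallest possible set of files.
    """
    remaining = set().union(*files.values())
    chosen = []
    while remaining:
        # max() returns the first maximal item, so ties go to the earlier file.
        best = max(files, key=lambda path: len(files[path] & remaining))
        chosen.append(best)
        remaining -= files[best]
    return chosen


files = {"A": {"BSD", "Apache-2.0"}, "B": {"BSD"}, "C": {"Apache-2.0"}}
print(greedy_cover(files))  # ['A']
```

Reporting B and C instead, or all three files, would need a different selection rule. That is exactly the ambiguity in question.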

pombredanne commented 7 years ago

This makes a lot of sense, and as you explain, the devil is in the details... IMHO this would be best handled as part of #377 ... would you agree? E.g. the goal would not be to make the scan results smaller per se, but to make them easier to review hence apply some smarts on top of results to get some sort of summarization (one of which could be the one you mentioned here).

pombredanne commented 7 years ago

Note that if you want to be comprehensive, you may also need to collect the copyrights. And there are cases where you will have various combos of licenses and copyrights present or not in various files of the same codebase. Again, all these small cases matter I think.

sschuberth commented 7 years ago

> IMHO this would be best handled as part of #377 ... would you agree?

That sounds reasonable, yes.

> the goal would not be to make the scan results smaller per se, but to make them easier to review hence apply some smarts on top of results

How would you do this e.g. for SPDX tag-value output? IMO you cannot, so you would really have to omit data from the result file in this case.

> various combos of licenses and copyrights present

Good point. So you would basically need to de-duplicate on copyright-license tuples instead of just licenses.
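As a rough sketch, de-duplicating on copyright-license tuples could look like the following. The `(path, copyright, license)` triples and the `first_per_pair` name are hypothetical. A real scan can report several copyrights and licenses per file, which this deliberately ignores:

```python
def first_per_pair(findings):
    """Keep the first file seen for each (copyright, license) pair.

    `findings` is an ordered list of (path, copyright, license) triples,
    a simplified shape assumed for illustration only.
    """
    seen = set()
    kept = []
    for path, copyright_holder, license_key in findings:
        pair = (copyright_holder, license_key)
        if pair not in seen:
            seen.add(pair)
            kept.append((path, copyright_holder, license_key))
    return kept


findings = [
    ("A", "Alice", "GPL"),
    ("B", "Alice", "GPL"),  # same pair as A, dropped
    ("C", "Bob", "GPL"),    # same license, different copyright, kept
]
print(first_per_pair(findings))  # [('A', 'Alice', 'GPL'), ('C', 'Bob', 'GPL')]
```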

pombredanne commented 5 years ago

Since we now report license expressions, this is IMHO no longer relevant. Would you agree?

sschuberth commented 5 years ago

Just to document the current state: license_expressions (plural) are currently reported once per file, and multiple findings of the exact same license are de-duplicated in that list. So that's different from what I was originally asking in this ticket.

However, I hereby withdraw my original proposal as I believe ScanCode should always provide the full picture and report all findings. If picking an (arbitrary or specific) occurrence of a license finding is enough for someone's use case, that picking should be done by some custom tooling provided by the user as a post-processor to ScanCode's results.
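To illustrate what such a post-processor might look like, here is a minimal sketch. It assumes the scan was saved as JSON with a top-level `files` list whose entries have a `path` and a `license_expressions` list. The field names follow the discussion above, but the exact layout depends on the ScanCode version and should be checked against real output. The `first_file_per_expression` name is hypothetical:

```python
import json


def first_file_per_expression(scan):
    """Map each license expression to the first file that reports it.

    Assumes `scan` is a parsed JSON document with a "files" list whose
    entries carry "path" and "license_expressions". Verify this layout
    against the output of the ScanCode version actually in use.
    """
    first_seen = {}
    for entry in scan.get("files", []):
        for expression in entry.get("license_expressions", []):
            # setdefault keeps the earliest path and ignores later ones.
            first_seen.setdefault(expression, entry["path"])
    return first_seen


sample = json.loads("""
{"files": [
  {"path": "a.c", "license_expressions": ["gpl-2.0"]},
  {"path": "b.c", "license_expressions": ["gpl-2.0", "mit"]}
]}
""")
print(first_file_per_expression(sample))  # {'gpl-2.0': 'a.c', 'mit': 'b.c'}
```

Doing this outside ScanCode leaves the full scan untouched, so a different picking rule can be applied later without rescanning.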