aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Improve handling of license exceptions #2855

Open pombredanne opened 2 years ago

pombredanne commented 2 years ago

There are cases where a notice may not be amenable to a clean detection, such as when a license (e.g. gpl-2.0) and its exception (e.g. classpath-exception-2.0) are both detected separately. For these, we could instead report a single gpl-2.0 WITH classpath-exception-2.0.

To achieve this we should IMHO:

  1. use the upcoming LicenseDetection post-processing approach
  2. tag all exceptions with the list of license keys that they would typically except
  3. when a detected solo exception is preceded by a license tag or notice of its excepted license, return the combined expression as a new detection built from the two separate detections.

sschuberth commented 9 months ago

I believe we're running into the same issue with ScanCode 32.0.8 scanning this 3RD-PARTY-NOTICES.txt file, which yields this scancode-result.json that contains the expression

"gpl-2.0 AND classpath-exception-2.0"

in several places.

Strict SPDX expression parsers like the one in ORT will fall over this (even after mapping ScanCode license keys to SPDX license IDs) as classpath-exception-2.0 (or its SPDX equivalent Classpath-exception-2.0 with upper-case "C") is not a stand-alone license, but an exception that must only be used as a right-hand operand to the WITH operator.
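To illustrate the kind of check a strict parser applies, here is a minimal sketch in plain Python. It is not ORT's or ScanCode's actual code: the `EXCEPTION_KEYS` set and the `misplaced_exceptions` helper are hypothetical names made up for this example, and the tokenizer only handles simple expressions with parentheses and the AND / OR / WITH keywords.

```python
import re

# Hypothetical, hand-picked set of keys that are exceptions, not licenses.
EXCEPTION_KEYS = {"classpath-exception-2.0"}


def misplaced_exceptions(expression, exception_keys=EXCEPTION_KEYS):
    """Return the exception keys that are not the right-hand operand of WITH."""
    # Split into parentheses and whitespace-separated words.
    tokens = re.findall(r"[()]|[^\s()]+", expression)
    misplaced = []
    for i, token in enumerate(tokens):
        if token.lower() not in exception_keys:
            continue
        previous = tokens[i - 1].upper() if i > 0 else None
        if previous != "WITH":
            misplaced.append(token)
    return misplaced


print(misplaced_exceptions("gpl-2.0 AND classpath-exception-2.0"))
print(misplaced_exceptions("gpl-2.0 WITH classpath-exception-2.0"))
```

The first call flags the exception because it follows AND; the second returns an empty list because the exception directly follows WITH.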

We actually have some post-processing of license findings in ORT to fix this up, but it's non-trivial to get this right for cases with long / nested AND / OR expressions in which the license name and belonging exception name are not listed next to each other.

So obviously, it would be best to get this fixed upstream in ScanCode itself.

AyanSinhaMahapatra commented 9 months ago

Related issue (proposed new attributes for ScanCode licenses): https://github.com/nexB/scancode-toolkit/issues/3484

Once that is in place and we have mappings like this in the LICENSE files (for example a pointer to gpl-2.0 in the classpath-exception-2.0 license exception), we would add a step in the license detection post-processing. But this is TBD, and whether this is the right way is still being discussed.

There is also the related issue of LicenseRef vs AdditionRef discussion in SPDX 3.0 which is related to this, and needs discussion.

@sschuberth in the ORT implementation, if I understand correctly you do not have a mapping of exceptions like this to associate the exceptions with the licenses, but do it just based on proximity right?

AyanSinhaMahapatra commented 9 months ago

We should also update rules to have a single rule for gpl-2.0 WITH classpath-exception-2.0 so this is treated correctly meanwhile.

sschuberth commented 9 months ago

if I understand correctly you do not have a mapping of exceptions like this to associate the exceptions with the licenses, but do it just based on proximity right?

No, it's not only proximity, but we also take our exception mapping into account to create valid license-exception combinations.

pombredanne commented 9 months ago

@sschuberth Thanks for the report!

We actually have some post-processing of license findings in ORT to fix this up, but it's non-trivial to get this right for cases with long / nested AND / OR expressions in which the license name and belonging exception name are not listed next to each other.

IMHO you should report ALL AND ANY license detection issues here (and request any ORT user to do so too). Otherwise, there are no improvements possible.

Now on the funny side, it is 100% clear to me that https://github.com/nordic-institute/X-Road/blob/0f04331e2675428a25d37aee735686cd22bc4e16/src/3RD-PARTY-NOTICES.txt was generated in part using ScanCode. This observation is based on the copyright and license reported where I spotted a few specific behaviors that are the clues that this was done with ScanCode. All the license texts are also matched exactly to ScanCode license texts, and all copyrights are normalized as ScanCode normalizes. I could likely even find which version of ScanCode they used.

Now, in this specific case, I surmise that they actually generated the attribution from ScanCode and assembled side-by-side the GPL and Classpath exception reference texts from ScanCode itself that we then detect together side-by-side in the same file.

We could simply do as Ayan suggested with a new rule, but there is a circular danger there: this file was assembled from ScanCode output in a peculiar manual way, so adding more rules should be done carefully, as this could spiral!

We could also have a more specific way to handle exceptions and their excepted licenses as a new "detection".

sschuberth commented 9 months ago

There is also the related issue of LicenseRef vs AdditionRef discussion in SPDX 3.0 which is related to this, and needs discussion.

For reference, that's this discussion. TL;DR, SPDX 3.0 will use the AdditionRef- prefix for right-hand side operands to WITH that are not core exceptions.
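As a rough illustration of what that means for a tool that maps ScanCode keys to SPDX identifiers, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the `SPDX_EXCEPTIONS` excerpt, the `spdx_with_operand` helper, and in particular the `AdditionRef-scancode-` naming, which is only modeled by analogy on the `LicenseRef-scancode-` convention and is not confirmed anywhere in this thread.

```python
# Hypothetical excerpt: ScanCode exception keys that have an SPDX exception ID.
SPDX_EXCEPTIONS = {"classpath-exception-2.0": "Classpath-exception-2.0"}


def spdx_with_operand(exception_key, spdx_exceptions=SPDX_EXCEPTIONS):
    """Return the right-hand operand to use after WITH for an exception key."""
    spdx_id = spdx_exceptions.get(exception_key)
    if spdx_id is not None:
        # Listed exception: use its SPDX ID directly.
        return spdx_id
    # Unlisted exception: fall back to an AdditionRef- identifier
    # (the "scancode-" namespace here is an assumption).
    return f"AdditionRef-scancode-{exception_key}"


print("GPL-2.0-only WITH " + spdx_with_operand("classpath-exception-2.0"))
print("GPL-2.0-only WITH " + spdx_with_operand("my-custom-exception"))
```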

pombredanne commented 9 months ago

TL;DR, SPDX 3.0 will use the AdditionRef- prefix for right-hand side operands to WITH that are not core exceptions.

FWIW, I was very much against this wart that provides no value that I can fathom, but hey! we will adapt.

sschuberth commented 9 months ago

IMHO you should report ALL AND ANY license detection issues here (and request any ORT user to do so too). Otherwise, there are no improvements possible.

No offense @pombredanne, but this issue has been open for two years (reported by you) and there were no improvements still, so it's clearly not a matter of lacking examples, but a lack of time / prioritization. (Which is ok.)

I could likely even find which version of ScanCode they used.

They use the ScanCode version that ORT uses 😉

We could also have a more specific way to handle exceptions and their excepted licenses as a new "detection".

Yes please. This should be solved generically instead of hard-coding a rule for this specific case. Exceptions to licenses simply never should be reported as licenses on their own, i.e. without the WITH operator.

sschuberth commented 9 months ago

TL;DR, SPDX 3.0 will use the AdditionRef- prefix for right-hand side operands to WITH that are not core exceptions.

FWIW, I was very much against this wart that provides no value that I can fathom, but hey! we will adapt.

I agree. AdditionRef- is pretty much a ~stupid~ unspecific term.

pombredanne commented 9 months ago

@sschuberth re:

this issue has been open for two years (reported by you) and there were no improvements still, so it's clearly not a matter of lacking examples, but a lack of time / prioritization.

Actually the lack of prioritization has been mostly a matter of lack of examples and reported interest, until now.

pombredanne commented 9 months ago

Exceptions to licenses simply never should be reported as licenses on their own, i.e. without the WITH operator.

I am not sure this can be done as a blanket rule, as it will certainly under-report or misreport some GPLs as having an exception when they also apply without one.

Ignoring this for a sec, here is a revised approach from the one listed above in https://github.com/nexB/scancode-toolkit/issues/2855#issue-1125812267 and reworded based on the current state:

  1. Use the LicenseDetection approach as a new detection rule https://github.com/nexB/scancode-toolkit/blob/f70bbb7d9d9bab40a9d504e664bc945b6a1630e8/src/licensedcode/detection.py#L116

  2. Tag all license exceptions records in the license db (such as https://scancode-licensedb.aboutcode.org/?search=exception ) with the list of license keys that they would typically except. For this, use a new attribute named exception_to that would contain a list of license keys. For instance, something along these lines:

    key: classpath-exception-2.0
    is_exception: yes
    ....
    exception_to:
    - gpl-1.0
    - gpl-1.0-plus
    - gpl-2.0
    - gpl-2.0-plus
    - gpl-3.0
    - gpl-3.0-plus

  3. When we have a detected solo exception preceded by a match to a license tag or notice of its excepted license then we could return the combined expression as a new detection from the two separate matches.

  4. We could extend this approach to a few other match sequences like license text followed by exception text.

  5. Another consideration to research: what is the resulting license category of the combined "license with exception" expression, or that of a sub-expression in a larger complex expression? See also https://github.com/nexB/scancode-toolkit/issues/2897
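
Steps 2 and 3 above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the actual ScanCode detection code: the `Match` class is a simplified stand-in for real license matches, `EXCEPTION_TO` mirrors the proposed `exception_to` attribute as a plain dict, and the `combine_exceptions` function and its `max_gap` line-distance threshold are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical in-memory view of the proposed ``exception_to`` attribute.
EXCEPTION_TO = {
    "classpath-exception-2.0": {
        "gpl-1.0", "gpl-1.0-plus",
        "gpl-2.0", "gpl-2.0-plus",
        "gpl-3.0", "gpl-3.0-plus",
    },
}


@dataclass
class Match:
    """Simplified stand-in for a license match (not ScanCode's own class)."""
    key: str
    start_line: int
    end_line: int


def combine_exceptions(matches, exception_to=EXCEPTION_TO, max_gap=2):
    """Merge a solo exception into the excepted license that precedes it."""
    expressions = []
    previous = None  # last plain license match that can still take an exception
    for match in sorted(matches, key=lambda m: m.start_line):
        excepted = exception_to.get(match.key)
        if (
            excepted is not None
            and previous is not None
            and previous.key in excepted
            and match.start_line - previous.end_line <= max_gap
        ):
            # Replace the preceding license with the combined expression.
            expressions[-1] = f"{previous.key} WITH {match.key}"
            previous = None
        else:
            expressions.append(match.key)
            previous = match if excepted is None else None
    return expressions


matches = [Match("gpl-2.0", 1, 3), Match("classpath-exception-2.0", 4, 10)]
print(combine_exceptions(matches))
```

An exception that follows an unrelated license, or one that is too far from its excepted license, is left as a separate detection in this sketch; whether that is the right fallback is part of what still needs discussion.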

sschuberth commented 9 months ago

here is a revised approach from the one listed above

Sounds good to me.

what is the resulting license category of the combined "license with exception" expression

That's a topic @willebra might also be interested in.