aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://aboutcode.org/scancode/
2.14k stars 552 forks source link

License expression of 'GPLv3+ and LGPLv3+' is incorrect #1895

Open qduanmu opened 4 years ago

qduanmu commented 4 years ago

Description

The license expression of 'GPLv3+ and LGPLv3+' is incorrect.

How To Reproduce

>>> from packagedcode.licensing import get_normalized_expression >>> get_normalized_expression('GPLv3+ and LGPLv3+') '(gpl-3.0-plus AND lgpl-3.0-plus) AND unknown'

Based on my understanding, 'GPLv3+' and 'LGPLv3+' are failed in the SPDX matching as they are not SPDX license keys(If we use their SPDX keys like 'GPL-3.0+ OR LGPL-3.0+', or 'GPL-3.0-or-later OR LGPL-3.0-or-later', the result would be correct). While in the following 'exact' matching, the ‘and’ was query tokenized which should be treated as a license expression operator instead, and it imports an 'unknown' in the final result. Finally, we join the possible multiple detected license expression with an 'AND', see: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/licensing.py#L109, this also leads to a questionable expression result. For example: >>> get_normalized_expression('GPLv3+ or LGPLv3+') '(gpl-3.0-plus AND lgpl-3.0-plus) AND unknown'

The package declared license is using SPDX license expression syntax, while the licenses are not SPDX license keys, and looks like this is a common case.

System configuration

pombredanne commented 4 years ago

@qduanmu thank you for catching this! :bowing_man: I think the resolution is to have a list of license symbols mapped to license keys so that we can then do a proper license expression detection on this. The https://github.com/nexB/license-expression/ library supports this explicitly for this purpose. There are a few tickets to deal with this https://github.com/nexB/scancode-toolkit/issues/1523 and possibly https://github.com/nexB/scancode-toolkit/issues/707 originally and also this project I/we started at SPDX based on input from @nishakm https://github.com/spdx/package-licenses-mapping (originally from https://github.com/spdx/tools-python/issues/106 )

In all cases it will boil down to have a mapping that says:

Then when we would call the code here to craft a Licensing object, we would feed it a list of known symbols (e.g. all the SPDX, scancode and the aliases we are talking about to add here) https://github.com/nexB/scancode-toolkit/blob/16377111febfaef2fd0806200a43cd9f239f7d0a/src/packagedcode/licensing.py#L84

The key thing is to decide where to store that:

  1. in a python script
  2. more likely in the .yml data file of a license UNLESS we want these mappings to be package-type specific as proposed in #1523

What do you think?

qduanmu commented 4 years ago

There are a few tickets to deal with this #1523 and possibly #707 originally and also this project I/we started at SPDX based on input from @nishakm https://github.com/spdx/package-licenses-mapping (originally from spdx/tools-python#106 )

Thank you for the linked information @pombredanne . Based on the discussion of spdx/tools-python#106, could I understand like this? we will use this repo https://github.com/spdx/package-licenses-mapping as the mapping source to get the license aliases in future.

Then when we would call the code here to craft a Licensing object, we would feed it a list of known symbols (e.g. all the SPDX, scancode and the aliases we are talking about to add here)

Maybe this is too detailed, but should we feed it here? https://github.com/nexB/scancode-toolkit/blob/16377111febfaef2fd0806200a43cd9f239f7d0a/src/licensedcode/match_spdx_lid.py#L83

The key thing is to decide where to store that:

1. in a python script
2. more likely in the .yml data file of a license
   UNLESS we want these mappings to be package-type specific as proposed in #1523

If the mapping is package-type specific, I agree that license .yml file is not a good place, a special file may be a better choice and this is something related to how we run detection based on package type IMHO. Otherwise, license .yml could be an option.

pombredanne commented 4 years ago

@qduanmu thank you for your detailed answer!

re:

Maybe this is too detailed, but should we feed it here? https://github.com/nexB/scancode-toolkit/blob/16377111febfaef2fd0806200a43cd9f239f7d0a/src/licensedcode/match_spdx_lid.py#L83

yes that's the place.

If the mapping is package-type specific, I agree that license .yml file is not a good place, a special file may be a better choice and this is something related to how we run detection based on package type IMHO. Otherwise, license .yml could be an option.

I am split on this. @sschuberth 's suggestion was that this could be something that is not package-type specific. It works in many cases, but on the other hand some "aliases" are found only in a single type package manifest (for instance some peculiar R packages) or some multiple license expressions are quite peculiar to a manifest (for instance in R again but also in Rust). So may be we need a combo of these to cover all cases:

  1. some mapping/aliases directly in the .yml of a License.
  2. some mapping/aliases as plain new rules ideally tagged some is_mapping flag. We could also have more complex expressions mapped rather than just symbols ebing mapped (useful for R and Rust in particular)
  3. some mapping/aliases specific to a Package type and stored in Python code together with the Package subclass they are for.

May be this could all be resolved with 2. only as we could cover 1. and I do not think that there are ever any case where entries in 3. would be conflicting e.g. this is no case where the string ABC:

sschuberth commented 4 years ago

It works in many cases, but on the other hand some "aliases" are found only in a single type package manifest (for instance some peculiar R packages) or some multiple license expressions are quite peculiar to a manifest (for instance in R again but also in Rust).

My stance on this still is that in such cases some package-manager-specific pre-processing / mapping should happen before using the result of this pre-processing as the lookup key for the generic mapping.

qduanmu commented 4 years ago

So may be we need a combo of these to cover all cases:

1. some mapping/aliases directly in the .yml of a License.
2. some mapping/aliases as plain new rules ideally tagged some `is_mapping` flag.

This is related to how we define the license mapping structure and whether to have the package manager stuff in it. I second the idea from @sschuberth in this comment and also this one that we should keep the mapping generic and simple.

Then in scancode, the generic license aliases would be in .yml files, for those(especially for some complex expressions) which only apply to certain package type(s), we could build license rules and apply them only to related package type detection to reduce the possible false positives in other package types.

pombredanne commented 3 years ago

So I am not sure mapping really work as a general purpose solution. To get a better understanding of the problem, see for instance, how we evolved in ScanCode and how the original designs with mapping files has "dissolved" naturally, in favor of a more comprehensive solution that yields better, more accurate results.

That said, this issue is not yet solved. The resolution here would likely be simply to use the built-in "mappings" we have in each licenses, where we have several alternative keys tracked including older, deprecated SPDX keys: with "other_spdx_keys" https://github.com/nexB/scancode-toolkit/blob/3be07db24f251789d5a7982cfc0b18317f0b404c/src/licensedcode/models.py#L114