Open qduanmu opened 4 years ago
@qduanmu thank you for catching this! :bowing_man: I think the resolution is to have a list of license symbols mapped to license keys so that we can then do a proper license expression detection on this. The https://github.com/nexB/license-expression/ library supports this explicitly for this purpose. There are a few tickets to deal with this https://github.com/nexB/scancode-toolkit/issues/1523 and possibly https://github.com/nexB/scancode-toolkit/issues/707 originally and also this project I/we started at SPDX based on input from @nishakm https://github.com/spdx/package-licenses-mapping (originally from https://github.com/spdx/tools-python/issues/106 )
In all cases it will boil down to have a mapping that says:
GPLv3+
--> gpl-3.0-plus
LGPLv3+
--> lgpl-3.0-plus
.. and so onThen when we would call the code here to craft a Licensing
object, we would feed it a list of known symbols (e.g. all the SPDX, scancode and the aliases we are talking about to add here)
https://github.com/nexB/scancode-toolkit/blob/16377111febfaef2fd0806200a43cd9f239f7d0a/src/packagedcode/licensing.py#L84
The key thing is to decide where to store that:
What do you think?
There are a few tickets to deal with this #1523 and possibly #707 originally and also this project I/we started at SPDX based on input from @nishakm https://github.com/spdx/package-licenses-mapping (originally from spdx/tools-python#106 )
Thank you for the linked information @pombredanne . Based on the discussion of spdx/tools-python#106, could I understand like this? we will use this repo https://github.com/spdx/package-licenses-mapping as the mapping source to get the license aliases in future.
Then when we would call the code here to craft a
Licensing
object, we would feed it a list of known symbols (e.g. all the SPDX, scancode and the aliases we are talking about to add here)
Maybe this is too detailed, but should we feed it here? https://github.com/nexB/scancode-toolkit/blob/16377111febfaef2fd0806200a43cd9f239f7d0a/src/licensedcode/match_spdx_lid.py#L83
The key thing is to decide where to store that:
1. in a python script 2. more likely in the .yml data file of a license UNLESS we want these mappings to be package-type specific as proposed in #1523
If the mapping is package-type specific, I agree that license .yml file is not a good place, a special file may be a better choice and this is something related to how we run detection based on package type IMHO. Otherwise, license .yml could be an option.
@qduanmu thank you for your detailed answer!
re:
Maybe this is too detailed, but should we feed it here? https://github.com/nexB/scancode-toolkit/blob/16377111febfaef2fd0806200a43cd9f239f7d0a/src/licensedcode/match_spdx_lid.py#L83
yes that's the place.
If the mapping is package-type specific, I agree that license .yml file is not a good place, a special file may be a better choice and this is something related to how we run detection based on package type IMHO. Otherwise, license .yml could be an option.
I am split on this. @sschuberth 's suggestion was that this could be something that is not package-type specific. It works in many cases, but on the other hand some "aliases" are found only in a single type package manifest (for instance some peculiar R packages) or some multiple license expressions are quite peculiar to a manifest (for instance in R again but also in Rust). So may be we need a combo of these to cover all cases:
is_mapping
flag. We could also have more complex expressions mapped rather than just symbols ebing mapped (useful for R and Rust in particular)May be this could all be resolved with 2. only as we could cover 1. and I do not think that there are ever any case where entries in 3. would be conflicting e.g. this is no case where the string ABC:
It works in many cases, but on the other hand some "aliases" are found only in a single type package manifest (for instance some peculiar R packages) or some multiple license expressions are quite peculiar to a manifest (for instance in R again but also in Rust).
My stance on this still is that in such cases some package-manager-specific pre-processing / mapping should happen before using the result of this pre-processing as the lookup key for the generic mapping.
So may be we need a combo of these to cover all cases:
1. some mapping/aliases directly in the .yml of a License. 2. some mapping/aliases as plain new rules ideally tagged some `is_mapping` flag.
This is related to how we define the license mapping structure and whether to have the package manager stuff in it. I second the idea from @sschuberth in this comment and also this one that we should keep the mapping generic and simple.
Then in scancode, the generic license aliases would be in .yml files, for those(especially for some complex expressions) which only apply to certain package type(s), we could build license rules and apply them only to related package type detection to reduce the possible false positives in other package types.
So I am not sure mapping really work as a general purpose solution. To get a better understanding of the problem, see for instance, how we evolved in ScanCode and how the original designs with mapping files has "dissolved" naturally, in favor of a more comprehensive solution that yields better, more accurate results.
That said, this issue is not yet solved. The resolution here would likely be simply to use the built-in "mappings" we have in each licenses, where we have several alternative keys tracked including older, deprecated SPDX keys: with "other_spdx_keys" https://github.com/nexB/scancode-toolkit/blob/3be07db24f251789d5a7982cfc0b18317f0b404c/src/licensedcode/models.py#L114
Description
The license expression of 'GPLv3+ and LGPLv3+' is incorrect.
How To Reproduce
>>> from packagedcode.licensing import get_normalized_expression
>>> get_normalized_expression('GPLv3+ and LGPLv3+')
'(gpl-3.0-plus AND lgpl-3.0-plus) AND unknown'
Based on my understanding, 'GPLv3+' and 'LGPLv3+' are failed in the SPDX matching as they are not SPDX license keys(If we use their SPDX keys like 'GPL-3.0+ OR LGPL-3.0+', or 'GPL-3.0-or-later OR LGPL-3.0-or-later', the result would be correct). While in the following 'exact' matching, the ‘and’ was query tokenized which should be treated as a license expression operator instead, and it imports an 'unknown' in the final result. Finally, we join the possible multiple detected license expression with an 'AND', see: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/licensing.py#L109, this also leads to a questionable expression result. For example:
>>> get_normalized_expression('GPLv3+ or LGPLv3+')
'(gpl-3.0-plus AND lgpl-3.0-plus) AND unknown'
The package declared license is using SPDX license expression syntax, while the licenses are not SPDX license keys, and looks like this is a common case.
System configuration