Open AyanSinhaMahapatra opened 8 months ago
Hey @AyanSinhaMahapatra, I scanned beautifulsoup4-4.12.3-py3-none-any.whl and got the output at https://0x0.st/XPyV.whl.json, and I can't find any unknown-license-reference detections in it.
So can I assume that, for the beautifulsoup4 package, the scan successfully found the license file, which is present at
beautifulsoup4-4.12.3-py3-none-any.whl-extract/beautifulsoup4-4.12.3.dist-info/licenses/LICENSE
since we got the following license detections for it:
"license_detections": [
    {
        "license_expression": "mit",
        "license_expression_spdx": "MIT",
        "matches": [
            {
                "license_expression": "mit",
                "spdx_license_expression": "MIT",
                "from_file": "beautifulsoup4-4.12.3-py3-none-any.whl",
                "start_line": 1,
                "end_line": 1,
                "matcher": "1-hash",
                "score": 100.0,
                "matched_length": 2,
                "match_coverage": 100.0,
                "rule_relevance": 100,
                "rule_identifier": "mit_14.RULE",
                "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/mit_14.RULE",
                "matched_text": "MIT License"
            }
        ],
        "identifier": "mit-9967e727-165e-9bb5-f090-7de5e47a3929"
    },
Similarly, for beartype-0.17.2-py3-none-any.whl, we got the following license detection for the package:
"license_detections": [
    {
        "license_expression": "mit",
        "license_expression_spdx": "MIT",
        "matches": [
            {
                "license_expression": "mit",
                "spdx_license_expression": "MIT",
                "from_file": "beartype-0.17.2-py3-none-any.whl-extract/beartype-0.17.2.dist-info/METADATA",
                "start_line": 1,
                "end_line": 1,
                "matcher": "1-spdx-id",
                "score": 100.0,
                "matched_length": 1,
                "match_coverage": 100.0,
                "rule_relevance": 100,
                "rule_identifier": "spdx-license-identifier-mit-5da48780aba670b0860c46d899ed42a0f243ff06",
                "rule_url": null,
                "matched_text": "MIT"
            }
        ],
        "identifier": "mit-a822f434-d61f-f2b1-c792-8b8cb9e7b9bf"
    },
So why do we need get_key_files at all, given that we already have the license expression, as the examples above show?
Correct me if I am going in the wrong direction.
I'm not looking into the beautifulsoup scan, as that's not mentioned in the issue; I'm only looking at the beartype-0.17.2.json scan results I've linked above.
See other_license_expressions in the summary:
"summary": {
    "declared_license_expression": "mit",
    "license_clarity_score": {
        "score": 0,
        "declared_license": false,
        "identification_precision": false,
        "has_license_text": false,
        "declared_copyrights": false,
        "conflicting_license_categories": false,
        "ambiguous_compound_licensing": true
    },
    "declared_holder": "Beartype",
    "primary_language": "Python",
    "other_license_expressions": [
        {
            "value": "unknown-license-reference",
            "count": 230
        },
Almost all the .py files in this wheel have an unknown-license-reference detected in them (even though, in the license detections for the manifest file, we are able to resolve the unknown references successfully).
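To see where a count like this comes from, the tally can be reproduced from the file-level results. The following is a minimal sketch, not ScanCode's own summary code: it assumes each entry under `files` carries a `detected_license_expression` field, and the function name `tally_license_expressions` and the sample data are made up for illustration.

```python
from collections import Counter


def tally_license_expressions(scan):
    """Count file-level license expressions in a ScanCode-style result dict.

    Assumes each resource has "type" and "detected_license_expression" keys.
    """
    counts = Counter()
    for resource in scan.get("files", []):
        # Directories carry no detections of their own, so skip them.
        if resource.get("type") != "file":
            continue
        expression = resource.get("detected_license_expression")
        if expression:
            counts[expression] += 1
    return counts


# Made-up sample data, shaped like a trimmed scan result.
sample_scan = {
    "files": [
        {"path": "pkg", "type": "directory", "detected_license_expression": None},
        {"path": "pkg/a.py", "type": "file", "detected_license_expression": "unknown-license-reference"},
        {"path": "pkg/b.py", "type": "file", "detected_license_expression": "unknown-license-reference"},
        {"path": "pkg-1.0.dist-info/LICENSE", "type": "file", "detected_license_expression": "mit"},
    ]
}

counts = tally_license_expressions(sample_scan)
print(counts.most_common())
```

Running the same function over a real scan file loaded with `json.load` should show whether the unknown references dominate the file-level detections.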
Also see for reference this PR: https://github.com/nexB/scancode-toolkit/pull/3315
And see how we have this implemented for maven at https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/maven.py#L223 . By the way, this is also used to calculate the summary: get_field_values_from_codebase_resources at https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/score.py#L308 uses is_key_file, which is calculated with the help of https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/classify_plugin.py#L140
We have to look at all datafile handlers and see where we have examples of manifest files (and LICENSE files not being present at the root)
Here, if we want to populate the other_license_expressions field from the other license expressions found in all package files, these unknown-license-reference values would be there. We also have this many unknowns detected in the file-level license expressions, which we need to fix as well.
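One way the file-level unknowns could be fixed is by widening where a referenced filename is looked up. This is only a rough sketch over plain POSIX-style path strings: `find_referenced_file`, its parameters and the lookup order (sibling, then scan root, then key files) are hypothetical and do not describe the existing ScanCode logic.

```python
from pathlib import PurePosixPath


def find_referenced_file(referencing_path, referenced_name, all_paths,
                         scan_root="", key_files=()):
    """Return the path of the referenced file, or None if it cannot be found.

    Lookup order (an assumption for this sketch):
    1. a sibling of the referencing file,
    2. a file directly under the scan root,
    3. any ecosystem-specific key file with the same name.
    """
    known = set(all_paths)
    sibling = str(PurePosixPath(referencing_path).parent / referenced_name)
    at_root = str(PurePosixPath(scan_root) / referenced_name)
    for candidate in (sibling, at_root):
        if candidate in known:
            return candidate
    for key_file in key_files:
        if key_file in known and PurePosixPath(key_file).name == referenced_name:
            return key_file
    return None


all_paths = [
    "betse/__init__.py",
    "betse-1.3.0.dist-info/LICENSE",
    "betse-1.3.0.dist-info/METADATA",
]

# Without key files, the reference to "LICENSE" stays unresolved.
without_keys = find_referenced_file("betse/__init__.py", "LICENSE", all_paths)

# With the dist-info LICENSE passed in as a key file, it resolves.
with_keys = find_referenced_file(
    "betse/__init__.py", "LICENSE", all_paths,
    key_files=["betse-1.3.0.dist-info/LICENSE"],
)
print(without_keys, with_keys)
```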
I understand what you said, but I am scanning a Python wheel of the same type, betse-1.3.0-py3-none-any.whl, which has the same header issue:
#!/usr/bin/env python3
# ....................{ LICENSE }....................
# Copyright 2014-2022 by Alexis Pietak & Cecil Curry.
# See "LICENSE" for further details.

# ....................{ IMPORTS }....................
but the scan output does not report unknown-license-reference in other_license_expressions.
The output can be seen at https://0x0.st/XPwQ.json, or in this excerpt:
"summary": {
    "declared_license_expression": "bsd-new AND bsd-simplified",
    "license_clarity_score": {
        "score": 0,
        "declared_license": false,
        "identification_precision": false,
        "has_license_text": false,
        "declared_copyrights": false,
        "conflicting_license_categories": false,
        "ambiguous_compound_licensing": true
    },
    "declared_holder": "",
    "primary_language": "Python",
    "other_license_expressions": [
        {
            "value": null,
            "count": 1
        }
    ],
    "other_holders": [
        {
            "value": null,
            "count": 1
        }
    ],
    "other_languages": [
        {
            "value": null,
            "count": 1
        }
    ]
},
The scan directory structure looks similar to that of the beartype package mentioned above:
.
├── betse
│   ├── __init__.py
│   ├── __main__.py
│   ├── appmeta.py
│   ├── cli
│   ├── data
│   ├── exceptions.py
│   ├── gui
│   ├── lib
│   ├── metadata.py
│   ├── metadeps.py
│   ├── science
│   └── util
└── betse-1.3.0.dist-info
    ├── LICENSE
    ├── METADATA
    ├── RECORD
    ├── WHEEL
    ├── entry_points.txt
    └── top_level.txt
If we scan beartype-0.17.2-py3-none-any.whl we get the following output scan: beartype-0.17.2.json
This has a lot of unknown-license-reference detections, for the following reasons:
Every .py file carries the same header, which refers to a license file.
The referenced license file is present at:
beartype-0.17.2-py3-none-any.whl-extract/beartype-0.17.2.dist-info/LICENSE
which is not the root of the scan directory but an ecosystem-specific location that we need to look into. We miss this LICENSE file because we only look at sibling files and files at the root.
We should implement something like get_key_files() for each ecosystem-specific handler, and use it in the logic that follows license references.
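As a rough illustration of what a wheel-specific get_key_files() could return, here is a minimal sketch over plain path strings. It is not the ScanCode handler API: the function name `find_wheel_key_files`, the filename prefixes and the `.dist-info` check are assumptions made for this example.

```python
from pathlib import PurePosixPath

# Assumed license-like filename prefixes (lowercase); not an official list.
LICENSE_NAME_PREFIXES = ("license", "licence", "copying", "notice")


def find_wheel_key_files(paths):
    """Return the license-like files that live under a *.dist-info directory.

    This covers both dist-info/LICENSE and dist-info/licenses/LICENSE layouts.
    """
    key_files = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if not parts:
            continue
        name = parts[-1].lower()
        # Any ancestor directory ending in ".dist-info" counts.
        in_dist_info = any(part.endswith(".dist-info") for part in parts[:-1])
        if in_dist_info and name.startswith(LICENSE_NAME_PREFIXES):
            key_files.append(path)
    return key_files


paths = [
    "betse/__init__.py",
    "betse-1.3.0.dist-info/LICENSE",
    "betse-1.3.0.dist-info/METADATA",
    "beautifulsoup4-4.12.3.dist-info/licenses/LICENSE",
]
key_files = find_wheel_key_files(paths)
print(key_files)
```

Matching on the directory suffix rather than on a fixed depth keeps the sketch independent of the `-extract` prefix that the scan adds to extracted wheels.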