Open vw-anton opened 5 months ago
@vw-anton I doubt we can detect this correctly at scale in a plain JSON file without the --package option, especially for MIT. MIT, being "mit" in German, is a very common word and not discriminant enough to be detected as-is.
Why not use the --package option? It is designed for this purpose. And IMHO we cannot detect this correctly when treating a package.json as a blob of text.
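To illustrate the ambiguity, here is a minimal sketch (not ScanCode's actual matching logic) that contrasts a naive text search for the token "mit" with reading the `license` key of a parsed manifest. The helper names `naive_mit_hits` and `declared_license` and both sample manifests are hypothetical.

```python
import json
import re
from typing import List, Optional


def naive_mit_hits(text: str) -> List[str]:
    """Return every standalone token equal to 'mit', ignoring case."""
    return re.findall(r"\bmit\b", text, flags=re.IGNORECASE)


def declared_license(manifest_text: str) -> Optional[str]:
    """Read the 'license' key of a package.json-like manifest, if present."""
    manifest = json.loads(manifest_text)
    value = manifest.get("license")
    return value if isinstance(value, str) else None


# Hypothetical manifests: one with German prose, one with a real license field.
german_manifest = '{"description": "Schriftarten mit Symbolen"}'
licensed_manifest = '{"license": "MIT"}'

# Treated as a blob of text, both manifests look like they mention MIT.
print(naive_mit_hits(german_manifest))
print(naive_mit_hits(licensed_manifest))

# Treated as structured data, only the second one declares a license.
print(declared_license(german_manifest))
print(declared_license(licensed_manifest))
```

The text search cannot tell the German preposition from the license identifier, while the structured read only reports a license where the manifest declares one.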
Some related issues:
We are not using it in ORT due to: https://oss-review-toolkit.slack.com/archives/C9NNJ54B1/p1719903918648839
Let me paste this thread here for reference:
Anton (VW) 1 day ago
Morning guys, I have a very strange case of a missing license finding: We ran ScanCode via ORT (22.5.0) on https://github.com/components/font-awesome/blob/f4f114c4ab37d101e6a15370769bc0af681792fa/package.json and would expect three licenses (CC-BY-4.0, MIT, OFL-1.1). However in the ORT result MIT is missing. When I run ScanCode via scancode.io all licenses are found. In ORT and in scancode.io the same ScanCode version (32.1.0) is used. Does anybody have an idea where the gap might come from?
ScanCode.io result:
"license_detections": [
  {
    "license_expression": "cc-by-4.0",
    "license_expression_spdx": "CC-BY-4.0",
    "matches": [
      {
        "license_expression": "cc-by-4.0",
        "spdx_license_expression": "CC-BY-4.0",
        "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
        "start_line": 1,
        "end_line": 1,
        "matcher": "1-hash",
        "score": 50.0,
        "matched_length": 4,
        "match_coverage": 100.0,
        "rule_relevance": 50,
        "rule_identifier": "spdx_license_id_cc-by-4.0_for_cc-by-4.0.RULE",
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_cc-by-4.0_for_cc-by-4.0.RULE",
        "matched_text": "CC-BY-4.0"
      }
    ],
    "identifier": "cc_by_4_0-415c083c-ccd1-233c-986e-75bb1ddc3fdc"
  },
  {
    "license_expression": "mit",
    "license_expression_spdx": "MIT",
    "matches": [
      {
        "license_expression": "mit",
        "spdx_license_expression": "MIT",
        "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
        "start_line": 1,
        "end_line": 1,
        "matcher": "1-spdx-id",
        "score": 100.0,
        "matched_length": 1,
        "match_coverage": 100.0,
        "rule_relevance": 100,
        "rule_identifier": "spdx-license-identifier-mit-5da48780aba670b0860c46d899ed42a0f243ff06",
        "rule_url": null,
        "matched_text": "MIT"
      }
    ],
    "identifier": "mit-a822f434-d61f-f2b1-c792-8b8cb9e7b9bf"
  },
  {
    "license_expression": "ofl-1.1",
    "license_expression_spdx": "OFL-1.1",
    "matches": [
      {
        "license_expression": "ofl-1.1",
        "spdx_license_expression": "OFL-1.1",
        "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
        "start_line": 1,
        "end_line": 1,
        "matcher": "1-hash",
        "score": 50.0,
        "matched_length": 3,
        "match_coverage": 100.0,
        "rule_relevance": 50,
        "rule_identifier": "spdx_license_id_ofl-1.1_for_ofl-1.1.RULE",
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_ofl-1.1_for_ofl-1.1.RULE",
        "matched_text": "OFL-1.1"
      }
    ],
    "identifier": "ofl_1_1-52c45f4c-8cce-acf4-9ef3-6682faf0c586"
  }
vs ORT result:
scanner:
  name: "ScanCode"
  version: "32.1.0"
  configuration: "--copyright --license --license-text --info --strip-root --timeout\
    \ 600 --json-pp"
summary:
  start_time: "2024-06-28T10:59:46.000199521Z"
  end_time: "2024-06-28T11:01:51.000822060Z"
  licenses:
  - license: "CC-BY-4.0"
    location:
      path: "package.json"
      start_line: 10
      end_line: 11
  - license: "OFL-1.1"
    location:
      path: "package.json"
      start_line: 13
      end_line: 13
    score: 50.0
sschuberth 1 day ago ORT (deliberately) does not run ScanCode with the --package option. Is the MIT finding maybe only present in the raw result with that option?
sschuberth 1 day ago Because the start / end line of 1 is also a bit suspicious / clearly wrong.
sschuberth 1 day ago In any case, ORT should report a declared license of MIT for that package, so in total no license information is lost.
Anton (VW) 1 day ago Thanks for the hint, will check that. Would you recommend to always enable the package option?
sschuberth 1 day ago I would recommend to always disable it when using ORT, that's why that's the default :wink: One of the reasons for this is that enabling it breaks ORT's semantics to clearly distinguish between "detected" and "declared" licenses, as --package causes ScanCode to report declared licenses as detected licenses.
@vw-anton re:
clearly distinguish between "detected" and "declared" licenses, as --package causes ScanCode to report declared licenses as detected licenses.
We track these licenses at the package level:
declared_license_expression: The license expression for this package, typically derived from its extracted_license_statement or from some other type-specific routine or convention.
other_license_expression: The license expression for this package which is different from the declared_license_expression (i.e. not the primary license).
Both are normalized license expressions obtained by running ScanCode license detection, possibly using package-type-specific conventions.
We also track:
extracted_license_statement: The license statement mention, tag or text as found in a package manifest and extracted. This can be a string, a list or dict of strings possibly nested, as found originally in the manifest.
notice_text: A notice text for this package.
declared_license_expression is generally consistent with the SPDX definition. There is no such thing as a "detected license" in SPDX, and we do not track a concluded license in ScanCode Toolkit since, as a tool, it does not conclude anything.
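As a rough sketch of how a consumer could keep package-level and file-level licenses apart when reading a ScanCode JSON result: the code below assumes a top-level `packages` list carrying `declared_license_expression` and a `files` list carrying `license_detections`, which is my reading of the output layout and should be checked against a real scan. The sample data and the helper names are hypothetical and heavily trimmed.

```python
import json
from typing import Set

# Hypothetical, trimmed scan result; a real ScanCode result has many more fields.
SAMPLE_SCAN = json.loads("""
{
  "packages": [
    {
      "declared_license_expression": "cc-by-4.0 AND ofl-1.1 AND mit",
      "other_license_expression": null,
      "extracted_license_statement": "(CC-BY-4.0 AND OFL-1.1 AND MIT)"
    }
  ],
  "files": [
    {
      "path": "package.json",
      "license_detections": [
        {"license_expression": "cc-by-4.0"},
        {"license_expression": "ofl-1.1"}
      ]
    }
  ]
}
""")


def declared_expressions(scan: dict) -> Set[str]:
    """Collect declared_license_expression values from package-level data."""
    return {
        pkg["declared_license_expression"]
        for pkg in scan.get("packages", [])
        if pkg.get("declared_license_expression")
    }


def file_level_expressions(scan: dict) -> Set[str]:
    """Collect license expressions reported on individual files."""
    found = set()
    for entry in scan.get("files", []):
        for detection in entry.get("license_detections", []):
            found.add(detection["license_expression"])
    return found


print(sorted(declared_expressions(SAMPLE_SCAN)))
print(sorted(file_level_expressions(SAMPLE_SCAN)))
```

Keeping the two sets separate lets a downstream tool decide for itself how to present package-level and file-level findings.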
So please consider using the --package option, which is how we implemented correct detection of these licenses. I am open to refinements, improvements and enhancements, but you already have a designed, tested and correct way to detect all these licenses right now, without any changes.
Description
ScanCode does not extract the "MIT" license from the following file when run without the --package option: https://github.com/components/font-awesome/blob/f4f114c4ab37d101e6a15370769bc0af681792fa/package.json
This is also reflected by the result of scancode.io which reports:
How To Reproduce
Run ScanCode 32.1.0 via ORT 22.5.0
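For a comparison outside ORT, the two invocations can be sketched as follows. The option list mirrors the configuration string from the ORT result quoted in this thread; the `scancode` executable name on the PATH, the helper `scancode_command` and the output file names are assumptions, and the sketch only builds and prints the command lines without running them.

```python
import shlex
from typing import List

# Options taken from the ORT scanner configuration quoted in this issue,
# without the trailing --json-pp, which needs the output file as its value.
BASE_OPTIONS = [
    "--copyright", "--license", "--license-text", "--info",
    "--strip-root", "--timeout", "600",
]


def scancode_command(input_path: str, output_path: str, with_package: bool = False) -> List[str]:
    """Build a scancode argument list, optionally adding --package."""
    args = ["scancode", *BASE_OPTIONS]
    if with_package:
        args.append("--package")
    # --json-pp takes the output file; the input path comes last.
    args += ["--json-pp", output_path, input_path]
    return args


# Hypothetical output file names for the two runs to compare.
print(shlex.join(scancode_command("package.json", "without-package.json")))
print(shlex.join(scancode_command("package.json", "with-package.json", with_package=True)))
```

Comparing the two resulting JSON files should show whether the MIT finding only appears when --package is enabled.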
System configuration