Open daniel-eder opened 4 years ago
Thank you for the report!
See #2257 as it could be a solution
Here there is a rule that detects as apache-1.1 OR apache-2.0
for this text:
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
When using the --license-text and --license-text-diagnostics options, this becomes clearer:
{
"headers": [
{
"tool_name": "scancode-toolkit",
"tool_version": "3.2.1rc2",
"options": {
"input": [
"NOTICE.1"
],
"--json-pp": "-",
"--license": true,
"--license-text": true,
"--license-text-diagnostics": true
},
"notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
"start_timestamp": "2020-09-30T204021.645462",
"end_timestamp": "2020-09-30T204023.006937",
"duration": 1.3614952564239502,
"message": null,
"errors": [],
"extra_data": {
"files_count": 1
}
}
],
"files": [
{
"path": "NOTICE.1",
"type": "file",
"licenses": [
{
"key": "apache-2.0",
"score": 95.0,
"name": "Apache License 2.0",
"short_name": "Apache 2.0",
"category": "Permissive",
"is_exception": false,
"owner": "Apache Software Foundation",
"homepage_url": "http://www.apache.org/licenses/",
"text_url": "http://www.apache.org/licenses/LICENSE-2.0",
"reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
"spdx_license_key": "Apache-2.0",
"spdx_url": "https://spdx.org/licenses/Apache-2.0",
"start_line": 4,
"end_line": 5,
"matched_rule": {
"identifier": "apache_5.RULE",
"license_expression": "apache-2.0 OR apache-1.1",
"licenses": [
"apache-2.0",
"apache-1.1"
],
"is_license_text": false,
"is_license_notice": false,
"is_license_reference": true,
"is_license_tag": false,
"matcher": "2-aho",
"rule_length": 14,
"matched_length": 14,
"match_coverage": 100.0,
"rule_relevance": 95.0
},
"matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
},
{
"key": "apache-1.1",
"score": 95.0,
"name": "Apache License 1.1",
"short_name": "Apache 1.1",
"category": "Permissive",
"is_exception": false,
"owner": "Apache Software Foundation",
"homepage_url": "http://www.apache.org/licenses/",
"text_url": "http://apache.org/licenses/LICENSE-1.1",
"reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-1.1",
"spdx_license_key": "Apache-1.1",
"spdx_url": "https://spdx.org/licenses/Apache-1.1",
"start_line": 4,
"end_line": 5,
"matched_rule": {
"identifier": "{
"headers": [
{
"tool_name": "scancode-toolkit",
"tool_version": "3.2.1rc2",
"options": {
"input": [
"NOTICE.1"
],
"--json-pp": "-",
"--license": true,
"--license-text": true,
"--license-text-diagnostics": true
},
"notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
"start_timestamp": "2020-09-30T204021.645462",
"end_timestamp": "2020-09-30T204023.006937",
"duration": 1.3614952564239502,
"message": null,
"errors": [],
"extra_data": {
"files_count": 1
}
}
],
"files": [
{
"path": "NOTICE.1",
"type": "file",
"licenses": [
{
"key": "apache-2.0",
"score": 95.0,
"name": "Apache License 2.0",
"short_name": "Apache 2.0",
"category": "Permissive",
"is_exception": false,
"owner": "Apache Software Foundation",
"homepage_url": "http://www.apache.org/licenses/",
"text_url": "http://www.apache.org/licenses/LICENSE-2.0",
"reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
"spdx_license_key": "Apache-2.0",
"spdx_url": "https://spdx.org/licenses/Apache-2.0",
"start_line": 4,
"end_line": 5,
"matched_rule": {
"identifier": "apache_5.RULE",
"license_expression": "apache-2.0 OR apache-1.1",
"licenses": [
"apache-2.0",
"apache-1.1"
],
"is_license_text": false,
"is_license_notice": false,
"is_license_reference": true,
"is_license_tag": false,
"matcher": "2-aho",
"rule_length": 14,
"matched_length": 14,
"match_coverage": 100.0,
"rule_relevance": 95.0
},
"matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
},
{
"key": "apache-1.1",
"score": 95.0,
"name": "Apache License 1.1",
"short_name": "Apache 1.1",
"category": "Permissive",
"is_exception": false,
"owner": "Apache Software Foundation",
"homepage_url": "http://www.apache.org/licenses/",
"text_url": "http://apache.org/licenses/LICENSE-1.1",
"reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-1.1",
"spdx_license_key": "Apache-1.1",
"spdx_url": "https://spdx.org/licenses/Apache-1.1",
"start_line": 4,
"end_line": 5,
"matched_rule": {
"identifier": "apache_5.RULE",
"license_expression": "apache-2.0 OR apache-1.1",
"licenses": [
"apache-2.0",
"apache-1.1"
],
"is_license_text": false,
"is_license_notice": false,
"is_license_reference": true,
"is_license_tag": false,
"matcher": "2-aho",
"rule_length": 14,
"matched_length": 14,
"match_coverage": 100.0,
"rule_relevance": 95.0
},
"matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
}
],
"license_expressions": [
"apache-2.0 OR apache-1.1"
],
"percentage_of_license_text": 46.67,
"scan_errors": []
}
]
}
",
"license_expression": "apache-2.0 OR apache-1.1",
"licenses": [
"apache-2.0",
"apache-1.1"
],
"is_license_text": false,
"is_license_notice": false,
"is_license_reference": true,
"is_license_tag": false,
"matcher": "2-aho",
"rule_length": 14,
"matched_length": 14,
"match_coverage": 100.0,
"rule_relevance": 95.0
},
"matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
}
],
"license_expressions": [
"apache-2.0 OR apache-1.1"
],
"percentage_of_license_text": 46.67,
"scan_errors": []
}
]
}
See https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.yml and https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.RULE
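The scan output above lists apache-2.0 and apache-1.1 as two separate entries, but both come from a single match of apache_5.RULE on lines 4-5. A minimal sketch that makes this visible by grouping the reported keys per rule match, assuming the 3.2.x JSON layout shown above; the function name `keys_by_rule_match` is hypothetical and not part of ScanCode:

```python
from collections import defaultdict


def keys_by_rule_match(scan):
    """Group reported license keys by (path, rule identifier, start_line, end_line).

    Assumes the layout of the scan shown above: files[].licenses[] entries
    carrying "key", "start_line", "end_line" and a "matched_rule" mapping.
    """
    grouped = defaultdict(list)
    for file_entry in scan.get("files", []):
        for lic in file_entry.get("licenses", []):
            rule = lic.get("matched_rule", {})
            match_id = (
                file_entry["path"],
                rule.get("identifier"),
                lic.get("start_line"),
                lic.get("end_line"),
            )
            grouped[match_id].append(lic["key"])
    return dict(grouped)


# Trimmed copy of the scan above, keeping only the fields used here.
scan = {
    "files": [
        {
            "path": "NOTICE.1",
            "licenses": [
                {"key": "apache-2.0", "start_line": 4, "end_line": 5,
                 "matched_rule": {"identifier": "apache_5.RULE"}},
                {"key": "apache-1.1", "start_line": 4, "end_line": 5,
                 "matched_rule": {"identifier": "apache_5.RULE"}},
            ],
        }
    ]
}

for match_id, keys in keys_by_rule_match(scan).items():
    print(match_id, keys)
```

A group with more than one key points at a rule whose license expression is a choice (here "apache-2.0 OR apache-1.1") rather than at two independent detections.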
In the end, this is a notice that there is some Apache-licensed code and not really a license notice per se. This is something that should be moved to a separate "unknown" license detection option, as suggested in #2257. What do you think?
I think that is a good start, although it may not remove the final problem: E.g. projects around Spring (or in general large Java Projects) often use a lot of components that are either from the Apache Foundation or follow their notice file format. That means one might be faced with hundreds of these - now apache 1.1, later "unknown" detections.
I understand that this is an oddly specific case, but might there be a way to conclude from:
... that there is indeed only the Apache-2.0 present? That would remove a quite massive manual effort when looking at larger component databases. I'm not familiar enough with your rule framework right now to estimate if this is possible and/or feasible.
Here is a chat log with @daniel-eder
@pombredanne
the license detection with scancode is fairly simple (conceptually at least) so there is no provision by default to look at anything else but one file when detecting proper... anything that would be taking into account the context (such as is there an Apache 1.1 or 2.0 detected around) would have to be a plugin in the "post scan" step (which would have full latitude to look at the neighboring context)
And that could be something where we can craft a new specific mini rule system to that effect e.g. if
then apache-2.0
alternatively we could treat this one rule as Apache-2.0 and be done with it as it will be correct in 95% of the cases
and the 5% cases where it should have been Apache-1.1 do not matter since the ASF relicensed all their Apache-1.1 to Apache-2.0
or the rule could be dropped
Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case
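A rough sketch of what such a context-aware post-scan step might look like. The chat leaves the actual condition unspecified, so the one used here (a plain apache-2.0 detection somewhere else in the same scan) is purely an assumption, and `resolve_ambiguous_apache` is a hypothetical helper, not a ScanCode plugin API:

```python
AMBIGUOUS = "apache-2.0 OR apache-1.1"


def resolve_ambiguous_apache(files):
    """Return new file entries where the ambiguous Apache choice is narrowed.

    Assumed condition (not from ScanCode): if any file in the scan reports a
    plain "apache-2.0" expression, treat "apache-2.0 OR apache-1.1" as
    "apache-2.0" everywhere. Otherwise leave the expressions untouched.
    """
    has_plain_apache2 = any(
        "apache-2.0" in f.get("license_expressions", []) for f in files
    )
    resolved = []
    for f in files:
        exprs = f.get("license_expressions", [])
        if has_plain_apache2:
            narrowed = ["apache-2.0" if e == AMBIGUOUS else e for e in exprs]
            # Drop duplicates created by the narrowing, keeping order.
            exprs = list(dict.fromkeys(narrowed))
        resolved.append({**f, "license_expressions": exprs})
    return resolved


files = [
    {"path": "LICENSE", "license_expressions": ["apache-2.0"]},
    {"path": "NOTICE", "license_expressions": [AMBIGUOUS]},
]
for entry in resolve_ambiguous_apache(files):
    print(entry["path"], entry["license_expressions"])
```

The input entries are copied rather than modified, so the original scan results stay available for review next to the narrowed ones.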
@daniel-eder
Ok, that makes sense; now I understand the scan system better. I think that in the long run a post-process scan step can make sense, unless of course we assume that other tools such as antenna or ORT take that place in the great scheme of things. I do think that a rule specific to this case could work out, as it's extremely unlikely that anything is affected wrongly by it.
Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case
Can you explain this further? What would the output as spdx be in that case? once "Apache-2.0" for the actual license, and once "Apache-1.1-to-Apache-2.0"?
@pombredanne
unless of course we assume that other tools such as antenna or ORT take that place in the great scheme of things
That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)
@daniel-eder
I'm currently looking at this from a perspective where ScanCode output is further processed by ORT, and ideally there would be a way to automatically conclude "Apache-2.0" in ScanCode, without overriding each package it is found in. It sounds like the "unknown-license" approach may work for that, but I'm not sure I fully understand it.
That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)
+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains
@pombredanne
Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case
Can you explain this further? What would the output as spdx be in that case? once "Apache-2.0" for the actual license, and once "Apache-1.1-to-Apache-2.0"?
the output would be exactly the same as today, but moved to a different section of the scan results that would be called "unknown_license". The expression returned there would be either the current one (Apache-1.1 OR Apache-2.0) or only Apache-2.0. We could also entirely drop that rule... which is after all a weak license clue
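To make the idea concrete, here is a small sketch of moving weak matches into a separate "unknown_license" section while leaving their content unchanged. The section name comes from the discussion above; the selection criterion (the rule's `is_license_reference` flag) and the helper name `split_unknown` are assumptions made for illustration only:

```python
def split_unknown(file_entry):
    """Return a copy of a file entry with reference-only matches moved aside.

    Assumption: a match counts as a weak clue when its matched rule has
    "is_license_reference" set to true. The matches themselves are not
    altered, only reported under "unknown_license" instead of "licenses".
    """
    kept, unknown = [], []
    for lic in file_entry.get("licenses", []):
        rule = lic.get("matched_rule", {})
        if rule.get("is_license_reference"):
            unknown.append(lic)
        else:
            kept.append(lic)
    return {**file_entry, "licenses": kept, "unknown_license": unknown}


entry = {
    "path": "NOTICE.1",
    "licenses": [
        {"key": "apache-2.0",
         "matched_rule": {"identifier": "apache_5.RULE",
                          "is_license_reference": True}},
        {"key": "apache-1.1",
         "matched_rule": {"identifier": "apache_5.RULE",
                          "is_license_reference": True}},
    ],
}
result = split_unknown(entry)
print(len(result["licenses"]), len(result["unknown_license"]))
```

Downstream tools could then decide separately how much weight to give the "unknown_license" entries.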
@daniel-eder
the output would be exactly the same as today, but moved to a different section of the scan results that would be called "unknown_license". The expression returned there would be either the current one (Apache-1.1 OR Apache-2.0) or only Apache-2.0
Ok understood, thank you for the clarification. It would definitely be a first step towards more context in any post process step.
@pombredanne
+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains
this is a rather different take, where you can script complex analyses rather than having one monolithic, one-size-fits-all analysis
For instance, the first application is the analysis of Docker images, rootfs and VM images, which are rather complex: https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/docker.py
@daniel-eder
we could also entirely drop that rule... which is after all a weak license clue
I guess this comes down to a philosophical question, but from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache)
[...]
@pombredanne
from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache)
it does not just look for ~1,000 regex patterns like FOSSology, but does a pair-wise diff against many texts (long, short and everything in between), about 20,000 of them.
So yes, a bona fide Apache license will be detected anyway, as will notices and mentions
@daniel-eder
In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)
@pombredanne
In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)
I have never seen that rule detected in a context where no Apache license notices or license texts were otherwise present in the code
So I will do this:
See also this ticket nexB/scancode-toolkit#1675, this comment https://github.com/nexB/scancode-toolkit/issues/377#issuecomment-266032216 and this ticket nexB/scancode-toolkit#1379, which are all related to similar issues. For instance, "see license in COPYING" should be able to follow what is found in COPYING :) The same goes for this slightly more structured case: nexB/scancode-toolkit#1364
Short term I am making these return an apache-2.0 license with a relevance of 95%
Description
When scanning projects from the Apache foundation, such as log4j-core, ScanCode mistakenly detects an Apache-1.1 license in addition to the Apache-2.0 license that is actually used. The mistaken detection happens on the "notice" files that refer to the copyright holder and/or the license.
A scan with the default options
-clpeui -n 2 --json-pp <file> <directory>
from the "Getting Started" section of the documentation.
How To Reproduce
scancode -clpeui -n 2 --json-pp log4j-core.json logging-log4j2-master/log4j-core
System configuration