aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Scan detects Apache-1.1 instead of/in addition to Apache-2.0 in notice files by Apache foundation. #2266

Open daniel-eder opened 4 years ago

daniel-eder commented 4 years ago

Description

When scanning projects from the Apache foundation, such as log4j-core, ScanCode mistakenly detects Apache-1.1 license, in addition to the actually used Apache-2.0. The mistaken detection happens on the "notice" files that refer to the copyright holder and/or the license.

The scan was run with the default options -clpeui -n 2 --json-pp <file> <directory> from the "Getting Started" section of the documentation.

How To Reproduce

  1. Download the source code for log4j-core (or the full log4j, or any other apache foundation project)
  2. Run ScanCode Toolkit with the default options from the Getting Started Section: scancode -clpeui -n 2 --json-pp log4j-core.json logging-log4j2-master/log4j-core
  3. The "notice" file will report both Apache-2.0 and Apache-1.1, see log4j-core-result.zip

System configuration

pombredanne commented 4 years ago

Thank you for the report! See #2257 as it could be a solution. Here there is a rule that detects apache-1.1 OR apache-2.0 for this text:

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

When using the --license-text and --license-text-diagnostics options this becomes clearer:

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "3.2.1rc2",
      "options": {
        "input": [
          "NOTICE.1"
        ],
        "--json-pp": "-",
        "--license": true,
        "--license-text": true,
        "--license-text-diagnostics": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2020-09-30T204021.645462",
      "end_timestamp": "2020-09-30T204023.006937",
      "duration": 1.3614952564239502,
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "NOTICE.1",
      "type": "file",
      "licenses": [
        {
          "key": "apache-2.0",
          "score": 95.0,
          "name": "Apache License 2.0",
          "short_name": "Apache 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
          "spdx_license_key": "Apache-2.0",
          "spdx_url": "https://spdx.org/licenses/Apache-2.0",
          "start_line": 4,
          "end_line": 5,
          "matched_rule": {
            "identifier": "apache_5.RULE",
            "license_expression": "apache-2.0 OR apache-1.1",
            "licenses": [
              "apache-2.0",
              "apache-1.1"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 14,
            "matched_length": 14,
            "match_coverage": 100.0,
            "rule_relevance": 95.0
          },
          "matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
        },
        {
          "key": "apache-1.1",
          "score": 95.0,
          "name": "Apache License 1.1",
          "short_name": "Apache 1.1",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://apache.org/licenses/LICENSE-1.1",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-1.1",
          "spdx_license_key": "Apache-1.1",
          "spdx_url": "https://spdx.org/licenses/Apache-1.1",
          "start_line": 4,
          "end_line": 5,
          "matched_rule": {
            "identifier": "apache_5.RULE",
            "license_expression": "apache-2.0 OR apache-1.1",
            "licenses": [
              "apache-2.0",
              "apache-1.1"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 14,
            "matched_length": 14,
            "match_coverage": 100.0,
            "rule_relevance": 95.0
          },
          "matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
        }
      ],
      "license_expressions": [
        "apache-2.0 OR apache-1.1"
      ],
      "percentage_of_license_text": 46.67,
      "scan_errors": []
    }
  ]
}

See https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.yml and https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.RULE
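To make the point of the scan output easier to see, here is a minimal Python sketch that groups the reported license keys by the rule and line span that produced them. It works on a trimmed copy of the JSON above (only the fields it needs are kept); the helper name group_by_match is made up for illustration and is not part of ScanCode.

```python
import json
from collections import defaultdict

# Trimmed copy of the scan result shown above; only the fields used here are kept.
SCAN_JSON = """
{
  "files": [
    {
      "path": "NOTICE.1",
      "licenses": [
        {"key": "apache-2.0", "start_line": 4, "end_line": 5,
         "matched_rule": {"identifier": "apache_5.RULE",
                          "license_expression": "apache-2.0 OR apache-1.1"}},
        {"key": "apache-1.1", "start_line": 4, "end_line": 5,
         "matched_rule": {"identifier": "apache_5.RULE",
                          "license_expression": "apache-2.0 OR apache-1.1"}}
      ]
    }
  ]
}
"""


def group_by_match(scan):
    """Group reported license keys by (path, rule identifier, start line, end line)."""
    groups = defaultdict(list)
    for file_result in scan["files"]:
        for lic in file_result.get("licenses", []):
            rule = lic["matched_rule"]
            span = (file_result["path"], rule["identifier"],
                    lic["start_line"], lic["end_line"])
            groups[span].append(lic["key"])
    return dict(groups)


groups = group_by_match(json.loads(SCAN_JSON))
for span, keys in groups.items():
    print(span, "->", keys)
```

Both keys end up under the same (path, rule, lines) group, which shows that the apache-1.1 entry is not a second, independent detection: it is the other half of the single "apache-2.0 OR apache-1.1" expression carried by apache_5.RULE.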

In the end this is a notice that there is some Apache-licensed code and not really a license notice per se. This is something that should be moved to a separate "unknown" license detection option as suggested in #2257. What do you think?

daniel-eder commented 4 years ago

I think that is a good start, although it may not remove the final problem: E.g. projects around Spring (or in general large Java Projects) often use a lot of components that are either from the Apache Foundation or follow their notice file format. That means one might be faced with hundreds of these - now apache 1.1, later "unknown" detections.

I understand that this is an oddly specific case, but might there be a way to conclude from:

... that there is indeed only the Apache-2.0 present? That would remove a quite massive manual effort when looking at larger component databases. I'm not familiar enough with your rule framework right now to estimate if this is possible and/or feasible.

pombredanne commented 4 years ago

Here is a chat log with @daniel-eder

@pombredanne

the license detection with scancode is fairly simple (conceptually at least) so there is no provision by default to look at anything else but one file when detecting proper... anything that would be taking into account the context (such as is there an Apache 1.1 or 2.0 detected around) would have to be a plugin in the "post scan" step (which would have full latitude to look at the neighboring context)
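As a rough illustration of what such a context-aware post-scan step could do, here is a small standalone Python sketch. It is not a ScanCode plugin and does not use the ScanCode plugin API; the function name, the "same directory" notion of context, and the example paths are all assumptions made for this sketch. It rewrites the ambiguous expression to apache-2.0 only when a neighboring file already has an unambiguous apache-2.0 detection and none has apache-1.1.

```python
import posixpath

AMBIGUOUS = "apache-2.0 OR apache-1.1"


def resolve_ambiguous_apache(files):
    """Return new file results with the ambiguous expression resolved from context.

    Assumption: "context" means the unambiguous expressions found in the same directory.
    """
    # Collect the unambiguous expressions seen in each directory.
    by_dir = {}
    for f in files:
        directory = posixpath.dirname(f["path"])
        clear = (e for e in f["license_expressions"] if e != AMBIGUOUS)
        by_dir.setdefault(directory, set()).update(clear)

    resolved = []
    for f in files:
        neighbors = by_dir[posixpath.dirname(f["path"])]
        can_conclude = "apache-2.0" in neighbors and "apache-1.1" not in neighbors
        expressions = [
            "apache-2.0" if (e == AMBIGUOUS and can_conclude) else e
            for e in f["license_expressions"]
        ]
        # Drop duplicates while keeping the original order.
        expressions = list(dict.fromkeys(expressions))
        resolved.append({**f, "license_expressions": expressions})
    return resolved


# Hypothetical scan results: the paths are made up for this example.
files = [
    {"path": "log4j-core/LICENSE.txt", "license_expressions": ["apache-2.0"]},
    {"path": "log4j-core/NOTICE.txt", "license_expressions": [AMBIGUOUS]},
    {"path": "other/NOTICE", "license_expressions": [AMBIGUOUS]},
]
for f in resolve_ambiguous_apache(files):
    print(f["path"], f["license_expressions"])
```

The notice in log4j-core is concluded as apache-2.0 because a clear Apache-2.0 detection sits next to it, while the notice in the other directory is left untouched because nothing nearby supports a conclusion.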

And that could be something where we can craft a new specific mini rule system to that effect e.g. if

@daniel-eder

Ok that makes sense, now I understand the scan system better. I think that in the long run a post-process scan step can make sense, unless of course we assume that other tools such as antenna or ORT take that place in the great scheme of things. I do think that a rule specific to this case could work out, as it's extremely unlikely that anything is affected wrongly by it.

Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case

Can you explain this further? What would the output as spdx be in that case? once "Apache-2.0" for the actual license, and once "Apache-1.1-to-Apache-2.0"?

@pombredanne

unless of course we assume that other tools such as antenna or ORT take that place in the great scheme of things

That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)

@daniel-eder

I'm currently looking at this from a perspective where ScanCode is further processed by ORT, and ideally there would be a way to end up with a way to automatically conclude "Apache-2.0" in ScanCode, without overriding each package it is found in. It sounds like the "unknown-license" approach may work for it, but I'm not sure I fully understand it

That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)

+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains

@pombredanne

Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case

Can you explain this further? What would the output as spdx be in that case? once "Apache-2.0" for the actual license, and once "Apache-1.1-to-Apache-2.0"?

the output would be exactly the same as today but moved to a different section of the scan results that would be called "unknown_license", and the expression returned there would be either the current one, Apache-1.1 OR Apache-2.0, or we could use only Apache-2.0. We could also entirely drop that rule... which is after all a weak license clue
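A sketch of what that separation might look like on a single file result, in Python. Only the section name "unknown_license" comes from the discussion; the function name, the criterion for a "weak clue" (a license reference whose rule offers several alternatives), and the third match in the example are assumptions for illustration, not ScanCode behavior.

```python
def split_weak_clues(file_result):
    """Return a copy of file_result with weak clues moved to "unknown_license"."""
    confirmed, weak = [], []
    for lic in file_result["licenses"]:
        rule = lic["matched_rule"]
        # Assumption: a weak clue is a mere license reference whose rule
        # names several alternative licenses.
        is_weak = rule["is_license_reference"] and " OR " in rule["license_expression"]
        (weak if is_weak else confirmed).append(lic)
    return {**file_result, "licenses": confirmed, "unknown_license": weak}


notice_rule = {
    "identifier": "apache_5.RULE",
    "license_expression": "apache-2.0 OR apache-1.1",
    "is_license_reference": True,
}
# Hypothetical rule standing in for a match on a real license notice.
hypothetical_notice_rule = {
    "identifier": "hypothetical_apache-2.0_notice.RULE",
    "license_expression": "apache-2.0",
    "is_license_reference": False,
}
file_result = {
    "path": "NOTICE.1",
    "licenses": [
        {"key": "apache-2.0", "matched_rule": notice_rule},
        {"key": "apache-1.1", "matched_rule": notice_rule},
        {"key": "apache-2.0", "matched_rule": hypothetical_notice_rule},
    ],
}

result = split_weak_clues(file_result)
print([lic["key"] for lic in result["licenses"]])
print([lic["key"] for lic in result["unknown_license"]])
```

The information is kept, not discarded: the two entries produced by apache_5.RULE simply move out of the main license list, where a later post-processing step could still use them.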

@daniel-eder

the output would be exactly the same as today but moved to a different section of the scan results that would be called "unknown_license", and the expression returned there would be either the current one, Apache-1.1 OR Apache-2.0, or we could use only Apache-2.0

Ok understood, thank you for the clarification. It would definitely be a first step towards more context in any post process step.

@pombredanne

+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains

this is a rather different take where you can script complex analysis rather than having a monolithic one-way-for-all analysis problems

For instance the first application is for the analysis of Docker images and rootfs and VM images which are rather complex https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/docker.py

@daniel-eder

we could also entirely drop that rule... which is after all a weak license clue

I guess this comes down to a philosophical question, but from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache)

[...]

@pombredanne

from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache)

it does not just look for ~1,000 regex patterns like Fossology; it does a pair-wise diff against many texts (long, short and everything in between), about 20,000 of them.

So yes, a bona fide Apache license will be detected otherwise as well as notices and mentions
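To give a feel for text matching as opposed to keyword patterns, here is a toy Python sketch that compares a file against one rule text token by token using the standard library's difflib. This is not ScanCode's matcher (the output above names its matcher "2-aho"); the tokenizer, the NOTICE content, and the assumption that score is coverage times relevance are illustrative choices, although they do reproduce the rule_length, match_coverage and score values shown in the scan output.

```python
import re
from difflib import SequenceMatcher

# Rule text and relevance taken from the scan output above.
RULE_TEXT = (
    "This product includes software developed at\n"
    "The Apache Software Foundation (http://www.apache.org/)."
)
RULE_RELEVANCE = 95.0


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_rule(query_text, rule_text=RULE_TEXT, relevance=RULE_RELEVANCE):
    """Diff the query tokens against the rule tokens and summarize the match."""
    query_tokens = tokenize(query_text)
    rule_tokens = tokenize(rule_text)
    matcher = SequenceMatcher(None, query_tokens, rule_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    coverage = 100.0 * matched / len(rule_tokens)
    return {
        "rule_length": len(rule_tokens),
        "matched_length": matched,
        "match_coverage": round(coverage, 2),
        # Assumption: score = coverage * relevance / 100.
        "score": round(coverage * relevance / 100, 2),
    }


# Made-up NOTICE content; only the last two lines come from the issue.
NOTICE = (
    "Apache Log4j Core\n"
    "Copyright 1999-2012 Apache Software Foundation\n"
    "\n"
    "This product includes software developed at\n"
    "The Apache Software Foundation (http://www.apache.org/).\n"
)
result = match_rule(NOTICE)
print(result)
```

Because the comparison is against whole rule texts and not the word "Apache" alone, a full license text or a proper license notice would still match its own, more specific rules even if this short reference rule were dropped.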

@daniel-eder

In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)

@pombredanne

In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)

I have never seen that rule detected in a context where no Apache license notice or license text is otherwise present in the code

So I will do this:

See also this ticket nexB/scancode-toolkit#1675, this comment https://github.com/nexB/scancode-toolkit/issues/377#issuecomment-266032216, and this ticket nexB/scancode-toolkit#1379, which are all related to similar issues. For instance: "see license in COPYING" should be able to follow what is found in COPYING :) Same for this slightly more structured case: nexB/scancode-toolkit#1364

pombredanne commented 4 years ago

Short term I am making these return an apache-2.0 license with a relevance of 95%