aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

MIT license not detected in package.json #3843

Open vw-anton opened 5 months ago

vw-anton commented 5 months ago

Description

From the following file ScanCode does not extract "MIT" license when running ScanCode without --package option: https://github.com/components/font-awesome/blob/f4f114c4ab37d101e6a15370769bc0af681792fa/package.json

    scanner:
      name: "ScanCode"
      version: "32.1.0"
      configuration: "--copyright --license --license-text --info --strip-root --timeout\
        \ 600 --json-pp"
    summary:
      start_time: "2024-06-28T10:59:46.000199521Z"
      end_time: "2024-06-28T11:01:51.000822060Z"
      licenses:
      - license: "CC-BY-4.0"
        location:
          path: "package.json"
          start_line: 10
          end_line: 11
      - license: "OFL-1.1"
        location:
          path: "package.json"
          start_line: 13
          end_line: 13
        score: 50.0

This is also reflected by the result of scancode.io which reports:

      "path": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
      "type": "file",
      "name": "package.json",

       ...

      "detected_license_expression": "cc-by-4.0 AND ofl-1.1",
      "detected_license_expression_spdx": "CC-BY-4.0 AND OFL-1.1",
      "license_detections": [
        {
          "license_expression": "cc-by-4.0 AND ofl-1.1",
          "license_expression_spdx": "CC-BY-4.0 AND OFL-1.1",
          "matches": [
            {
              "license_expression": "cc-by-4.0",
              "spdx_license_expression": "CC-BY-4.0",
              "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
              "start_line": 10,
              "end_line": 11,
              "matcher": "2-aho",
              "score": 100.0,
              "matched_length": 5,
              "match_coverage": 100.0,
              "rule_relevance": 100,
              "rule_identifier": "cc-by-4.0_103.RULE",
              "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/cc-by-4.0_103.RULE",
              "matched_text": "  \"license\": [\n    \"CC-BY-4.0\","
            },
            {
              "license_expression": "ofl-1.1",
              "spdx_license_expression": "OFL-1.1",
              "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
              "start_line": 13,
              "end_line": 13,
              "matcher": "2-aho",
              "score": 50.0,
              "matched_length": 3,
              "match_coverage": 100.0,
              "rule_relevance": 50,
              "rule_identifier": "spdx_license_id_ofl-1.1_for_ofl-1.1.RULE",
              "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_ofl-1.1_for_ofl-1.1.RULE",
              "matched_text": "    \"OFL-1.1\""
            }
          ],
          "identifier": "cc_by_4_0_and_ofl_1_1-bbdb0005-3895-360f-06e7-55f139405d2f"
        }
      ],
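To make the gap in the result above easier to see, here is a minimal sketch that lists the matches of a file-level entry and reports which lines of the license array have no match at all. The `FILE_ENTRY` data is a trimmed copy of the fields quoted above; the helper names `summarize_matches` and `uncovered_lines` are made up for this illustration and are not part of ScanCode.

```python
import json

# Trimmed copy of the file-level result quoted above (only the fields used here).
FILE_ENTRY = json.loads("""
{
  "name": "package.json",
  "detected_license_expression_spdx": "CC-BY-4.0 AND OFL-1.1",
  "license_detections": [
    {
      "license_expression_spdx": "CC-BY-4.0 AND OFL-1.1",
      "matches": [
        {"spdx_license_expression": "CC-BY-4.0", "start_line": 10, "end_line": 11,
         "matcher": "2-aho", "score": 100.0, "rule_relevance": 100},
        {"spdx_license_expression": "OFL-1.1", "start_line": 13, "end_line": 13,
         "matcher": "2-aho", "score": 50.0, "rule_relevance": 50}
      ]
    }
  ]
}
""")


def summarize_matches(file_entry):
    """Return (spdx_expression, start_line, end_line, matcher, score) per match."""
    rows = []
    for detection in file_entry.get("license_detections", []):
        for match in detection.get("matches", []):
            rows.append((
                match["spdx_license_expression"],
                match["start_line"],
                match["end_line"],
                match["matcher"],
                match["score"],
            ))
    return rows


def uncovered_lines(file_entry, first, last):
    """Return the line numbers in [first, last] that no match covers."""
    covered = set()
    for _, start, end, _, _ in summarize_matches(file_entry):
        covered.update(range(start, end + 1))
    return [n for n in range(first, last + 1) if n not in covered]


if __name__ == "__main__":
    for row in summarize_matches(FILE_ENTRY):
        print(row)
    print("lines without a match:", uncovered_lines(FILE_ENTRY, 10, 13))
```

Lines 10 to 13 are the span reported for the license array; the one line left uncovered is presumably where the "MIT" entry sits.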

How To Reproduce

Run ScanCode 32.1.0 via ORT 22.5.0

System configuration

pombredanne commented 5 months ago

@vw-anton I doubt we can detect this correctly at scale in a plain JSON file without the --package option, especially for MIT. "mit" is a very common German word (it means "with"), so the bare token is not discriminating enough to be detected on its own.

Why not use the --package option? It is designed for this purpose. IMHO we cannot detect licenses correctly when treating a package.json as a blob of text.
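A rough illustration of the ambiguity, assuming a naive case-insensitive whole-word matcher. This is not how ScanCode's rule-based detection works; `naive_license_hits` and both sample strings are hypothetical and only show why a bare "MIT" token cannot be trusted outside a known license field.

```python
import re


def naive_license_hits(text, license_id="MIT"):
    """Return every case-insensitive whole-word occurrence of license_id in text."""
    pattern = re.compile(rf"\b{re.escape(license_id)}\b", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical German prose: "mit" simply means "with" here.
german_text = "Ein Projekt mit Tests und mit Dokumentation."

# Hypothetical manifest fragment where "MIT" really is a license identifier.
manifest_text = '"license": ["CC-BY-4.0", "MIT", "OFL-1.1"]'

if __name__ == "__main__":
    print(naive_license_hits(german_text))
    print(naive_license_hits(manifest_text))
```

Both inputs produce hits, and the matcher alone has no way to tell the false positives from the real declaration. Knowing that the token sits in the `license` field of a package manifest is what resolves it.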

Some related issues:

vw-anton commented 5 months ago

We are not using it in ORT due to: https://oss-review-toolkit.slack.com/archives/C9NNJ54B1/p1719903918648839

pombredanne commented 5 months ago

We are not using it in ORT due to: https://oss-review-toolkit.slack.com/archives/C9NNJ54B1/p1719903918648839

Let me paste this thread here for reference:

Anton (VW) 1 day ago Morning guys, I have a very strange case of a missing license finding: We ran ScanCode via ORT (22.5.0) on https://github.com/components/font-awesome/blob/f4f114c4ab37d101e6a15370769bc0af681792fa/package.json and would expect three licenses (CC-BY-4.0, MIT, OFL-1.1). However in the ORT result MIT is missing. When I run ScanCode via scancode.io all licenses are found. In ORT and in scancode.io the same ScanCode version (32.1.0) is used. Does anybody have an idea where the gap might come from?

ScanCode.io result:

    "license_detections": [
      {
        "license_expression": "cc-by-4.0",
        "license_expression_spdx": "CC-BY-4.0",
        "matches": [
          {
            "license_expression": "cc-by-4.0",
            "spdx_license_expression": "CC-BY-4.0",
            "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
            "start_line": 1,
            "end_line": 1,
            "matcher": "1-hash",
            "score": 50.0,
            "matched_length": 4,
            "match_coverage": 100.0,
            "rule_relevance": 50,
            "rule_identifier": "spdx_license_id_cc-by-4.0_for_cc-by-4.0.RULE",
            "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_cc-by-4.0_for_cc-by-4.0.RULE",
            "matched_text": "CC-BY-4.0"
          }
        ],
        "identifier": "cc_by_4_0-415c083c-ccd1-233c-986e-75bb1ddc3fdc"
      },
      {
        "license_expression": "mit",
        "license_expression_spdx": "MIT",
        "matches": [
          {
            "license_expression": "mit",
            "spdx_license_expression": "MIT",
            "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
            "start_line": 1,
            "end_line": 1,
            "matcher": "1-spdx-id",
            "score": 100.0,
            "matched_length": 1,
            "match_coverage": 100.0,
            "rule_relevance": 100,
            "rule_identifier": "spdx-license-identifier-mit-5da48780aba670b0860c46d899ed42a0f243ff06",
            "rule_url": null,
            "matched_text": "MIT"
          }
        ],
        "identifier": "mit-a822f434-d61f-f2b1-c792-8b8cb9e7b9bf"
      },
      {
        "license_expression": "ofl-1.1",
        "license_expression_spdx": "OFL-1.1",
        "matches": [
          {
            "license_expression": "ofl-1.1",
            "spdx_license_expression": "OFL-1.1",
            "from_file": "codebase/font-awesome-f4f114c4ab37d101e6a15370769bc0af681792fa/package.json",
            "start_line": 1,
            "end_line": 1,
            "matcher": "1-hash",
            "score": 50.0,
            "matched_length": 3,
            "match_coverage": 100.0,
            "rule_relevance": 50,
            "rule_identifier": "spdx_license_id_ofl-1.1_for_ofl-1.1.RULE",
            "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_ofl-1.1_for_ofl-1.1.RULE",
            "matched_text": "OFL-1.1"
          }
        ],
        "identifier": "ofl_1_1-52c45f4c-8cce-acf4-9ef3-6682faf0c586"
      }

vs ORT result:

    scanner:
      name: "ScanCode"
      version: "32.1.0"
      configuration: "--copyright --license --license-text --info --strip-root --timeout\
        \ 600 --json-pp"
    summary:
      start_time: "2024-06-28T10:59:46.000199521Z"
      end_time: "2024-06-28T11:01:51.000822060Z"
      licenses:
      - license: "CC-BY-4.0"
        location:
          path: "package.json"
          start_line: 10
          end_line: 11
      - license: "OFL-1.1"
        location:
          path: "package.json"
          start_line: 13
          end_line: 13
        score: 50.0

sschuberth 1 day ago ORT (deliberately) does not run ScanCode with the --package option. Is the MIT finding maybe only present in the raw result with that option?

sschuberth 1 day ago Because the start / end line of 1 is also a bit suspicious / clearly wrong.

sschuberth 1 day ago In any case, ORT should report a declared license of MIT for that package, so in total no license information is lost.

Anton (VW) 1 day ago Thanks for the hint, will check that. Would you recommend to always enable the package option?

sschuberth 1 day ago I would recommend to always disable it when using ORT, that's why that's the default :wink: One of the reasons for this is that enabling it breaks ORT's semantics to clearly distinguish between "detected" and "declared" licenses, as --package causes ScanCode to report declared licenses as detected licenses.
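The "declared" vs "detected" split described in the thread can be sketched as follows. This is only an illustration of the idea, not ORT's or ScanCode's implementation: `declared_licenses` is a hypothetical helper, `MANIFEST` is a made-up manifest shaped like the license array discussed in this issue, and `DETECTED` is taken from the ORT result quoted above.

```python
import json


def declared_licenses(manifest_text):
    """Read the license declaration(s) from an npm-style manifest.

    Handles a single string, a list of strings, and the legacy
    {"type": ...} object form under either "license" or "licenses".
    """
    manifest = json.loads(manifest_text)
    value = manifest.get("license", manifest.get("licenses", []))
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        value = [value]
    result = []
    for item in value:
        if isinstance(item, str):
            result.append(item)
        elif isinstance(item, dict) and "type" in item:
            result.append(item["type"])
    return result


# Hypothetical manifest, shaped like the license array discussed in this issue.
MANIFEST = '{"name": "example-package", "license": ["CC-BY-4.0", "MIT", "OFL-1.1"]}'

# Licenses detected by the text scan, as listed in the ORT result above.
DETECTED = {"CC-BY-4.0", "OFL-1.1"}

if __name__ == "__main__":
    declared = declared_licenses(MANIFEST)
    only_declared = [lic for lic in declared if lic not in DETECTED]
    print("declared:", declared)
    print("detected:", sorted(DETECTED))
    print("declared but not detected:", only_declared)
```

Keeping the two sets separate means MIT still shows up as a declared license even though the plain text scan did not detect it, so no license information is lost overall.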

pombredanne commented 5 months ago

@vw-anton re:

clearly distinguish between "detected" and "declared" licenses, as --package causes ScanCode to report declared licenses as detected licenses.

We track these licenses at the package level

Both are normalized licenses on which we ran ScanCode license detection, possibly using package-type-specific conventions.

We also track:

declared_license_expression is generally consistent with the SPDX definition. There is no such thing as a "detected license" in SPDX, and we do not track a concluded license in ScanCode toolkit since, as a tool, it does not conclude anything.

So please consider using the --package option, which is how we implemented correct license detection for this case. I am open to refinements, improvements and enhancements, but you already have a designed, tested and correct way to detect all these licenses right now without making any changes.
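For consumers that want to keep package-level declarations apart from file-level detections, a small sketch of reading them from a scan result might look like this. It assumes the JSON output has a top-level `packages` list whose entries carry `declared_license_expression` and `declared_license_expression_spdx`. The `SCAN` sample is entirely hypothetical, including the package name and expressions, and is not the actual output for the file in this issue.

```python
import json

# Hypothetical, heavily trimmed scan output; field layout is an assumption.
SCAN = json.loads("""
{
  "packages": [
    {
      "type": "npm",
      "name": "example-package",
      "declared_license_expression": "cc-by-4.0 AND mit AND ofl-1.1",
      "declared_license_expression_spdx": "CC-BY-4.0 AND MIT AND OFL-1.1"
    }
  ]
}
""")


def declared_by_package(scan):
    """Map each package name to its declared SPDX license expression."""
    return {
        package.get("name"): package.get("declared_license_expression_spdx")
        for package in scan.get("packages", [])
    }


if __name__ == "__main__":
    for name, expression in declared_by_package(SCAN).items():
        print(f"{name}: declared {expression}")
```

Reading the declaration from the package entry rather than from file-level matches is one way a downstream tool could preserve its own "declared" category.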