aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

scancode-toolkit-31.0.2 returns an unknown-license-reference just before the mit-nagy text #3078

Open DennisClark opened 2 years ago

DennisClark commented 2 years ago

I scanned doris-1.1.1-rc03 (available at https://github.com/apache/doris/archive/refs/tags/1.1.1-rc03.tar.gz) using scancode-toolkit-31.0.2. It detected most of the licenses in the rather complex notice (attached) in doris-1.1.1-rc03/be/src/glibc-compatibility/musl/COPYRIGHT, but it returns both unknown-license-reference and mit-nagy for this chunk of text:

The BSD PRNG implementation (src/prng/random.c) and XSI search API
(src/search/*.c) functions are Copyright © 2011 Szabolcs Nagy and
licensed under following terms: "Permission to use, copy, modify,
and/or distribute this code for any purpose with or without fee is
hereby granted. There is no warranty."

Although the mit-nagy license is returned for this (and the quoted text is an exact match), the result first returns unknown-license-reference for the third line of this paragraph:

licensed under following terms: "Permission to use, copy, modify,

See lines 47951 through 48028 in the attached scan results to see both detection instances.

Summary: It appears that the scan was misled by the single line (the problem), but it then found the correct license when it looked at the entire text (good). It would of course be best if nothing were returned for the false-positive match on unknown-license-reference.

COPYRIGHT.zip

doris-1.1.1-rc03-results.json.zip

AyanSinhaMahapatra commented 2 years ago

@DennisClark this is already fixed in the LicenseDetection branch for the upcoming release: https://github.com/nexB/scancode-toolkit/tree/add-license-detection.

This is similar to Issue 2 in https://github.com/nexB/scancode-toolkit/issues/3069#issuecomment-1237003830 and to the issue reported by the Eclipse Foundation here: https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1128612554. It is solved as follows:

Here the detection rule is "unknown-intro-followed-by-match", i.e. an unknown intro was found, followed by a proper detection, so the unknown can be dropped from the detected license expression. This is achieved by tagging specific rules with is_license_intro set to True.
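To make the idea concrete, here is a minimal sketch of the "unknown intro followed by a proper match" filtering. It is not the toolkit's actual code: the function names `drop_unknown_intros` and `detection_expression` are hypothetical, and matches are modelled as plain dicts that reuse only the `license_expression` and `is_license_intro` fields seen in the scan output.

```python
def drop_unknown_intros(matches):
    """Drop unknown intro matches that are directly followed by a proper match.

    Hypothetical helper: `matches` is a list of dicts ordered by position.
    """
    kept = []
    for index, match in enumerate(matches):
        following = matches[index + 1] if index + 1 < len(matches) else None
        is_unknown_intro = (
            match["is_license_intro"] and "unknown" in match["license_expression"]
        )
        followed_by_proper = (
            following is not None
            and not following["is_license_intro"]
            and "unknown" not in following["license_expression"]
        )
        if is_unknown_intro and followed_by_proper:
            # The intro only announces the license that follows, so skip it.
            continue
        kept.append(match)
    return kept


def detection_expression(matches):
    """Join the unique expressions of the remaining matches with AND."""
    expressions = []
    for match in drop_unknown_intros(matches):
        if match["license_expression"] not in expressions:
            expressions.append(match["license_expression"])
    return " AND ".join(expressions)


matches = [
    {"license_expression": "unknown-license-reference", "is_license_intro": True},
    {"license_expression": "mit-nagy", "is_license_intro": False},
]
print(detection_expression(matches))
```

In this sketch an intro that is not followed by a proper match is kept, so a dangling "licensed under" would still surface as unknown-license-reference.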

New license detection looks like this:

      "detected_license_expression": "mit-nagy",
      "detected_license_expression_spdx": "LicenseRef-scancode-mit-nagy",
      "license_detections": [
        {
          "license_expression": "mit-nagy",
          "detection_rules": [
            "unknown-intro-followed-by-match"
          ],
          "matches": [
            {
              "score": 50.0,
              "start_line": 3,
              "end_line": 3,
              "matched_length": 2,
              "match_coverage": 100.0,
              "matcher": "2-aho",
              "license_expression": "unknown-license-reference",
              "rule_identifier": "license-intro_2.RULE",
              "referenced_filenames": [],
              "is_license_text": false,
              "is_license_notice": false,
              "is_license_reference": false,
              "is_license_tag": false,
              "is_license_intro": true,
              "rule_length": 2,
              "rule_relevance": 50,
              "matched_text": "licensed under",
              "licenses": [
                {
                  "key": "unknown-license-reference",
                  "name": "Unknown License file reference",
                  "short_name": "Unknown License reference",
                  "category": "Unstated License",
                  "is_exception": false,
                  "is_unknown": true,
                  "owner": "Unspecified",
                  "homepage_url": null,
                  "text_url": "",
                  "reference_url": "https://scancode-licensedb.aboutcode.org/unknown-license-reference",
                  "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE",
                  "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml",
                  "spdx_license_key": "LicenseRef-scancode-unknown-license-reference",
                  "spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE"
                }
              ]
            },
            {
              "score": 100.0,
              "start_line": 3,
              "end_line": 5,
              "matched_length": 24,
              "match_coverage": 100.0,
              "matcher": "2-aho",
              "license_expression": "mit-nagy",
              "rule_identifier": "mit-nagy.LICENSE",
              "referenced_filenames": [],
              "is_license_text": true,
              "is_license_notice": false,
              "is_license_reference": false,
              "is_license_tag": false,
              "is_license_intro": false,
              "rule_length": 24,
              "rule_relevance": 100,
              "matched_text": "Permission to use, copy, modify,\nand/or distribute this code for any purpose with or without fee is\nhereby granted. There is no warranty.\"",
              "licenses": [
                {
                  "key": "mit-nagy",
                  "name": "MIT Szabolcs Nagy Variant",
                  "short_name": "MIT Nagy Variant",
                  "category": "Permissive",
                  "is_exception": false,
                  "is_unknown": false,
                  "owner": "Szabolcs Nagy",
                  "homepage_url": null,
                  "text_url": "https://git.musl-libc.org/cgit/musl/commit/src/prng/random.c?id=1569f396bb76e9d54f6c4492ed6778e37b87bc70",
                  "reference_url": "https://scancode-licensedb.aboutcode.org/mit-nagy",
                  "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mit-nagy.LICENSE",
                  "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mit-nagy.yml",
                  "spdx_license_key": "LicenseRef-scancode-mit-nagy",
                  "spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/mit-nagy.LICENSE"
                }
              ]
            }
          ]
        }
      ],
      "license_clues": [],

There was also a bug in how we group matches into a LicenseDetection. I have fixed it so that license intros are factored in when doing this grouping.
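As a rough illustration of what factoring intros into the grouping could mean, here is a sketch under stated assumptions. The `group_matches` function, the `lines_gap` parameter and its default of 4 are all hypothetical and not taken from the toolkit. Matches are plain dicts with `start_line`, `end_line` and `is_license_intro`, as in the scan output.

```python
def group_matches(matches, lines_gap=4):
    """Group matches into detection candidates (illustrative sketch only).

    Assumed rules: an intro always starts a new group, the match right
    after an intro always joins it, and other matches join the current
    group only when they start within `lines_gap` lines of its end.
    """
    groups = []
    current = []
    for match in sorted(matches, key=lambda m: (m["start_line"], m["end_line"])):
        if not current:
            current = [match]
            continue
        previous = current[-1]
        if previous["is_license_intro"]:
            # Keep the intro together with the match it introduces.
            current.append(match)
        elif match["is_license_intro"]:
            # An intro belongs with what follows, not with what precedes.
            groups.append(current)
            current = [match]
        elif match["start_line"] <= previous["end_line"] + lines_gap:
            current.append(match)
        else:
            groups.append(current)
            current = [match]
    if current:
        groups.append(current)
    return groups


example = [
    {"license_expression": "bsd-new", "is_license_intro": False, "start_line": 1, "end_line": 1},
    {"license_expression": "unknown-license-reference", "is_license_intro": True, "start_line": 3, "end_line": 3},
    {"license_expression": "mit-nagy", "is_license_intro": False, "start_line": 3, "end_line": 5},
    {"license_expression": "apache-2.0", "is_license_intro": False, "start_line": 20, "end_line": 20},
]
for group in group_matches(example):
    print([m["license_expression"] for m in group])
```

With these assumed rules the intro on line 3 is grouped with the mit-nagy text it introduces rather than with the unrelated match above it.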

Here are the scan results for you to look at:

Old scan just this issue: doris-issue-3078.json.txt

New scan just this issue: doris-add-license-detection-issue-3078.json.txt

Old scan entire file: doris-v31.1.1-LICENSE-dist.json.txt

New scan entire file: doris-add-license-detection-LICENSE-dist.json.txt