aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://aboutcode.org/scancode/

Look in package-ecosystem specific key-files for referenced licenses #3707

Open AyanSinhaMahapatra opened 8 months ago

AyanSinhaMahapatra commented 8 months ago

If we scan beartype-0.17.2-py3-none-any.whl, we get the following scan output: beartype-0.17.2.json

This has a lot of unknown-license-references detected for the following reasons:

The following header is present in every .py file:

# --------------------( LICENSE                            )--------------------
# Copyright (c) 2014-2024 Beartype authors.
# See "LICENSE" for further details.

The referenced license file is present at beartype-0.17.2-py3-none-any.whl-extract/beartype-0.17.2.dist-info/LICENSE. This is not the root of the scan directory but an ecosystem-specific location that we need to look into. We miss this LICENSE file because we only look at sibling files and files at the root.

We should implement something like get_key_files() for each ecosystem-specific handler, and use it in the logic that follows license references.
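A minimal sketch of the lookup order described above, assuming the scan is available as a flat list of POSIX-style paths. Both `wheel_key_file_dirs()` and `resolve_license_reference()` are hypothetical names used for illustration only; they are not existing ScanCode functions, and the real handlers would work on codebase resources rather than plain strings.

```python
from pathlib import PurePosixPath


def wheel_key_file_dirs(all_paths):
    """Return the ``*.dist-info`` directories found among ``all_paths``.

    Hypothetical stand-in for an ecosystem-specific get_key_files():
    here the "ecosystem" is Python wheels only.
    """
    dirs = set()
    for path in all_paths:
        for parent in PurePosixPath(path).parents:
            if parent.name.endswith(".dist-info"):
                dirs.add(str(parent))
    return dirs


def resolve_license_reference(source_path, referenced_name, all_paths, root):
    """Find the file that ``source_path`` refers to as ``referenced_name``.

    Search order: sibling files, then the scan root, then the
    ecosystem-specific key-file directories. Returns a path or None.
    """
    known = set(all_paths)
    candidates = [
        PurePosixPath(source_path).parent / referenced_name,  # sibling file
        PurePosixPath(root) / referenced_name,  # file at the scan root
    ]
    # Key-file directories are sorted so the result is deterministic.
    candidates += [
        PurePosixPath(directory) / referenced_name
        for directory in sorted(wheel_key_file_dirs(all_paths))
    ]
    for candidate in candidates:
        if str(candidate) in known:
            return str(candidate)
    return None
```

With only the first two candidates, a `See "LICENSE"` header in `beartype/__init__.py` stays unresolved. The third step is what would let it reach `beartype-0.17.2.dist-info/LICENSE`.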

swastkk commented 6 months ago

Hey @AyanSinhaMahapatra, while scanning beautifulsoup4-4.12.3-py3-none-any.whl, the output I got was https://0x0.st/XPyV.whl.json and I can't find any unknown-license-references in it. So can I assume that the license file for beautifulsoup4 was found successfully? It is present at beautifulsoup4-4.12.3-py3-none-any.whl-extract/beautifulsoup4-4.12.3.dist-info/licenses/LICENSE, and we got these license detections for the package:

"license_detections": [
        {
          "license_expression": "mit",
          "license_expression_spdx": "MIT",
          "matches": [
            {
              "license_expression": "mit",
              "spdx_license_expression": "MIT",
              "from_file": "beautifulsoup4-4.12.3-py3-none-any.whl",
              "start_line": 1,
              "end_line": 1,
              "matcher": "1-hash",
              "score": 100.0,
              "matched_length": 2,
              "match_coverage": 100.0,
              "rule_relevance": 100,
              "rule_identifier": "mit_14.RULE",
              "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/mit_14.RULE",
              "matched_text": "MIT License"
            }
          ],
          "identifier": "mit-9967e727-165e-9bb5-f090-7de5e47a3929"
        },

Similarly, for beartype-0.17.2-py3-none-any.whl we got this license detection for the package:

"license_detections": [
        {
          "license_expression": "mit",
          "license_expression_spdx": "MIT",
          "matches": [
            {
              "license_expression": "mit",
              "spdx_license_expression": "MIT",
              "from_file": "beartype-0.17.2-py3-none-any.whl-extract/beartype-0.17.2.dist-info/METADATA",
              "start_line": 1,
              "end_line": 1,
              "matcher": "1-spdx-id",
              "score": 100.0,
              "matched_length": 1,
              "match_coverage": 100.0,
              "rule_relevance": 100,
              "rule_identifier": "spdx-license-identifier-mit-5da48780aba670b0860c46d899ed42a0f243ff06",
              "rule_url": null,
              "matched_text": "MIT"
            }
          ],
          "identifier": "mit-a822f434-d61f-f2b1-c792-8b8cb9e7b9bf"
        },

So why do we need get_key_files() when we already have the license expression, as shown in the examples above? Correct me if I am going in the wrong direction.

AyanSinhaMahapatra commented 6 months ago

I'm not looking into the beautifulsoup scan, as that's not mentioned in the issue. I'm only looking at the beartype-0.17.2.json scan results I linked above.

See the summary other_license_expressions:

"summary": {
    "declared_license_expression": "mit",
    "license_clarity_score": {
      "score": 0,
      "declared_license": false,
      "identification_precision": false,
      "has_license_text": false,
      "declared_copyrights": false,
      "conflicting_license_categories": false,
      "ambiguous_compound_licensing": true
    },
    "declared_holder": "Beartype",
    "primary_language": "Python",
    "other_license_expressions": [
      {
        "value": "unknown-license-reference",
        "count": 230
      },

Almost all the .py files in this wheel have an unknown-license-reference detected in them (even though in the MANIFEST file license detections we are able to resolve the unknown references successfully).
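To check how many files are affected in a given scan, the file-level expressions can be tallied directly from the JSON output. This is a rough sketch: it assumes the scan has a top-level `files` list whose entries carry a `detected_license_expression` field, which is not shown in the excerpts in this thread. `count_file_expressions()` is a hypothetical helper name.

```python
from collections import Counter


def count_file_expressions(scan):
    """Count file-level license expressions in a loaded scan.

    ``scan`` is the parsed JSON as a dict. The ``files`` and
    ``detected_license_expression`` keys are assumptions about the
    output layout. Files without a detection are skipped.
    """
    counts = Counter()
    for resource in scan.get("files", []):
        expression = resource.get("detected_license_expression")
        if expression:
            counts[expression] += 1
    return counts
```

Something like `count_file_expressions(json.load(open("beartype-0.17.2.json")))` would then give one count per distinct expression. Whole expressions are used as keys, so a compound expression that contains unknown-license-reference is counted separately from the bare one.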

Also see this PR for reference: https://github.com/nexB/scancode-toolkit/pull/3315 and how we have this implemented for Maven at https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/maven.py#L223 . This is also used to calculate the summary, by the way: get_field_values_from_codebase_resources at https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/score.py#L308 uses is_key_file, which is calculated with the help of https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/classify_plugin.py#L140

We have to look at all datafile handlers and see where we have examples of manifest files (and LICENSE files) that are not present at the root.

Here, if we want to populate the other_license_expressions field from the other license expressions found in all package files, these unknown-license-reference values would end up there. We also have this many unknowns detected in the file license expressions, which we need to fix as well.

swastkk commented 6 months ago

I get what you said, but I am scanning a similar Python package, betse-1.3.0-py3-none-any.whl, which has the same header issue:

#!/usr/bin/env python3
# ....................{ LICENSE                           }....................
# Copyright 2014-2022 by Alexis Pietak & Cecil Curry.
# See "LICENSE" for further details.

# ....................{ IMPORTS                           }....................

but the scan output does not report any unknown-license-references in other_license_expressions. The output can be seen at https://0x0.st/XPwQ.json, or here:

"summary": {
    "declared_license_expression": "bsd-new AND bsd-simplified",
    "license_clarity_score": {
      "score": 0,
      "declared_license": false,
      "identification_precision": false,
      "has_license_text": false,
      "declared_copyrights": false,
      "conflicting_license_categories": false,
      "ambiguous_compound_licensing": true
    },
    "declared_holder": "",
    "primary_language": "Python",
    "other_license_expressions": [
      {
        "value": null,
        "count": 1
      }
    ],
    "other_holders": [
      {
        "value": null,
        "count": 1
      }
    ],
    "other_languages": [
      {
        "value": null,
        "count": 1
      }
    ]
  },

The scan directory structure looks similar to that of the beartype package mentioned above:

.
├── betse
│   ├── __init__.py
│   ├── __main__.py
│   ├── appmeta.py
│   ├── cli
│   ├── data
│   ├── exceptions.py
│   ├── gui
│   ├── lib
│   ├── metadata.py
│   ├── metadeps.py
│   ├── science
│   └── util
└── betse-1.3.0.dist-info
    ├── LICENSE
    ├── METADATA
    ├── RECORD
    ├── WHEEL
    ├── entry_points.txt
    └── top_level.txt