aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Improve Ruby Package Ecosystem/Datafile Handler to tag key_files properly #3881

Open swastkk opened 1 month ago

swastkk commented 1 month ago

Description

The Ruby package ecosystem datafile handler fails to tag key_files properly. As a result, attributes at the Package level are not populated correctly, which in turn affects the license_clarity_score.

Example

While scanning https://github.com/inspec/inspec/archive/refs/tags/v6.8.2.zip , I got a license_clarity_score of 0. The LICENSE file is at inspec-bin/LICENSE rather than at the root, and it is not tagged as a key file (is_key_file is false):

{
      "path": "inspec-6.8.2.tar.gz-extract/inspec-6.8.2/inspec-bin/LICENSE",
      "type": "file",
      "name": "LICENSE",
      "base_name": "LICENSE",
      "extension": "",
      "size": 590,
      "date": "2024-08-02",
      "sha1": "f7fbb40d12aae4849b657cc27937e3a0f2b3dbad",
      "md5": "81b0e16be045534c5330969d1e542bb4",
      "sha256": "7f93f3fbf47c2b8129a7c1524f2fc9ed0b18e8cd0d21ab8f66dad6928ce43172",
      "mime_type": "text/plain",
      "file_type": "ASCII text",
      "programming_language": null,
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "package_data": [],
      "for_packages": [],
      "is_legal": true,
      "is_manifest": false,
      "is_readme": false,
      "is_top_level": false,
      "is_key_file": false,
      "detected_license_expression": "apache-2.0",
      "detected_license_expression_spdx": "Apache-2.0",
      "license_detections": [
        {
          "license_expression": "apache-2.0",
          "license_expression_spdx": "Apache-2.0",
          "matches": [
            {
              "license_expression": "apache-2.0",
              "spdx_license_expression": "Apache-2.0",
              "from_file": "inspec-6.8.2.tar.gz-extract/inspec-6.8.2/inspec-bin/LICENSE",
              "start_line": 3,
              "end_line": 13,
              "matcher": "2-aho",
              "score": 100.0,
              "matched_length": 85,
              "match_coverage": 100.0,
              "rule_relevance": 100,
              "rule_identifier": "apache-2.0_7.RULE",
              "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/apache-2.0_7.RULE",
              "matched_text": "   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.",
              "matched_text_diagnostics": "Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License."
            }
          ],
          "detection_log": [],
          "identifier": "apache_2_0-c4e30bcd-ccfd-bbc3-d2f1-196ab911e47d"
        }
      ],
      "license_clues": [],
      "percentage_of_license_text": 93.41,
      "copyrights": [
        {
          "copyright": "Copyright (c) 2019 Chef Software Inc.",
          "start_line": 1,
          "end_line": 1
        }
      ],
      "holders": [
        {
          "holder": "Chef Software Inc.",
          "start_line": 1,
          "end_line": 1
        }
      ],
      "authors": [],
      "emails": [],
      "urls": [
        {
          "url": "http://www.apache.org/licenses/LICENSE-2.0",
          "start_line": 7,
          "end_line": 7
        }
      ],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },
    {

https://rubygems.org/gems/inspec-bin

Consequently, package attributes such as copyright and holder are not populated properly, and the license_clarity_score is 0.
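To find other files affected in the same way, a small script along these lines could list every legal file that is not tagged as a key file. This is only a sketch: it assumes the scan was saved as JSON with a top-level "files" list whose entries carry the "type", "is_legal" and "is_key_file" fields shown in the excerpt above. The helper name untagged_legal_files and the trimmed sample data are made up for illustration.

```python
def untagged_legal_files(scan):
    """Return the paths of legal files that are not tagged as key files.

    `scan` is a dict loaded from a ScanCode JSON result (for example with
    json.load). Only the "files" list and the per-file fields "type",
    "is_legal" and "is_key_file" are read.
    """
    return [
        entry["path"]
        for entry in scan.get("files", [])
        if entry.get("type") == "file"
        and entry.get("is_legal")
        and not entry.get("is_key_file")
    ]


# Trimmed, illustrative stand-in for a real scan result (not actual output).
sample_scan = {
    "files": [
        {
            "path": "inspec-6.8.2/inspec-bin/LICENSE",
            "type": "file",
            "is_legal": True,
            "is_key_file": False,
        },
        {
            "path": "inspec-6.8.2/lib/inspec.rb",
            "type": "file",
            "is_legal": False,
            "is_key_file": False,
        },
    ]
}

for path in untagged_legal_files(sample_scan):
    print(path)
```

On a real result, the same function can be called on the dict returned by json.load for the scan output file.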

Ripoohann commented 1 month ago

Hi, I am new to contributing and would like to work on this issue. Could you please elaborate?

swastkk commented 1 month ago

Hey @Ripoohann, this issue involves scanning a monorepo that contains several RubyGems packages. As #3792 states, a Package-level summary is to be computed, and as part of that we calculate the license_clarity_score and populate top-level package attributes such as copyright, holder, other_license_expression and notice_text.

The problem shows up in this monorepo, and more generally in the RubyGems package ecosystem: key_files are not tagged properly, and both the license clarity score calculation and the package attribute population depend on that tagging. So we need to implement something in the datafile handler that tags the key_files properly.
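One possible direction, shown here only as a rough standalone sketch and not as ScanCode's actual datafile handler API, is to treat every directory that directly contains a .gemspec file as a package root, and to tag legal, readme and manifest files sitting directly in such a directory as key files. The function name tag_key_files, the plain-dict file records and the sample paths are all hypothetical; a real fix would have to work on ScanCode's own resource and handler classes.

```python
from pathlib import PurePosixPath


def tag_key_files(files):
    """Tag files that sit directly in a gem root as top-level and key files.

    Assumption for this sketch: a gem root is any directory that directly
    contains a *.gemspec file. A file is a key file when it is directly in
    a gem root and is a legal, readme or manifest file.
    """
    gem_roots = {
        str(PurePosixPath(entry["path"]).parent)
        for entry in files
        if entry["path"].endswith(".gemspec")
    }
    for entry in files:
        parent = str(PurePosixPath(entry["path"]).parent)
        is_top_level = parent in gem_roots
        entry["is_top_level"] = is_top_level
        entry["is_key_file"] = is_top_level and bool(
            entry.get("is_legal")
            or entry.get("is_readme")
            or entry.get("is_manifest")
        )
    return files


# Illustrative records only; paths and flags are not taken from a real scan.
files = tag_key_files([
    {"path": "inspec-6.8.2/inspec-bin/inspec-bin.gemspec", "is_manifest": True},
    {"path": "inspec-6.8.2/inspec-bin/LICENSE", "is_legal": True},
    {"path": "inspec-6.8.2/inspec-bin/lib/version.rb"},
])

for entry in files:
    print(entry["path"], entry["is_key_file"])
```

With this rule, a LICENSE in a nested gem directory is tagged relative to its own package rather than relative to the root of the scanned codebase, which is the case the monorepo above runs into.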