aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Treat 'Directories' differently from 'Files' in the JSON output. #1979

Open MankaranSingh opened 4 years ago

MankaranSingh commented 4 years ago

Short Description

The JSON output treats directories and files in the same manner. Because of this, every directory entry in the output carries unnecessary fields such as extension, size, sha1, is_binary, etc., which ultimately make the JSON file unnecessarily large. For example:

 "files": [
    {
      "path": "New Archive",
      "type": "directory",
      "name": "New Archive",
      "base_name": "New Archive",
      "extension": "",
      "size": 0,
      "date": null,
      "sha1": null,
      "md5": null,
      "mime_type": null,
      "file_type": null,
      "programming_language": null,
      "is_binary": false,
      "is_text": false,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 3,
      "dirs_count": 2,
      "size_count": 6554,
      "scan_errors": []
    },
    {
      "path": "New Archive/test.py",
      "type": "file",
      "name": "test.py",
      "base_name": "test",
      "extension": ".py",
      "size": 4894,
      "date": "2020-03-18",
      "sha1": "7996d4021aa514c6bb75fb1ebeca5ea97981e345",
      "md5": "30cd0b59240a03de14d92b76ada8d0c3",
      "mime_type": "text/x-python",
      "file_type": "Python script, ASCII text executable",
      "programming_language": "Python",
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": true,
      "is_script": true,
      "licenses": [
        {
          "key": "apache-2.0",
          "score": 100.0,
          "name": "Apache License 2.0",
          "short_name": "Apache 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
          "spdx_license_key": "Apache-2.0",
          "spdx_url": "https://spdx.org/licenses/Apache-2.0",
          "start_line": 4,
          "end_line": 14,
          "matched_rule": {
            "identifier": "apache-2.0_7.RULE",
            "license_expression": "apache-2.0",
            "licenses": [
              "apache-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": true,
            "is_license_reference": false,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 85,
            "matched_length": 85,
            "match_coverage": 100.0,
            "rule_relevance": 100
          }
        }
      ],
      "license_expressions": [
        "apache-2.0"
      ],
      "copyrights": [
        {
          "value": "Copyright 2016 IBM Corp.",
          "start_line": 2,
          "end_line": 2
        }
      ],
      "holders": [
        {
          "value": "IBM Corp.",
          "start_line": 2,
          "end_line": 2
        }
      ],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [
        {
          "url": "http://www.apache.org/licenses/LICENSE-2.0",
          "start_line": 8,
          "end_line": 8
        }
      ],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },
    {
      "path": "New Archive/Licence 2",
      "type": "directory",
      "name": "Licence 2",
      "base_name": "Licence 2",
      "extension": "",
      "size": 0,
      "date": null,
      "sha1": null,
      "md5": null,
      "mime_type": null,
      "file_type": null,
      "programming_language": null,
      "is_binary": false,
      "is_text": false,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 2,
      "dirs_count": 1,
      "size_count": 1660,
      "scan_errors": []
    },
    {
      "path": "New Archive/Licence 2/licence.txt",
      "type": "file",
      "name": "licence.txt",
      "base_name": "licence",
      "extension": ".txt",
      "size": 1190,
      "date": "2020-03-07",
      "sha1": "a7e1576cd85e2f8f85f2204d18627ed95c593f41",
      "md5": "d0b5bd6099693fbd565d7b979cd2842e",
      "mime_type": "text/plain",
      "file_type": "Big-endian UTF-16 Unicode text, with very long lines, with no line terminators",
      "programming_language": null,
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },
    {
      "path": "New Archive/Licence 2/licence 3",
      "type": "directory",
      "name": "licence 3",
      "base_name": "licence 3",
      "extension": "",
      "size": 0,
      "date": null,
      "sha1": null,
      "md5": null,
      "mime_type": null,
      "file_type": null,
      "programming_language": null,
      "is_binary": false,
      "is_text": false,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 1,
      "dirs_count": 0,
      "size_count": 470,
      "scan_errors": []
    },
    {
      "path": "New Archive/Licence 2/licence 3/hack1.py",
      "type": "file",
      "name": "hack1.py",
      "base_name": "hack1",
      "extension": ".py",
      "size": 470,
      "date": "2020-03-07",
      "sha1": "8122e79076d00f94b3e4101847132d2968acee42",
      "md5": "b3d286bd64003dd795906bfdeaf346bb",
      "mime_type": "text/plain",
      "file_type": "UTF-8 Unicode text, with CRLF line terminators",
      "programming_language": "Python",
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": true,
      "is_script": false,
      "licenses": [
        {
          "key": "gpl-1.0-plus",
          "score": 5.0,
          "name": "GNU General Public License 1.0 or later",
          "short_name": "GPL 1.0 or later",
          "category": "Copyleft",
          "is_exception": false,
          "owner": "Free Software Foundation (FSF)",
          "homepage_url": "http://www.gnu.org/licenses/old-licenses/gpl-1.0-standalone.html",
          "text_url": "http://www.gnu.org/licenses/old-licenses/gpl-1.0-standalone.html",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:gpl-1.0-plus",
          "spdx_license_key": "GPL-1.0-or-later",
          "spdx_url": "https://spdx.org/licenses/GPL-1.0-or-later",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "gpl_bare_word_only.RULE",
            "license_expression": "gpl-1.0-plus",
            "licenses": [
              "gpl-1.0-plus"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 1,
            "matched_length": 1,
            "match_coverage": 100.0,
            "rule_relevance": 5
          }
        }
      ],
      "license_expressions": [
        "gpl-1.0-plus"
      ],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    }
  ]

It would be a lot cleaner and more efficient to have two separate fields, i.e. files and directories, so the output would look like:

 "directories": [
    {
      "path": "New Archive",
      "files_count": 3,
      "dirs_count": 2,
      "size_count": 6554
    },
    {
      "path": "New Archive/Licence 2",
      "files_count": 2,
      "dirs_count": 1,
      "size_count": 1660
    },
    {
      "path": "New Archive/Licence 2/licence 3",
      "files_count": 1,
      "dirs_count": 0,
      "size_count": 470
    }
  ],
  "files": [
    {
      "path": "New Archive/test.py",
      "type": "file",
      "name": "test.py",
      "base_name": "test",
      "extension": ".py",
      "size": 4894,
      "date": "2020-03-18",
      "sha1": "7996d4021aa514c6bb75fb1ebeca5ea97981e345",
      "md5": "30cd0b59240a03de14d92b76ada8d0c3",
      "mime_type": "text/x-python",
      "file_type": "Python script, ASCII text executable",
      "programming_language": "Python",
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": true,
      "is_script": true,
      "licenses": [
        {
          "key": "apache-2.0",
          "score": 100.0,
          "name": "Apache License 2.0",
          "short_name": "Apache 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
          "spdx_license_key": "Apache-2.0",
          "spdx_url": "https://spdx.org/licenses/Apache-2.0",
          "start_line": 4,
          "end_line": 14,
          "matched_rule": {
            "identifier": "apache-2.0_7.RULE",
            "license_expression": "apache-2.0",
            "licenses": [
              "apache-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": true,
            "is_license_reference": false,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 85,
            "matched_length": 85,
            "match_coverage": 100.0,
            "rule_relevance": 100
          }
        }
      ],
      "license_expressions": [
        "apache-2.0"
      ],
      "copyrights": [
        {
          "value": "Copyright 2016 IBM Corp.",
          "start_line": 2,
          "end_line": 2
        }
      ],
      "holders": [
        {
          "value": "IBM Corp.",
          "start_line": 2,
          "end_line": 2
        }
      ],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [
        {
          "url": "http://www.apache.org/licenses/LICENSE-2.0",
          "start_line": 8,
          "end_line": 8
        }
      ],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },
    {
      "path": "New Archive/Licence 2/licence.txt",
      "type": "file",
      "name": "licence.txt",
      "base_name": "licence",
      "extension": ".txt",
      "size": 1190,
      "date": "2020-03-07",
      "sha1": "a7e1576cd85e2f8f85f2204d18627ed95c593f41",
      "md5": "d0b5bd6099693fbd565d7b979cd2842e",
      "mime_type": "text/plain",
      "file_type": "Big-endian UTF-16 Unicode text, with very long lines, with no line terminators",
      "programming_language": null,
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "licenses": [],
      "license_expressions": [],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    },
    {
      "path": "New Archive/Licence 2/licence 3/hack1.py",
      "type": "file",
      "name": "hack1.py",
      "base_name": "hack1",
      "extension": ".py",
      "size": 470,
      "date": "2020-03-07",
      "sha1": "8122e79076d00f94b3e4101847132d2968acee42",
      "md5": "b3d286bd64003dd795906bfdeaf346bb",
      "mime_type": "text/plain",
      "file_type": "UTF-8 Unicode text, with CRLF line terminators",
      "programming_language": "Python",
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": true,
      "is_script": false,
      "licenses": [
        {
          "key": "gpl-1.0-plus",
          "score": 5.0,
          "name": "GNU General Public License 1.0 or later",
          "short_name": "GPL 1.0 or later",
          "category": "Copyleft",
          "is_exception": false,
          "owner": "Free Software Foundation (FSF)",
          "homepage_url": "http://www.gnu.org/licenses/old-licenses/gpl-1.0-standalone.html",
          "text_url": "http://www.gnu.org/licenses/old-licenses/gpl-1.0-standalone.html",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:gpl-1.0-plus",
          "spdx_license_key": "GPL-1.0-or-later",
          "spdx_url": "https://spdx.org/licenses/GPL-1.0-or-later",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "gpl_bare_word_only.RULE",
            "license_expression": "gpl-1.0-plus",
            "licenses": [
              "gpl-1.0-plus"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 1,
            "matched_length": 1,
            "match_coverage": 100.0,
            "rule_relevance": 5
          }
        }
      ],
      "license_expressions": [
        "gpl-1.0-plus"
      ],
      "copyrights": [],
      "holders": [],
      "authors": [],
      "packages": [],
      "emails": [],
      "urls": [],
      "files_count": 0,
      "dirs_count": 0,
      "size_count": 0,
      "scan_errors": []
    }
  ]
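
As a rough illustration of the proposed layout, the split can be sketched as a post-processing step over an existing scan result. This is only a sketch under assumptions: `split_directories` is a hypothetical helper and not part of ScanCode, the scan is assumed to be already loaded as a Python dict with a top-level "files" list, and the four keys kept for each directory are simply the ones shown in the proposed output above.

```python
import json

# Hypothetical helper, not part of ScanCode. The keys kept for each
# directory entry are the ones from the proposed output above.
DIRECTORY_KEYS = ("path", "files_count", "dirs_count", "size_count")


def split_directories(scan):
    """Return a new dict where directory entries live in a "directories" list.

    `scan` is assumed to be a dict with a top-level "files" list whose
    entries have a "type" of either "file" or "directory".
    The input dict is not modified.
    """
    directories = []
    files = []
    for resource in scan.get("files", []):
        if resource.get("type") == "directory":
            # Keep only the fields that are meaningful for a directory.
            directories.append({key: resource.get(key) for key in DIRECTORY_KEYS})
        else:
            files.append(resource)

    # Preserve every other top-level key as-is.
    result = {key: value for key, value in scan.items() if key != "files"}
    result["directories"] = directories
    result["files"] = files
    return result


if __name__ == "__main__":
    # Tiny made-up sample shaped like the output above.
    sample = {
        "files": [
            {"path": "New Archive", "type": "directory", "extension": "",
             "files_count": 3, "dirs_count": 2, "size_count": 6554},
            {"path": "New Archive/test.py", "type": "file", "size": 4894,
             "files_count": 0, "dirs_count": 0, "size_count": 0},
        ]
    }
    print(json.dumps(split_directories(sample), indent=2))
```

Doing this outside the scanner leaves the existing "files" format untouched for tools that already depend on it.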

Can you help with this Feature

Yes, I am working on a pull request for this issue.

steven-esser commented 4 years ago

@MankaranSingh Personally I do not see the value in this currently and have a feeling it may introduce bugs both in scancode and other tools. I could be wrong about this, but there is little benefit to be gained anyway. File size issues related to this disappear when you compress the file.

@pombredanne may have a different take.
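
The compression point can be checked on any given scan by comparing the raw and gzip-compressed sizes of the JSON. The following is a minimal sketch using only the Python standard library; `json_sizes` is a hypothetical helper name, and the sample data is made up to mimic many near-identical directory entries.

```python
import gzip
import json


def json_sizes(data):
    """Return (raw_bytes, gzip_bytes) for `data` serialized as compact JSON."""
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


if __name__ == "__main__":
    # Made-up sample: 1000 directory entries that differ only by path.
    directory = {
        "path": "", "type": "directory", "extension": "", "size": 0,
        "sha1": None, "md5": None, "licenses": [], "files_count": 0,
    }
    scan = {"files": [dict(directory, path=f"dir{i}") for i in range(1000)]}
    raw, packed = json_sizes(scan)
    print(f"raw: {raw} bytes, gzip: {packed} bytes")
```

Running this on a real scan, with and without the directory entries trimmed, would show how much of the size difference survives compression.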

MankaranSingh commented 4 years ago

Hmm, it may cause issues for people already using this in production and for some other tools. Still, the output size may cause problems when scanning very large codebases.