aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

The new `from_file` field contains the name of the input directory even with `--strip-root` #3712

Open sschuberth opened 6 months ago

sschuberth commented 6 months ago

Description

I'd expect the value of the new from_file field inside the JSON's matches object to be a path relative to the input directory (if --strip-root is given). Instead, it is a relative path, but it contains the name of the input directory.

How To Reproduce

Using ScanCode 32.1.0, run a scan on the files mentioned at #3648. This gets you

  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "32.1.0",
      "options": {
        "input": [
          "/home/sebastian/Downloads/files"
        ],
        "--copyright": true,
        "--info": true,
        "--json-pp": "scancode-result.json",
        "--license": true,
        "--strip-root": true,
        "--timeout": "300.0"
      },
  "files": [
    {
      "path": "COPYING",
      "type": "file",
      "name": "COPYING",
      "base_name": "COPYING",
      "extension": "",
      "size": 2775,
      "date": "2024-01-29",
      "sha1": "66933e63e70616b43f1dc60340491f8e050eedfd",
      "md5": "97d554a32881fee0aa283d96e47cb24a",
      "sha256": "bcb02973ef6e87ea73d331b3a80df7748407f17efdb784b61b47e0e610d3bb5c",
      "mime_type": "text/plain",
      "file_type": "ASCII text",
      "programming_language": null,
      "is_binary": false,
      "is_text": true,
      "is_archive": false,
      "is_media": false,
      "is_source": false,
      "is_script": false,
      "detected_license_expression": "public-domain AND lgpl-2.1-plus AND gpl-2.0-plus AND (public-domain AND gpl-2.0-plus AND gpl-3.0-plus) AND (other-permissive AND other-copyleft) AND public-domain-disclaimer AND lgpl-2.1",
      "detected_license_expression_spdx": "LicenseRef-scancode-public-domain AND LGPL-2.1-or-later AND GPL-2.0-or-later AND (LicenseRef-scancode-public-domain AND GPL-2.0-or-later AND GPL-3.0-or-later) AND (LicenseRef-scancode-other-permissive AND LicenseRef-scancode-other-copyleft) AND LicenseRef-scancode-public-domain-disclaimer AND LGPL-2.1-only",
      "license_detections": [
        {
          "license_expression": "public-domain AND lgpl-2.1-plus AND gpl-2.0-plus AND (public-domain AND gpl-2.0-plus AND gpl-3.0-plus) AND (other-permissive AND other-copyleft) AND public-domain-disclaimer AND lgpl-2.1",
          "license_expression_spdx": "LicenseRef-scancode-public-domain AND LGPL-2.1-or-later AND GPL-2.0-or-later AND (LicenseRef-scancode-public-domain AND GPL-2.0-or-later AND GPL-3.0-or-later) AND (LicenseRef-scancode-other-permissive AND LicenseRef-scancode-other-copyleft) AND LicenseRef-scancode-public-domain-disclaimer AND LGPL-2.1-only",
          "matches": [
            {
              "license_expression": "public-domain",
              "spdx_license_expression": "LicenseRef-scancode-public-domain",
              "from_file": "files/COPYING",

So compare

"path": "COPYING",

to

"from_file": "files/COPYING",

which makes it unnecessarily hard to check whether a given finding is just a reference to another file, since that check amounts to comparing whether both fields point to the same path.
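As a stopgap, the comparison can be done by normalizing `from_file` first. The following is only a minimal sketch, assuming that `from_file` is prefixed with the last component of `headers[0].options.input[0]`, as observed in the 32.1.0 output above. The helper names `strip_input_dir_name` and `is_reference` are hypothetical and not part of ScanCode.

```python
from pathlib import PurePosixPath


def strip_input_dir_name(from_file, input_path):
    """Drop the leading input-directory name from a from_file value, if present.

    Assumption: from_file looks like "<input dir name>/<relative path>",
    which is what the scan in this issue produced.
    """
    root_name = PurePosixPath(input_path).name
    parts = PurePosixPath(from_file).parts
    if parts and parts[0] == root_name:
        parts = parts[1:]
    return str(PurePosixPath(*parts)) if parts else ""


def is_reference(file_path, from_file, input_path):
    """Return True if a match originates from a file other than file_path."""
    return strip_input_dir_name(from_file, input_path) != file_path


# Tiny in-memory sample mirroring the excerpt above (not a full scan result).
scan = {
    "headers": [{"options": {"input": ["/home/sebastian/Downloads/files"]}}],
    "files": [
        {
            "path": "COPYING",
            "license_detections": [
                {"matches": [{"from_file": "files/COPYING"}]},
            ],
        },
    ],
}

input_path = scan["headers"][0]["options"]["input"][0]
for scanned_file in scan["files"]:
    for detection in scanned_file["license_detections"]:
        for match in detection["matches"]:
            print(
                scanned_file["path"],
                match["from_file"],
                is_reference(scanned_file["path"], match["from_file"], input_path),
            )
```

Note that this would misbehave once the prefix is no longer emitted and the scanned tree contains a top-level directory with the same name as the input directory, so it is a workaround rather than a fix.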

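I propose to either:

  • apply a fix for both fields to contain comparable paths,
  • only add from_file at all if it's pointing to a different file than path,
  • add a dedicated is_reference field to yet more easily filter out findings that are just references to other scanned files.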

System configuration

AyanSinhaMahapatra commented 6 months ago

@sschuberth thanks for the bug report!

> apply a fix for both fields to contain comparable paths,

I think this would be the cleanest approach here, since we have already considered implementing the other two options you mentioned:

  • only add from_file at all if it's pointing to a different file than path,
  • add a dedicated is_reference field to yet more easily filter out findings that are just references to other scanned files.

After a discussion with @pombredanne, we decided to implement the from_file attribute the way it is today: every match is populated with the file path it originated from, regardless of whether that is the current file or a different file.

pombredanne commented 6 months ago

@sschuberth In hindsight, I have always had reservations about the --strip-root option, and I wonder whether it should be removed entirely.

sschuberth commented 6 months ago

Please keep in mind that when sharing scan results it is actually convenient to have only relative paths in the JSON (to make results somewhat "relocatable"). While you probably could make them relative WRT headers.options.input, users might also feel uncomfortable with disclosing the directory structure where they store source code, as it could reveal things about the origin in the names.

So, personally I'm not against removing --strip-root, but then the behavior should change to be as if --strip-root were always specified, and only relative paths should ever be used, IMO.
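For illustration, making paths relative to `headers.options.input` after the fact could look roughly like the sketch below. It assumes a result that contains absolute POSIX-style paths, and the helper name `relativize` is hypothetical, not something ScanCode provides.

```python
from pathlib import PurePosixPath


def relativize(path, input_path):
    """Return path relative to input_path if it is absolute and located below it.

    Paths that are already relative, or that lie outside input_path,
    are returned unchanged.
    """
    candidate = PurePosixPath(path)
    if not candidate.is_absolute():
        return path
    try:
        return str(candidate.relative_to(PurePosixPath(input_path)))
    except ValueError:
        # The path is not below the input directory, so leave it alone.
        return path


print(relativize("/home/sebastian/Downloads/files/COPYING",
                 "/home/sebastian/Downloads/files"))
```

This only rewrites the paths themselves. The absolute input directory would still be visible in `headers.options.input` unless that value is redacted as well, which is the disclosure concern raised above.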