aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Duplicates in license detection result #2170

Open qduanmu opened 4 years ago

qduanmu commented 4 years ago

The content of the scanned file is: { "ZPL-2.0", new LicenseData(licenseID: "ZPL-2.0", isOsiApproved: true, isDeprecatedLicenseId: false, isFsfLibre: true) }

The license detection result is below:

    {
      "path": "test_file",
      "type": "file",
      "licenses": [
        {
          "key": "zpl-2.0",
          "score": 50.0,
          "name": "Zope Public License 2.0",
          "short_name": "ZPL 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Zope Community",
          "homepage_url": "http://www.zope.org/Resources/License/",
          "text_url": "http://www.zope.org/Resources/License/",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:zpl-2.0",
          "spdx_license_key": "ZPL-2.0",
          "spdx_url": "https://spdx.org/licenses/ZPL-2.0",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "spdx_license_id_zpl-2.0_for_zpl-2.0.RULE",
            "license_expression": "zpl-2.0",
            "licenses": [
              "zpl-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 3,
            "matched_length": 3,
            "match_coverage": 100.0,
            "rule_relevance": 50.0
          }
        },
        {
          "key": "zpl-2.0",
          "score": 50.0,
          "name": "Zope Public License 2.0",
          "short_name": "ZPL 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Zope Community",
          "homepage_url": "http://www.zope.org/Resources/License/",
          "text_url": "http://www.zope.org/Resources/License/",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:zpl-2.0",
          "spdx_license_key": "ZPL-2.0",
          "spdx_url": "https://spdx.org/licenses/ZPL-2.0",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "spdx_license_id_zpl-2.0_for_zpl-2.0.RULE",
            "license_expression": "zpl-2.0",
            "licenses": [
              "zpl-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 3,
            "matched_length": 3,
            "match_coverage": 100.0,
            "rule_relevance": 50.0
          }
        }
      ],
      "license_expressions": [
        "zpl-2.0",
        "zpl-2.0"
      ],
      "scan_errors": []
    }
pombredanne commented 4 years ago

@qduanmu Hello and thank you again! I hope everything is OK for you!

If you run the scan with --license --license-text --license-text-diagnostics, the text matched in each case is "matched_text": "ZPL-2.0\",". There are two instances of it in the file, so there are two detections alright. We could:

  1. create a rule with "ZPL-2.0", new LicenseData(licenseID: "ZPL-2.0" but that would be weird as there are likely many more cases like that

  2. create a false positive or a negative rule for most of the content of your file (I assume this is coming from https://github.com/NuGet/NuGet.Client/blob/7bf0d060f3f1a680121ac17dbda01e6b15ef3b54/src/NuGet.Core/NuGet.Packaging/Licenses/NuGetLicenseData.cs ) but that would also be quite unwieldy

  3. design something new to match these few cases of code that contains a lot of licenses that are NOT the licenses of the code, such as the one you have an issue with and many others such as https://github.com/jslicense/spdx-exceptions.json/blob/master/index.json or ... for instance scancode itself.

Both 1. and 2. would be quick fixes but would not be viable for the long term. I tend to think 3. is a better but harder approach. What do you think?

hesa commented 3 years ago

Jumping in a bit late. Stumbled on a file, gen.go, yesterday. There are two license texts in the file. Scancode (3.2.3 with -clipe) reports the following for this file:

     .....
      "license_expressions": [
        "apache-2.0",
        "apache-2.0"
      ],
     .....
      "copyrights": [
        {
          "value": "Copyright 2019 The Wuffs Authors",
          "start_line": 1,
          "end_line": 1
        },
        {
          "value": "Copyright 2019 The Wuffs Authors",
          "start_line": 58,
          "end_line": 58
        }
      ],
      "holders": [
        {
          "value": "The Wuffs Authors",
          "start_line": 1,
          "end_line": 1
        },
        {
          "value": "The Wuffs Authors",
          "start_line": 58,
          "end_line": 58
        }
      ],
     ....

So, license_expressions, copyrights and holders are all stated (by the authors) and reported (by Scancode) twice.

I am not sure that verbatim copyright and/or license statements that appear multiple times in a file should be reported as one. OK, I admit the reason I looked at this issue was because I thought it was something spooky with Scancode, and I did spend some time checking my scancode report analyser for errors.

Perhaps it simply should be up to the user (machine or human) to discard duplicate entries?

pombredanne commented 3 years ago

@hesa Hey! :wave: So I think this is a good case for having a simplification here. There are two notices alright and scancode detects them all correctly, but a post-processing step would do nicely!
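Such a post-processing step could look roughly like the sketch below. It is only an illustration and not part of ScanCode: `unique_by` and `dedupe_file_entry` are hypothetical names, and the definition of a duplicate (same license key and rule, or same copyright/holder value, regardless of line numbers) is an assumption that may not suit every report.

```python
def unique_by(items, key_func):
    """Keep the first item for each key, preserving the original order."""
    seen = set()
    unique = []
    for item in items:
        key = key_func(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def dedupe_file_entry(entry):
    """Return a copy of one file entry from a scan result with repeats collapsed.

    Assumption: two detections are duplicates when they share the same value,
    even if they were found on different lines. The input is not modified.
    """
    result = dict(entry)
    if "license_expressions" in entry:
        result["license_expressions"] = unique_by(
            entry["license_expressions"], lambda expression: expression
        )
    if "licenses" in entry:
        result["licenses"] = unique_by(
            entry["licenses"],
            lambda lic: (lic["key"], lic["matched_rule"]["identifier"]),
        )
    for field in ("copyrights", "holders"):
        if field in entry:
            result[field] = unique_by(entry[field], lambda item: item["value"])
    return result


if __name__ == "__main__":
    # Values taken from the gen.go report quoted earlier in this thread.
    entry = {
        "license_expressions": ["apache-2.0", "apache-2.0"],
        "holders": [
            {"value": "The Wuffs Authors", "start_line": 1, "end_line": 1},
            {"value": "The Wuffs Authors", "start_line": 58, "end_line": 58},
        ],
    }
    print(dedupe_file_entry(entry))
```

Keeping the first occurrence means the surviving entry still carries a line number, but the location of the later notice is lost, which is one reason to leave this step to the consumer of the report.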

Unrelated: Maybe you should run the latest version? 3.2.3 is getting old!

hesa commented 3 years ago

I think I would prefer to do this post processing myself (i.e. let scancode report the two instances). So, for me, this issue can be closed.

Re unrelated :)

ScanCode version 21.3.31
qduanmu commented 3 years ago

Thank you for your quick response, @pombredanne, hope everything goes well with you! I haven't worked on this for quite a long time (I may be back in the near future), so I need to check the latest scancode first.

> Both 1. and 2. would be quick fixes but would not be viable for the long term. I tend to think 3. is a better but harder approach. What do you think?

I second proposal 3.: design something new (like a regex pattern/rule for the above files; yes, this is a hard approach for files like https://github.com/jslicense/spdx-exceptions.json/blob/master/index.json) to filter out their license matches as false positives, or even skip scanning such files. I will see if I can provide some more feedback after checking the latest update.
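To make the idea a bit more concrete, here is a rough sketch of one possible heuristic, applied to a file entry after scanning. Everything in it is an assumption for illustration: the function names and both thresholds are hypothetical, and nothing like this exists in ScanCode as far as this thread shows. It only relies on fields visible in the JSON result above (key, is_license_reference, is_license_tag, rule_length).

```python
def looks_like_license_list(file_entry, min_distinct_keys=10, max_rule_length=5):
    """Guess whether a file merely enumerates licenses instead of declaring its own.

    Hypothetical heuristic: the file mentions many distinct license keys, and
    every match is a short reference or tag, never a license text or notice.
    Both thresholds are arbitrary placeholders, not tuned values.
    """
    detections = file_entry.get("licenses", [])
    distinct_keys = {detection["key"] for detection in detections}
    if len(distinct_keys) < min_distinct_keys:
        return False
    for detection in detections:
        rule = detection["matched_rule"]
        is_short_mention = rule["is_license_reference"] or rule["is_license_tag"]
        if not is_short_mention or rule["rule_length"] > max_rule_length:
            return False
    return True


def drop_license_list_detections(file_entry, **thresholds):
    """Return a copy without license detections if the file looks like a license list.

    Entries that do not match the heuristic are returned unchanged.
    """
    if not looks_like_license_list(file_entry, **thresholds):
        return file_entry
    cleaned = dict(file_entry)
    cleaned["licenses"] = []
    cleaned["license_expressions"] = []
    return cleaned
```

A single full license text or notice in the file makes the heuristic back off, so a source file that carries its own license header next to a table of license identifiers would keep all of its detections. The one-line test file from the top of this issue would not be filtered either, since it only mentions one license key.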

pombredanne commented 3 years ago

@qduanmu Hey :wave: !

> hope everything goes well with you!

Thank you and yes, A-OK here ... and I hope for you too. At the moment I think I went with 2. and several false positive rules were added, but that's not a satisfying solution for the long term. At least https://raw.githubusercontent.com/jslicense/spdx-exceptions.json/master/index.json now reports that no licenses are detected.

https://raw.githubusercontent.com/NuGet/NuGet.Client/7bf0d060f3f1a680121ac17dbda01e6b15ef3b54/src/NuGet.Core/NuGet.Packaging/Licenses/NuGetLicenseData.cs is still problematic though

@AyanSinhaMahapatra ^ FYI, you may have another idea for this issue?