aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Code wrongly detected as gpl-1.0 #2371

Open xu1119 opened 3 years ago

xu1119 commented 3 years ago

When I scan this file with the latest scancode, I get the following license match. The file is from https://github.com/zyq8709/DexHunter/blob/master/dalvik/vm/compiler/codegen/x86/AnalysisO1.cpp:

{
  "key": "gpl-1.0",
  "score": 100.0,
  "name": "GNU General Public License 1.0",
  "short_name": "GPL 1.0",
  "category": "Copyleft",
  "is_exception": false,
  "owner": "Free Software Foundation (FSF)",
  "homepage_url": "http://www.gnu.org/licenses/gpl-1.0.html",
  "text_url": "http://www.gnu.org/licenses/gpl-1.0.txt",
  "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:gpl-1.0",
  "spdx_license_key": "GPL-1.0-only",
  "spdx_url": "https://spdx.org/licenses/GPL-1.0-only",
  "start_line": 1606,
  "end_line": 1606,
  "matched_rule": {
    "identifier": "gpl-1.0_15.RULE",
    "license_expression": "gpl-1.0",
    "licenses": [
      "gpl-1.0"
    ],
    "is_license_text": false,
    "is_license_notice": false,
    "is_license_reference": false,
    "is_license_tag": true,
    "matcher": "2-aho",
    "rule_length": 2,
    "matched_length": 2,
    "match_coverage": 100.0,
    "rule_relevance": 100.0
  },
  "matched_text": "            currentBB->xferPoints[currentBB->num_xfer_points].vr_gpl = -1;"
}

Description

Source code wrongly detected as gpl-1.0

How To Reproduce

scancode -li --license-text --json-pp - AnalysisO1.cpp

System configuration

pombredanne commented 3 years ago

@xu1119 Thanks! @AyanSinhaMahapatra would your new plugin be able to spot this?

AyanSinhaMahapatra commented 3 years ago

@pombredanne No (and yes). It definitely should have caught this, so this was a good find.

  1. In most of the false positives I saw before, the common factor was a rule_length of 1: the text matched a very simple rule containing just the name of the license, such as gpl. This one, however, matched gpl-1.0_15.RULE, whose text is gpl 1, so its rule_length is 2. The preliminary step for separating probable false positives was "is_license_tag" == true and "rule_length" == 1, followed by a classifier to decide more accurately, so this match slipped through. We definitely need a more explicit step: go through all the scancode license_tag rules and see which ones could produce false-positive matches, then either raise the "rule_length" criterion so these cases are analyzed too, or maintain a set of rules known to generate potential false positives. I am adding a ticket now to do this.

  2. The sentence classifier step, i.e. the false_positive vs. license_tag NLP classifier, does detect this correctly, so that part works. The preliminary step only selects matches with "rule_length" == 1 because the assumption was that false positives come only from those rules, which saves passing every license_tag match through the classifier. There are clearly exceptions to that assumption, like this case, and we should be able to detect them.
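To make the pre-filter described above concrete, here is a minimal sketch. The function name, the `max_rule_length` parameter, and the default threshold of 2 are assumptions for illustration only, not the analyzer plugin's actual code. The field names come from the scan output quoted in this issue.

```python
# Hypothetical sketch of the preliminary filter; not the plugin's real API.
MAX_SUSPECT_RULE_LENGTH = 2  # assumption: raised from 1 so rules like "gpl 1" qualify


def is_probable_false_positive(match, max_rule_length=MAX_SUSPECT_RULE_LENGTH):
    """Return True if a license match should be passed on to the classifier."""
    rule = match.get("matched_rule", {})
    length = rule.get("rule_length")
    return (
        bool(rule.get("is_license_tag"))
        and length is not None
        and length <= max_rule_length
    )


# Trimmed copy of the match reported in this issue.
match = {
    "key": "gpl-1.0",
    "start_line": 1606,
    "matched_rule": {
        "identifier": "gpl-1.0_15.RULE",
        "is_license_tag": True,
        "rule_length": 2,
    },
    "matched_text": "currentBB->xferPoints[currentBB->num_xfer_points].vr_gpl = -1;",
}

# With the original criterion (rule_length == 1) the match is not selected.
missed = is_probable_false_positive(match, max_rule_length=1)
# With the threshold raised to 2 it is selected for the classifier.
caught = is_probable_false_positive(match)
print(missed, caught)
```

The other option mentioned above, a maintained set of rule identifiers known to produce false positives, could replace or complement the length threshold.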

Thanks @xu1119

AyanSinhaMahapatra commented 3 years ago

Also @pombredanne, there is an open ticket for an extra heuristic you suggested, at nexB/scancode-results-analyzer#29. Implementing it (without the single-word restriction, to make things more explicit as discussed above) would also detect this case, since "start_line" is 1606.
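The thread does not spell out the heuristic from that ticket. Assuming it flags license-tag matches that start far below the top of a file, a rough sketch could look like the following. The function name and the cutoff of 1000 lines are hypothetical and not taken from the ticket.

```python
# Hypothetical sketch of a position-based heuristic; the cutoff is an assumption.
LATE_LINE_THRESHOLD = 1000


def is_late_license_tag(match, threshold=LATE_LINE_THRESHOLD):
    """Return True for license-tag matches that start deep inside a file."""
    rule = match.get("matched_rule", {})
    return bool(rule.get("is_license_tag")) and match.get("start_line", 0) > threshold


# The match from this issue starts at line 1606 of AnalysisO1.cpp.
reported = {"start_line": 1606, "matched_rule": {"is_license_tag": True}}
# A tag near the top of a file, where license tags usually appear.
header_tag = {"start_line": 3, "matched_rule": {"is_license_tag": True}}

print(is_late_license_tag(reported), is_late_license_tag(header_tag))
```

A check like this would only mark candidates for the classifier, since a real license tag can also appear late in a file.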