aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary and/or not on a single line and/or in gibberish #2403

Open pombredanne opened 3 years ago

pombredanne commented 3 years ago

gPL and similar are a source of noisy false positives. @AyanSinhaMahapatra what's your take there?

AyanSinhaMahapatra commented 3 years ago

Philippe, yes, uppercase/lowercase would be a good way to distinguish, and so would binary vs. text, so having these as OR conditions in the heuristics would be great. I'll double-check all the small license rules to confirm this holds, just to be sure.
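The OR-combined heuristics discussed above could be sketched roughly as follows. This is only an illustration, not ScanCode's actual implementation: the function names `has_odd_case` and `is_likely_false_positive`, the `is_binary` flag, and the `max_short_rule_length` threshold are all hypothetical.

```python
import re


def has_odd_case(matched_text: str) -> bool:
    """Return True if the letters are neither all upper, all lower, nor title case."""
    # Keep ASCII letters only, so digits and punctuation do not affect the check.
    letters = re.sub(r"[^A-Za-z]", "", matched_text)
    if not letters:
        return False
    return not (letters.isupper() or letters.islower() or letters.istitle())


def is_likely_false_positive(
    matched_text: str,
    rule_length: int,
    is_binary: bool,
    max_short_rule_length: int = 1,  # hypothetical threshold for "very short" rules
) -> bool:
    """Flag a match to a very short rule if it is in a binary OR has odd casing."""
    if rule_length > max_short_rule_length:
        # Longer rules are out of scope for this heuristic.
        return False
    return is_binary or has_odd_case(matched_text)


for text in ("gPL", "GPL", "gpl", "lGPl~="):
    print(text, has_odd_case(text))
```

One limitation of this naive case check: a legitimate spelling such as `GPLv2` mixes upper and lower case and would also be flagged, which is one reason the existing small rules would need to be reviewed before adopting anything like it.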

pombredanne commented 3 years ago

There are also things to consider beyond case, such as:

pombredanne commented 3 years ago

See also:

pombredanne commented 3 years ago

See also #797

pombredanne commented 3 years ago

@chinyeungli FYI

pombredanne commented 3 years ago

Another case: "LICENSE.gpl.\n\n3." should not be detected as gpl-3.0_rdesc_1.RULE because of the two empty lines.
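A check for this case might look like the sketch below. It assumes the matched text is available as a string with its original line breaks; `spans_blank_line` is a hypothetical helper, not an existing ScanCode function.

```python
def spans_blank_line(matched_text: str) -> bool:
    """Return True if the matched text contains an empty or whitespace-only line."""
    # Strip first so leading/trailing newlines around the match are not counted.
    lines = matched_text.strip().splitlines()
    return any(not line.strip() for line in lines)


print(spans_blank_line("LICENSE.gpl.\n\n3."))
print(spans_blank_line("GPL 3.0"))
```

A short rule whose match crosses a blank line could then be discarded or down-scored, on the assumption that a few tokens split across paragraphs are unlikely to be a real license reference.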

pombredanne commented 2 years ago

The attached binary contains three false positive detections: false-positive-in-binaries.zip

headers:
    -   tool_name: scancode-toolkit
        tool_version: 30.0.0
        options:
            input:
                - false-positive-in-binaries.zip
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2021-10-06T212057.579137'
        end_timestamp: '2021-10-06T212059.366303'
        output_format_version: 1.0.0
        duration: '1.7871878147125244'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.14'
            files_count: 1
files:
    -   path: false-positive-in-binaries.zip
        type: file
        licenses:
            -   key: apache-2.0
                score: '95.0'
                name: Apache License 2.0
                short_name: Apache 2.0
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Apache Software Foundation
                homepage_url: http://www.apache.org/licenses/
                text_url: http://www.apache.org/licenses/LICENSE-2.0
                reference_url: https://scancode-licensedb.aboutcode.org/apache-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.yml
                spdx_license_key: Apache-2.0
                spdx_url: https://spdx.org/licenses/Apache-2.0
                start_line: 1
                end_line: 1
                matched_rule:
                    identifier: apache-2.0_388.RULE
                    license_expression: apache-2.0
                    licenses:
                        - apache-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 1
                    matched_length: 1
                    match_coverage: '100.0'
                    rule_relevance: 95
                matched_text: ALv2@
            -   key: lgpl-2.0-plus
                score: '75.0'
                name: GNU Library General Public License 2.0 or later
                short_name: LGPL 2.0 or later
                category: Copyleft Limited
                is_exception: no
                is_unknown: no
                owner: Free Software Foundation (FSF)
                homepage_url: http://www.gnu.org/licenses/old-licenses/lgpl-2.0.html
                text_url: http://www.gnu.org/licenses/old-licenses/lgpl-2.0-standalone.html
                reference_url: https://scancode-licensedb.aboutcode.org/lgpl-2.0-plus
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/lgpl-2.0-plus.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/lgpl-2.0-plus.yml
                spdx_license_key: LGPL-2.0-or-later
                spdx_url: https://spdx.org/licenses/LGPL-2.0-or-later
                start_line: 3
                end_line: 3
                matched_rule:
                    identifier: lgpl_bare_single_word.RULE
                    license_expression: lgpl-2.0-plus
                    licenses:
                        - lgpl-2.0-plus
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 1
                    matched_length: 1
                    match_coverage: '100.0'
                    rule_relevance: 75
                matched_text: lGPl~=
            -   key: gpl-2.0
                score: '50.0'
                name: GNU General Public License 2.0
                short_name: GPL 2.0
                category: Copyleft
                is_exception: no
                is_unknown: no
                owner: Free Software Foundation (FSF)
                homepage_url: http://www.gnu.org/licenses/gpl-2.0.html
                text_url: http://www.gnu.org/licenses/gpl-2.0.txt
                reference_url: https://scancode-licensedb.aboutcode.org/gpl-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/gpl-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/gpl-2.0.yml
                spdx_license_key: GPL-2.0-only
                spdx_url: https://spdx.org/licenses/GPL-2.0-only
                start_line: 4
                end_line: 4
                matched_rule:
                    identifier: gpl2_bare_word_only.RULE
                    license_expression: gpl-2.0
                    licenses:
                        - gpl-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: yes
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 1
                    matched_length: 1
                    match_coverage: '100.0'
                    rule_relevance: 50
                matched_text: GPL2\
        license_expressions:
            - apache-2.0
            - lgpl-2.0-plus
            - gpl-2.0
        percentage_of_license_text: '50.0'
        scan_errors: []
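All three detections above share `rule_length: 1` and the `2-aho` matcher. As a rough illustration of how such candidates could be pulled out of scan results for review, the sketch below walks a hand-copied subset of the output as a plain Python dict. The `single_token_matches` helper is hypothetical, and only the fields it uses are reproduced.

```python
from typing import Iterator

# Hand-copied subset of the scan output above (only the fields used here).
SCAN = {
    "files": [
        {
            "path": "false-positive-in-binaries.zip",
            "licenses": [
                {
                    "key": "apache-2.0",
                    "matched_text": "ALv2@",
                    "matched_rule": {"identifier": "apache-2.0_388.RULE", "rule_length": 1},
                },
                {
                    "key": "lgpl-2.0-plus",
                    "matched_text": "lGPl~=",
                    "matched_rule": {"identifier": "lgpl_bare_single_word.RULE", "rule_length": 1},
                },
                {
                    "key": "gpl-2.0",
                    "matched_text": "GPL2\\",
                    "matched_rule": {"identifier": "gpl2_bare_word_only.RULE", "rule_length": 1},
                },
            ],
        }
    ]
}


def single_token_matches(scan: dict) -> Iterator[tuple[str, str, str]]:
    """Yield (path, rule identifier, matched text) for matches to one-token rules."""
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses", []):
            rule = match.get("matched_rule", {})
            if rule.get("rule_length") == 1:
                yield scanned_file["path"], rule["identifier"], match["matched_text"]


for path, rule_id, text in single_token_matches(SCAN):
    print(f"{path}: {rule_id} matched {text!r}")
```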