aboutcode-org / scancode-toolkit


Duplicate LicenseInfoInFile in SPDX #3255

Closed by vargenau 7 months ago

vargenau commented 1 year ago

Description

Not a real bug, but why is LicenseInfoInFile duplicated for the same file (GFDL-1.1-or-later twice, GPL-2.0-or-later 3 times)?

LicenseInfoInFile: Apache-2.0
LicenseInfoInFile: CC-BY-2.0
LicenseInfoInFile: GFDL-1.1-or-later
LicenseInfoInFile: GFDL-1.1-or-later
LicenseInfoInFile: GPL-1.0-or-later
LicenseInfoInFile: GPL-2.0-or-later
LicenseInfoInFile: GPL-2.0-or-later
LicenseInfoInFile: GPL-2.0-or-later

How To Reproduce

svn checkout https://svn.code.sf.net/p/phpwiki/code/trunk phpwiki
./scancode -c -l -i --license-text --spdx-tv phpwiki.spdx phpwiki

Resulting SPDX file:

phpwiki.spdx.txt

System configuration

./scancode --version
ScanCode version: 32.0.0rc1
ScanCode Output Format version: 3.0.0
SPDX License list version: 3.19

Ubuntu 22.10

pombredanne commented 1 year ago

@vargenau this could be because the licenses are detected multiple times. Note that you can use the YAML or pretty-printed JSON output with the extra diagnostics and matched-text details to see what the issue may be. Below is what the YAML looks like when scanning the file at https://sourceforge.net/p/phpwiki/code/HEAD/tree/trunk/configurator.php.

We have a first detection with two GPL matches at https://github.com/pombredanne/svn.code.sf.net-p-phpwiki-code/blob/master/configurator.php#L11-L25

Then we have a single GPL match at https://github.com/pombredanne/svn.code.sf.net-p-phpwiki-code/blob/master/configurator.php#L1378

And then some license comments at https://github.com/pombredanne/svn.code.sf.net-p-phpwiki-code/blob/master/configurator.php#L1388L1395 which are not detected correctly (and which are, in all honesty, not exactly clear either: for instance, at https://github.com/pombredanne/svn.code.sf.net-p-phpwiki-code/blob/master/configurator.php#L1394 "Creative Commons License 2.0" does not mean much of anything).

The detections in these license comments/suggestions could be considered false positives.

headers:
    -   tool_name: scancode-toolkit
        tool_version: v31.2.3-379-g6358a4b81d
        options:
            input:
                - configurator.php
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2023-02-25T190554.029923'
        end_timestamp: '2023-02-25T190600.136639'
        output_format_version: 3.0.0
        duration: '6.106726169586182'
        message:
        errors: []
        warnings: []
        extra_data:
            system_environment:
                operating_system: linux
                cpu_architecture: 64
                platform: Linux-4.15.0-202-generic-x86_64-with-glibc2.23
                platform_version: '#213~16.04.1-Ubuntu SMP Wed Jan 11 10:59:04 UTC 2023'
                python_version: "3.9.10 (main, Jan 29 2022, 10:01:49) \n[GCC 5.4.0 20160609]"
            spdx_license_list_version: '3.19'
            files_count: 1
license_detections:
    -   identifier: gpl_2_0_plus-09165bba-7b1b-0ff0-bdda-dbdcb89da5e8
        license_expression: gpl-2.0-plus
        count: 1
        detection_log:
            - not-combined
        matches:
            -   score: '98.17'
                start_line: 11
                end_line: 23
                matched_length: 107
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: gpl-2.0-plus
                rule_identifier: gpl-2.0-plus_1078.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl-2.0-plus_1078.RULE
            -   score: '100.0'
                start_line: 25
                end_line: 25
                matched_length: 8
                match_coverage: '100.0'
                matcher: 1-spdx-id
                license_expression: gpl-2.0-plus
                rule_identifier: spdx-license-identifier-gpl-2.0-plus-a72d250698ecf7ac942b919f4caaaef61adb1ead
                rule_url:
    -   identifier: gpl_1_0_plus-06400413-49a2-669d-9d2d-6c6d3f5aa266
        license_expression: gpl-1.0-plus
        count: 1
        detection_log:
            - not-combined
        matches:
            -   score: '100.0'
                start_line: 1378
                end_line: 1378
                matched_length: 4
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: gpl-1.0-plus
                rule_identifier: gpl_63.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_63.RULE
    -   identifier: gpl_2_0_plus_and_gfdl_1_1_plus_and_unknown_license_reference_and_cc_by_2_0-62618cb9-6dea-9376-b51c-b7353678d45a
        license_expression: gpl-2.0-plus AND gfdl-1.1-plus AND unknown-license-reference AND
            cc-by-2.0
        count: 1
        detection_log:
            - possible-false-positive
            - not-license-clues-as-more-detections-present
        matches:
            -   score: '50.0'
                start_line: 1388
                end_line: 1393
                matched_length: 10
                match_coverage: '50.0'
                matcher: 3-seq
                license_expression: gpl-2.0-plus
                rule_identifier: gpl-2.0-plus_650.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl-2.0-plus_650.RULE
            -   score: '100.0'
                start_line: 1392
                end_line: 1392
                matched_length: 4
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: gfdl-1.1-plus
                rule_identifier: gfdl-1.1-plus_10.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gfdl-1.1-plus_10.RULE
            -   score: '100.0'
                start_line: 1393
                end_line: 1393
                matched_length: 7
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: gfdl-1.1-plus
                rule_identifier: gfdl-1.1-plus_24.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gfdl-1.1-plus_24.RULE
            -   score: '80.0'
                start_line: 1394
                end_line: 1394
                matched_length: 3
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: unknown-license-reference
                rule_identifier: unknown-license-reference_333.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/unknown-license-reference_333.RULE
            -   score: '100.0'
                start_line: 1395
                end_line: 1395
                matched_length: 7
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: cc-by-2.0
                rule_identifier: cc-by-2.0_url_glc_55.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/cc-by-2.0_url_glc_55.RULE
    -   identifier: apache_2_0-d66ab77d-a5cc-7104-e702-dc7df61fe9e8
        license_expression: apache-2.0
        count: 1
        detection_log:
            - possible-false-positive
            - not-license-clues-as-more-detections-present
        matches:
            -   score: '100.0'
                start_line: 1468
                end_line: 1468
                matched_length: 3
                match_coverage: '100.0'
                matcher: 2-aho
                license_expression: apache-2.0
                rule_identifier: spdx_license_id_apache-2.0_for_apache-2.0.RULE
                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_apache-2.0_for_apache-2.0.RULE

files:
    -   path: configurator.php
        type: file
        detected_license_expression: gpl-2.0-plus AND gpl-1.0-plus AND (gpl-2.0-plus AND gfdl-1.1-plus
            AND unknown-license-reference AND cc-by-2.0) AND apache-2.0
        detected_license_expression_spdx: GPL-2.0-or-later AND GPL-1.0-or-later AND (GPL-2.0-or-later
            AND GFDL-1.1-or-later AND LicenseRef-scancode-unknown-license-reference AND CC-BY-2.0)
            AND Apache-2.0
        license_detections:
            -   license_expression: gpl-2.0-plus
                detection_log:
                    - not-combined
                matches:
                    -   score: '98.17'
                        start_line: 11
                        end_line: 23
                        matched_length: 107
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: gpl-2.0-plus
                        rule_identifier: gpl-2.0-plus_1078.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl-2.0-plus_1078.RULE
                        matched_text: |
                            is free software; you can redistribute it and/or modify
                             * it under the terms of the GNU General Public License as published by
                             * the Free Software Foundation; either version 2 of the License, or
                             * (at your option) any later version.
                             *
                             * [PhpWiki] is distributed in the hope that it will be useful,
                             * but WITHOUT ANY WARRANTY; without even the implied warranty of
                             * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
                             * GNU General Public License for more details.
                             *
                             * You should have received a copy of the GNU General Public License along
                             * with [PhpWiki]; if not, write to the Free Software Foundation, Inc.,
                             * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
                    -   score: '100.0'
                        start_line: 25
                        end_line: 25
                        matched_length: 8
                        match_coverage: '100.0'
                        matcher: 1-spdx-id
                        license_expression: gpl-2.0-plus
                        rule_identifier: spdx-license-identifier-gpl-2.0-plus-3f844e1a237b3ca425edf1127a3c075a0a0c1de6
                        rule_url:
                        matched_text: 'SPDX-License-Identifier: GPL-2.0-or-later'
            -   license_expression: gpl-1.0-plus
                detection_log:
                    - not-combined
                matches:
                    -   score: '100.0'
                        start_line: 1378
                        end_line: 1378
                        matched_length: 4
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: gpl-1.0-plus
                        rule_identifier: gpl_63.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_63.RULE
                        matched_text: GNU General Public License", "
            -   license_expression: gpl-2.0-plus AND gfdl-1.1-plus AND unknown-license-reference
                    AND cc-by-2.0
                detection_log:
                    - possible-false-positive
                    - not-license-clues-as-more-detections-present
                matches:
                    -   score: '50.0'
                        start_line: 1388
                        end_line: 1393
                        matched_length: 10
                        match_coverage: '50.0'
                        matcher: 3-seq
                        license_expression: gpl-2.0-plus
                        rule_identifier: gpl-2.0-plus_650.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl-2.0-plus_650.RULE
                        matched_text: |
                            https://www.gnu.org/copyleft/gpl.html#[SEC1]", "

                            [Other] [useful] [alternatives] [to] [consider]:
                            <pre>
                             [COPYRIGHTPAGE]_[TITLE] = \"GNU [Free] [Documentation] [License]\"
                             [COPYRIGHTPAGE]_[URL] = \"[https]://[www].[gnu].org/copyleft/
                    -   score: '100.0'
                        start_line: 1392
                        end_line: 1392
                        matched_length: 4
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: gfdl-1.1-plus
                        rule_identifier: gfdl-1.1-plus_10.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gfdl-1.1-plus_10.RULE
                        matched_text: GNU Free Documentation License\"
                    -   score: '100.0'
                        start_line: 1393
                        end_line: 1393
                        matched_length: 7
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: gfdl-1.1-plus
                        rule_identifier: gfdl-1.1-plus_24.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gfdl-1.1-plus_24.RULE
                        matched_text: https://www.gnu.org/copyleft/fdl.html\"
                    -   score: '80.0'
                        start_line: 1394
                        end_line: 1394
                        matched_length: 3
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: unknown-license-reference
                        rule_identifier: unknown-license-reference_333.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/unknown-license-reference_333.RULE
                        matched_text: License 2.0\"
                    -   score: '100.0'
                        start_line: 1395
                        end_line: 1395
                        matched_length: 7
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: cc-by-2.0
                        rule_identifier: cc-by-2.0_url_glc_55.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/cc-by-2.0_url_glc_55.RULE
                        matched_text: https://creativecommons.org/licenses/by/2.0/\"</
            -   license_expression: apache-2.0
                detection_log:
                    - possible-false-positive
                    - not-license-clues-as-more-detections-present
                matches:
                    -   score: '100.0'
                        start_line: 1468
                        end_line: 1468
                        matched_length: 3
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: apache-2.0
                        rule_identifier: spdx_license_id_apache-2.0_for_apache-2.0.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_apache-2.0_for_apache-2.0.RULE
                        matched_text: Apache >= 2.0.
        license_clues: []
        percentage_of_license_text: '1.45'
        for_license_detections:
            - gpl_2_0_plus-09165bba-7b1b-0ff0-bdda-dbdcb89da5e8
            - gpl_1_0_plus-06400413-49a2-669d-9d2d-6c6d3f5aa266
            - gpl_2_0_plus_and_gfdl_1_1_plus_and_unknown_license_reference_and_cc_by_2_0-62618cb9-6dea-9376-b51c-b7353678d45a
            - apache_2_0-d66ab77d-a5cc-7104-e702-dc7df61fe9e8
        scan_errors: []
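
One way to read the scan above: if each match contributed its own LicenseInfoInFile line, the repeats would follow directly from the match list. The sketch below only illustrates that assumption. `file_scan` is a hand-trimmed, hypothetical copy of the `files` entry above (keeping just `detection_log` and each match's `license_expression`), and `count_match_expressions` is a made-up helper, not part of ScanCode.

```python
from collections import Counter

# Hypothetical, trimmed-down copy of the per-file scan data shown above.
# Only the detection log and each match's license_expression are kept.
file_scan = {
    "path": "configurator.php",
    "license_detections": [
        {
            "detection_log": ["not-combined"],
            "matches": [
                {"license_expression": "gpl-2.0-plus"},  # lines 11-23
                {"license_expression": "gpl-2.0-plus"},  # line 25
            ],
        },
        {
            "detection_log": ["not-combined"],
            "matches": [{"license_expression": "gpl-1.0-plus"}],  # line 1378
        },
        {
            "detection_log": [
                "possible-false-positive",
                "not-license-clues-as-more-detections-present",
            ],
            "matches": [
                {"license_expression": "gpl-2.0-plus"},  # lines 1388-1393
                {"license_expression": "gfdl-1.1-plus"},  # line 1392
                {"license_expression": "gfdl-1.1-plus"},  # line 1393
                {"license_expression": "unknown-license-reference"},  # line 1394
                {"license_expression": "cc-by-2.0"},  # line 1395
            ],
        },
        {
            "detection_log": [
                "possible-false-positive",
                "not-license-clues-as-more-detections-present",
            ],
            "matches": [{"license_expression": "apache-2.0"}],  # line 1468
        },
    ],
}


def count_match_expressions(file_data):
    """Count how often each license expression occurs across all matches."""
    counts = Counter()
    for detection in file_data["license_detections"]:
        for match in detection["matches"]:
            counts[match["license_expression"]] += 1
    return counts


counts = count_match_expressions(file_scan)
# Keep only the expressions that were matched more than once.
duplicated = {key: n for key, n in counts.items() if n > 1}
print(duplicated)  # {'gpl-2.0-plus': 3, 'gfdl-1.1-plus': 2}
```

Here gpl-2.0-plus is matched three times and gfdl-1.1-plus twice, which lines up with the three GPL-2.0-or-later and two GFDL-1.1-or-later lines reported in the SPDX output.
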

vargenau commented 1 year ago

Hi Philippe,

I agree "Creative Commons License 2.0" means nothing, I will replace it.

I understand that the license appears multiple times in the SPDX file because it was detected multiple times. But several identical lines add no value to the SPDX file. I would expect some postprocessing to remove the duplicates. This is already done for the top-level package PackageName: phpwiki, where the PackageLicenseInfoFromFiles entries are listed in alphabetical order without duplicates. You could do the same for each file.

As a side note, converting the SPDX file from tag:value to e.g. JSON and then back to tag:value with the online converter will remove the duplicates:

LicenseInfoInFile: Apache-2.0
LicenseInfoInFile: CC-BY-2.0
LicenseInfoInFile: GFDL-1.1-or-later
LicenseInfoInFile: GPL-1.0-or-later
LicenseInfoInFile: GPL-2.0-or-later
LicenseInfoInFile: LicenseRef-scancode-unknown-license-reference

But as already said, this is not a real bug, just a possible improvement.
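
A minimal sketch of the kind of postprocessing suggested above, working purely on the tag:value text. It assumes the LicenseInfoInFile lines of one file are contiguous, and `dedupe_license_info_in_file` is a hypothetical helper, not the fix that went into ScanCode or tools-python.

```python
def dedupe_license_info_in_file(lines):
    """Drop repeated LicenseInfoInFile lines within each contiguous run.

    Assumption: all LicenseInfoInFile lines for one file sit next to each
    other, so any other line marks the end of that file's run.
    """
    result = []
    seen = set()
    for line in lines:
        key = line.strip()
        if key.startswith("LicenseInfoInFile:"):
            if key in seen:
                continue  # same license already listed for this file
            seen.add(key)
        else:
            seen.clear()  # a different tag ends the current run
        result.append(line)
    return result


# The lines reported in the issue description.
reported = [
    "LicenseInfoInFile: Apache-2.0",
    "LicenseInfoInFile: CC-BY-2.0",
    "LicenseInfoInFile: GFDL-1.1-or-later",
    "LicenseInfoInFile: GFDL-1.1-or-later",
    "LicenseInfoInFile: GPL-1.0-or-later",
    "LicenseInfoInFile: GPL-2.0-or-later",
    "LicenseInfoInFile: GPL-2.0-or-later",
    "LicenseInfoInFile: GPL-2.0-or-later",
]

deduped = dedupe_license_info_in_file(reported)
print("\n".join(deduped))
```

The sketch keeps the first occurrence of each line and preserves the existing order; it does not sort, so alphabetical output relies on the input already being sorted, as it is here.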

vargenau commented 1 year ago

I checked the code; this should be fixed in tools-python.

https://github.com/spdx/tools-python/issues/508

vargenau commented 1 year ago

https://github.com/nexB/scancode-toolkit/issues/3289 will solve this issue.

vargenau commented 7 months ago

This is fixed in scancode-toolkit 32.1.0.

See: phpwiki-scancode-toolkit-32.1.0.spdx.txt