aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Wrong license detection in `xmlunit` #3211

Open bennati opened 1 year ago

bennati commented 1 year ago

Description

Scanning https://github.com/xmlunit/xmlunit leads to false positives. These wrong detections are also influenced by minimal changes in the files, which should not have an impact.

How To Reproduce

  1. Clone xmlunit from https://github.com/xmlunit/xmlunit
  2. Scan xmlunit with scancode: `./scancode -l --json-pp ./result .../xmlunit`
  3. Open result and search for xmlunit-legacy/pom.xml
  4. Search the detection that corresponds to lines 35 to 41
  5. Note that the detected license is apache-2.0
  6. Look at the file and verify that the correct detection should be BSD3
  7. Check out xmlunit v2.6.4: `git checkout v2.6.4`
  8. Scan xmlunit with scancode: `./scancode -l --json-pp ./result .../xmlunit`
  9. Open result and search for xmlunit-legacy/pom.xml
  10. Search the detection that corresponds to lines 35 to 41
  11. Note that the detected license is jsr-107-jcache-spec-2013
  12. Verify that the only change in xmlunit-legacy/pom.xml is in the URL
  13. Verify that the file referenced in the URL hasn't changed at all: `git diff main xmlunit-legacy/LICENSE.txt`
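
Steps 3 to 5 and 9 to 11 can also be checked with a small script instead of searching the result by hand. This is only a minimal sketch: it assumes the JSON result uses the same keys as the YAML output shown later in this thread (`files`, `license_detections`, `matches`, `start_line`, `end_line`), and other ScanCode versions may lay this out differently. The helper name `matches_in_range` is made up for illustration.

```python
def matches_in_range(scan, path_suffix, first_line, last_line):
    """Yield (path, match) for license matches overlapping first_line..last_line.

    `scan` is the parsed JSON result as a dict. The key names are assumed to
    follow the output shown in this thread.
    """
    for resource in scan.get("files", []):
        if not resource.get("path", "").endswith(path_suffix):
            continue
        for detection in resource.get("license_detections", []):
            for match in detection.get("matches", []):
                # Two line ranges overlap when each one starts before the other ends.
                if match["start_line"] <= last_line and match["end_line"] >= first_line:
                    yield resource["path"], match
```

With the result loaded through `json.load`, calling `matches_in_range(scan, "xmlunit-legacy/pom.xml", 35, 41)` and printing each match's `license_expression` and `rule_identifier` should show which rule produced the detection in each of the two checkouts.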


pombredanne commented 1 year ago

As a recap:

  1. We have an inaccurate detection to fix with a new rule
  2. You should use the --package option to get correct and complete license detection

In detail:

When scanning https://raw.githubusercontent.com/xmlunit/xmlunit/04aa11879d86135c37f5af8fd5694bf08d08972d/xmlunit-legacy/pom.xml as a plain text file, we get this:

headers:
    -   tool_name: scancode-toolkit
        tool_version: v31.2.3-372-g18a842e769
        options:
            input:
                - /home/pombreda/tmp/xmlunit-04aa11/xmlunit-legacy/pom.xml
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
[..........]

files:
    -   path: pom.xml
        type: file
        detected_license_expression: apache-2.0 AND (apache-2.0 AND bsd-new)
        detected_license_expression_spdx: Apache-2.0 AND (Apache-2.0 AND BSD-3-Clause)
        license_detections:
            -   license_expression: apache-2.0
                detection_log:
                    - not-combined
                matches:
                    -   score: '97.7'
                        start_line: 3
                        end_line: 13
                        matched_length: 85
                        match_coverage: '100.0'
                        matcher: 3-seq
                        license_expression: apache-2.0
                        rule_identifier: apache-2.0_7.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/apache-2.0_7.RULE
                        matched_text: |
                            licensed [to] [You] under the Apache License, Version 2.0
                              (the "License"); you may not use this file except in compliance with
                              the License.  You may obtain a copy of the License at

                              http://www.apache.org/licenses/LICENSE-2.0

                              Unless required by applicable law or agreed to in writing, software
                              distributed under the License is distributed on an "AS IS" BASIS,
                              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
                              See the License for the specific language governing permissions and
                              limitations under the License.
            -   license_expression: apache-2.0 AND bsd-new
                detection_log:
                    - not-combined
                matches:
                    -   score: '50.0'
                        start_line: 35
                        end_line: 41
                        matched_length: 11
                        match_coverage: '50.0'
                        matcher: 3-seq
                        license_expression: apache-2.0
                        rule_identifier: apache-2.0_839.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/apache-2.0_839.RULE
                        matched_text: |
                            licenses>
                                <license>
                                  <name>[The] [BSD] [3]-[Clause] License</[name]>
                                  <[url]>[https]://[github].[com]/[xmlunit]/[xmlunit]/[blob]/[main]/[xmlunit]-[legacy]/[LICENSE].txt</url>
                                  <distribution>repo</distribution>
                                </license>
                              </licenses>
                    -   score: '100.0'
                        start_line: 37
                        end_line: 37
                        matched_length: 5
                        match_coverage: '100.0'
                        matcher: 2-aho
                        license_expression: bsd-new
                        rule_identifier: bsd-new_364.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_364.RULE
                        matched_text: The BSD 3-Clause License</
        license_clues: []
        percentage_of_license_text: '30.12'
        for_license_detections:
            - apache_2_0-fc261552-78a1-c631-caec-25ea90dee31f
            - apache_2_0_and_bsd_new-5fa1f720-156d-ec9a-965f-f4b1a5afdf86
        scan_errors: []

This is a bug all right, as the second match is incorrect. The Apache license is incorrectly detected in this text (the parts in brackets are the words that were not matched):

                        matched_text: |
                            licenses>
                                <license>
                                  <name>[The] [BSD] [3]-[Clause] License</[name]>
                                  <[url]>[https]://[github].[com]/[xmlunit]/[xmlunit]/[blob]/[main]/[xmlunit]-[legacy]/[LICENSE].txt</url>
                                  <distribution>repo</distribution>
                                </license>
                              </licenses>

But even then, the detection ends up correctly reported at the file level:

detected_license_expression: apache-2.0 AND (apache-2.0 AND bsd-new)
detected_license_expression_spdx: Apache-2.0 AND (Apache-2.0 AND BSD-3-Clause)
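
To see how much of the matched region the Apache rule actually missed, the bracketed words in the `matched_text` diagnostics can be pulled out and counted. The following is a rough sketch, not part of ScanCode: the helper name `unmatched_words` is hypothetical, and it relies only on the bracket convention visible in the `--license-text-diagnostics` output above.

```python
import re

# In the diagnostics output above, words inside the matched region that the
# rule did not match are wrapped in square brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def unmatched_words(matched_text):
    """Return the bracketed (unmatched) words of a diagnostic matched_text."""
    return BRACKETED.findall(matched_text)


sample = "<name>[The] [BSD] [3]-[Clause] License</[name]>"
print(unmatched_words(sample))  # ['The', 'BSD', '3', 'Clause', 'name']
```

A match where most of the meaningful words are bracketed, as with the `<name>` and `<url>` lines here, is a good candidate for a new rule.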

When scanning as a package with --package I get this:

headers:
    -   tool_name: scancode-toolkit
        tool_version: v31.2.3-372-g18a842e769
        options:
            input:
                - /home/pombreda/tmp/xmlunit-04aa11/xmlunit-legacy/pom.xml
            --package: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2023-01-20T101831.060085'
        end_timestamp: '2023-01-20T101834.719757'
        output_format_version: 3.0.0
        duration: '3.659682273864746'
        message:
        errors: []
        warnings: []
        extra_data:
            system_environment:
                operating_system: linux
                cpu_architecture: 64
                platform: Linux-4.15.0-200-generic-x86_64-with-glibc2.23
                platform_version: '#211~16.04.2-Ubuntu SMP Fri Nov 25 09:18:48 UTC 2022'
                python_version: "3.9.10 (main, Jan 29 2022, 10:01:49) \n[GCC 5.4.0 20160609]"
            spdx_license_list_version: '3.19'
            files_count: 1
packages:
    -   type: maven
        namespace: org.xmlunit
        name: xmlunit-legacy
        version: 2.9.2-SANPSHOT
        qualifiers: {}
        subpath:
        primary_language: Java
        description: |
            org.xmlunit:xmlunit-legacy
            XMLUnit 1.x Compatibility Layer
        release_date:
        parties: []
        keywords: []
        homepage_url: https://www.xmlunit.org/
        download_url:
        size:
        sha1:
        md5:
        sha256:
        sha512:
        bug_tracking_url:
        code_view_url:
        vcs_url:
        copyright:
        declared_license_expression: bsd-new
        declared_license_expression_spdx: BSD-3-Clause
        license_detections:
            -   license_expression: bsd-new
                detection_log:
                    - not-combined
                matches:
                    -   score: '100.0'
                        start_line: 1
                        end_line: 1
                        matched_length: 5
                        match_coverage: '100.0'
                        matcher: 1-hash
                        license_expression: bsd-new
                        rule_identifier: bsd-new_364.RULE
                        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_364.RULE
                        matched_text: The BSD 3-Clause License
        other_license_expression:
        other_license_expression_spdx:
        other_license_detections: []
        extracted_license_statement: '[{''name'': ''The BSD 3-Clause License'', ''url'': ''https://github.com/xmlunit/xmlunit/blob/main/xmlunit-legacy/LICENSE.txt'',
            ''comments'': None, ''distribution'': ''repo''}]'
        notice_text:
        source_packages:
            - pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT?classifier=sources
        extra_data: {}
        repository_homepage_url: https://repo1.maven.org/maven2/org/xmlunit/xmlunit-legacy/2.9.2-SANPSHOT/
        repository_download_url: https://repo1.maven.org/maven2/org/xmlunit/xmlunit-legacy/2.9.2-SANPSHOT/xmlunit-legacy-2.9.2-SANPSHOT.jar
        api_data_url: https://repo1.maven.org/maven2/org/xmlunit/xmlunit-legacy/2.9.2-SANPSHOT/xmlunit-legacy-2.9.2-SANPSHOT.pom
        package_uid: pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT?uuid=e54b35fe-76ae-4597-8b69-875dd8a9afd9
        datafile_paths:
            - pom.xml
        datasource_ids:
            - maven_pom
        purl: pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT
dependencies:
    -   purl: pkg:maven/org.xmlunit/xmlunit-core
        extracted_requirement:
        scope: compile
        is_runtime: no
        is_optional: yes
        is_resolved: no
        resolved_package: {}
        extra_data: {}
        dependency_uid: pkg:maven/org.xmlunit/xmlunit-core?uuid=8682618d-c265-4d7c-8cd2-1feb5488f859
        for_package_uid: pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT?uuid=e54b35fe-76ae-4597-8b69-875dd8a9afd9
        datafile_path: pom.xml
        datasource_id: maven_pom
    -   purl: pkg:maven/junit/junit@3.8.1
        extracted_requirement: 3.8.1
        scope: compile
        is_runtime: no
        is_optional: yes
        is_resolved: yes
        resolved_package: {}
        extra_data: {}
        dependency_uid: pkg:maven/junit/junit@3.8.1?uuid=3cae6762-e0a4-451a-b1d5-365f8f511d60
        for_package_uid: pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT?uuid=e54b35fe-76ae-4597-8b69-875dd8a9afd9
        datafile_path: pom.xml
        datasource_id: maven_pom
    -   purl: pkg:maven/org.mockito/mockito-core
        extracted_requirement:
        scope: test
        is_runtime: no
        is_optional: yes
        is_resolved: no
        resolved_package: {}
        extra_data: {}
        dependency_uid: pkg:maven/org.mockito/mockito-core?uuid=10211b7c-b7f6-4621-845e-11111c16e38c
        for_package_uid: pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT?uuid=e54b35fe-76ae-4597-8b69-875dd8a9afd9
        datafile_path: pom.xml
        datasource_id: maven_pom
license_references:
    -   key: bsd-new
        language: en
        short_name: BSD-3-Clause
        name: BSD-3-Clause
        category: Permissive
        owner: Regents of the University of California
        homepage_url: http://www.opensource.org/licenses/BSD-3-Clause
        notes: Per SPDX.org, this license is OSI certified.
        is_builtin: yes
        is_exception: no
        is_unknown: no
        is_generic: no
        spdx_license_key: BSD-3-Clause
        other_spdx_license_keys:
            - LicenseRef-scancode-libzip
        osi_license_key: BSD-3-Clause
        text_urls:
            - http://www.opensource.org/licenses/BSD-3-Clause
        osi_url: http://www.opensource.org/licenses/BSD-3-Clause
        faq_url:
        other_urls:
            - http://framework.zend.com/license/new-bsd
            - https://opensource.org/licenses/BSD-3-Clause
            - https://www.eclipse.org/org/documents/edl-v10.php
        key_aliases: []
        minimum_coverage: '0'
        standard_notice:
        ignorable_copyrights: []
        ignorable_holders: []
        ignorable_authors: []
        ignorable_urls: []
        ignorable_emails: []
        text: |
            Redistribution and use in source and binary forms, with or without modification,
            are permitted provided that the following conditions are met:

            Redistributions of source code must retain the above copyright notice, this list
            of conditions and the following disclaimer.

            Redistributions in binary form must reproduce the above copyright notice, this
            list of conditions and the following disclaimer in the documentation and/or
            other materials provided with the distribution.

            Neither the name of the ORGANIZATION nor the names of its contributors may be
            used to endorse or promote products derived from this software without specific
            prior written permission.

            THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
            "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
            THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
            ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS
            BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
            CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
            GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
            HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
            LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
            THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
        scancode_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-new.LICENSE
        licensedb_url: https://scancode-licensedb.aboutcode.org/bsd-new
        spdx_url: https://spdx.org/licenses/BSD-3-Clause
license_rule_references:
    -   license_expression: bsd-new
        identifier: bsd-new_364.RULE
        language: en
        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_364.RULE
        is_license_text: no
        is_license_notice: no
        is_license_reference: yes
        is_license_tag: no
        is_license_intro: no
        is_continuous: no
        is_builtin: yes
        is_from_license: no
        is_synthetic: no
        length: 5
        relevance: 100
        minimum_coverage: 80
        referenced_filenames: []
        notes:
        ignorable_copyrights: []
        ignorable_holders: []
        ignorable_authors: []
        ignorable_urls: []
        ignorable_emails: []
        text: The BSD 3-Clause License
files:
    -   path: pom.xml
        type: file
        package_data:
            -   type: maven
                namespace: org.xmlunit
                name: xmlunit-legacy
                version: 2.9.2-SANPSHOT
                qualifiers: {}
                subpath:
                primary_language: Java
                description: |
                    org.xmlunit:xmlunit-legacy
                    XMLUnit 1.x Compatibility Layer
                release_date:
                parties: []
                keywords: []
                homepage_url: https://www.xmlunit.org/
                download_url:
                size:
                sha1:
                md5:
                sha256:
                sha512:
                bug_tracking_url:
                code_view_url:
                vcs_url:
                copyright:
                declared_license_expression: bsd-new
                declared_license_expression_spdx: BSD-3-Clause
                license_detections:
                    -   license_expression: bsd-new
                        detection_log:
                            - not-combined
                        matches:
                            -   score: '100.0'
                                start_line: 1
                                end_line: 1
                                matched_length: 5
                                match_coverage: '100.0'
                                matcher: 1-hash
                                license_expression: bsd-new
                                rule_identifier: bsd-new_364.RULE
                                rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_364.RULE
                                matched_text: The BSD 3-Clause License
                other_license_expression:
                other_license_expression_spdx:
                other_license_detections: []
                extracted_license_statement: '[{''name'': ''The BSD 3-Clause License'', ''url'':
                    ''https://github.com/xmlunit/xmlunit/blob/main/xmlunit-legacy/LICENSE.txt'',
                    ''comments'': None, ''distribution'': ''repo''}]'
                notice_text:
                source_packages:
                    - pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT?classifier=sources
                file_references: []
                extra_data: {}
                dependencies:
                    -   purl: pkg:maven/org.xmlunit/xmlunit-core
                        extracted_requirement:
                        scope: compile
                        is_runtime: no
                        is_optional: yes
                        is_resolved: no
                        resolved_package: {}
                        extra_data: {}
                    -   purl: pkg:maven/junit/junit@3.8.1
                        extracted_requirement: 3.8.1
                        scope: compile
                        is_runtime: no
                        is_optional: yes
                        is_resolved: yes
                        resolved_package: {}
                        extra_data: {}
                    -   purl: pkg:maven/org.mockito/mockito-core
                        extracted_requirement:
                        scope: test
                        is_runtime: no
                        is_optional: yes
                        is_resolved: no
                        resolved_package: {}
                        extra_data: {}
                repository_homepage_url: https://repo1.maven.org/maven2/org/xmlunit/xmlunit-legacy/2.9.2-SANPSHOT/
                repository_download_url: https://repo1.maven.org/maven2/org/xmlunit/xmlunit-legacy/2.9.2-SANPSHOT/xmlunit-legacy-2.9.2-SANPSHOT.jar
                api_data_url: https://repo1.maven.org/maven2/org/xmlunit/xmlunit-legacy/2.9.2-SANPSHOT/xmlunit-legacy-2.9.2-SANPSHOT.pom
                datasource_id: maven_pom
                purl: pkg:maven/org.xmlunit/xmlunit-legacy@2.9.2-SANPSHOT
        for_packages: []
        scan_errors: []

This is better, as only bsd-new is reported, and worse, as the Apache license of the POM file itself is missed, since we are not looking into comments (yet).

In general, the license of a package manifest is best collected with --package, which knows about the manifest structure. See https://github.com/nexB/scancode-toolkit/issues/707 for the longer story behind this, and also https://github.com/nexB/scancode-toolkit/issues/3024

And also https://github.com/nexB/scancode-toolkit/issues/2294 by @sschuberth and https://github.com/nexB/scancode-toolkit/issues/2552 by @hanna-modica

In contrast, the simple --license does not know that a pom is a pom. It just knows it is text.
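
To compare the two views side by side, the package-level and file-level expressions can be collected from one parsed scan. This is an illustrative sketch only: the helper name `license_summary` is made up, and it assumes a JSON result whose keys match the YAML shown above (`packages` with `purl` and `declared_license_expression`, `files` with `path` and `detected_license_expression`).

```python
def license_summary(scan):
    """Collect package-level and file-level license expressions from a scan.

    Returns two dicts: purl -> declared expression (from --package) and
    path -> detected expression (from --license). Keys that are absent in the
    scan simply map to None.
    """
    declared = {
        package.get("purl"): package.get("declared_license_expression")
        for package in scan.get("packages", [])
    }
    detected = {
        resource.get("path"): resource.get("detected_license_expression")
        for resource in scan.get("files", [])
    }
    return declared, detected
```

For the POM discussed here, the first dict would carry `bsd-new` for the `xmlunit-legacy` package, while the second would carry the text-based expression for `pom.xml` when the scan was also run with --license.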