clearlydefined / service

The service side of clearlydefined.io
MIT License

Investigate regression cases found in integration tests after updating ScanCode #1183

Open qtomlinson opened 3 months ago

qtomlinson commented 3 months ago

This comes from the discussion on the PR that integrates the new ScanCode, specifically the license differences seen in integration tests before and after integrating ScanCode v32.

  1. nuget/nuget/-/NuGet.Protocol/6.7.1. See discussion on root cause at https://github.com/clearlydefined/service/pull/1056#issuecomment-2209603879.
    • expected: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0", ...}
    • actual: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0 AND (ECL-2.0 AND Apache-2.0)", ...}
    • The clearlydefined/downloaded/LICENSE is the license obtained from https://licenses.nuget.org/Apache-2.0 (licenseUrl from the component manifest)
  2. pypi/pypi/-/sdbus/0.12.0. Need to investigate the root cause and fix.
    • expected: declared: 'GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-or-later'
    • actual: declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
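
To see which identifiers were added for sdbus, the two declared expressions above can be compared term by term. This is a minimal sketch that assumes both expressions are flat, AND-only lists (as they are here); the helper name 'and_terms' is hypothetical, and this is not a general SPDX expression parser.

```python
def and_terms(expression):
    """Split a flat, AND-only license expression into its identifiers."""
    return {term.strip() for term in expression.split(" AND ") if term.strip()}

expected = "GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-or-later"
actual = ("GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND "
          "LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0")

# Identifiers present only in the new result, and only in the old one.
added = sorted(and_terms(actual) - and_terms(expected))
removed = sorted(and_terms(expected) - and_terms(actual))
print("added:", added)
print("removed:", removed)
```

For these two expressions nothing is removed; the new result only adds identifiers.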
qtomlinson commented 3 months ago

@elrayle @yashkohli88

yashkohli88 commented 2 months ago

I implemented a change to resolve the license issue for the NuGet.Protocol coordinate. The code now considers a match only if its score is greater than 80%. However, this caused other components to fail in the places listed below.

1) pypi/pypi/-/platformdirs/4.2.0 - ScanCode reports LicenseRef-scancode-unknown-license-reference in the PKG-INFO file with a score of 100. This adds LicenseRef-scancode-unknown-license-reference to the list of discovered licenses.

Scancode result -

{
    "license_expression": "unknown-license-reference",
    "license_expression_spdx": "LicenseRef-scancode-unknown-license-reference",
    "from_file": "cd-aYG6pL/platformdirs-4.2.0/PKG-INFO",
    "start_line": 11,
    "end_line": 11,
    "matcher": "2-aho",
    "score": 100,
    "matched_length": 3,
    "match_coverage": 100,
    "rule_relevance": 100,
    "rule_identifier": "unknown-license-reference_see_license_at_manifest_1.RULE",
    "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/unknown-license-reference_see_license_at_manifest_1.RULE",
    "matched_text": "License-File: LICENSE",
    "matched_text_diagnostics": "License-File: LICENSE"
}
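
The score filtering described above, and the reason this particular match still gets through it, can be sketched as follows. This is a minimal illustration and not the code from the linked PR: the function name 'filter_matches', the constant 'MIN_SCORE', and the second low-scoring sample match are assumptions made for the example. Only the 'score', 'license_expression_spdx' and 'matched_text' fields and the threshold of 80 come from this thread.

```python
MIN_SCORE = 80  # threshold from the comment above; the constant name is hypothetical

def filter_matches(matches, min_score=MIN_SCORE):
    """Keep only matches whose score is strictly greater than min_score."""
    return [m for m in matches if m.get("score", 0) > min_score]

matches = [
    # Abbreviated from the ScanCode result shown above.
    {"license_expression_spdx": "LicenseRef-scancode-unknown-license-reference",
     "score": 100, "matched_text": "License-File: LICENSE"},
    # Invented low-confidence match, for illustration only.
    {"license_expression_spdx": "LicenseRef-example-low-score", "score": 50},
]

kept = [m["license_expression_spdx"] for m in filter_matches(matches)]
print(kept)
```

Because the unknown-license-reference match has a score of 100, a score threshold alone cannot remove it.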

Below is the "licensed" section of the definition produced by the changed code.

"licensed": {
    "declared": "MIT",
    "toolScore": {
        "total": 45,
        "declared": 30,
        "discovered": 0,
        "consistency": 0,
        "spdx": 15,
        "texts": 0
    },
    "facets": {
        "core": {
            "attribution": {
                "unknown": 22
            },
            "discovered": {
                "unknown": 19,
                "expressions": [
                    "LicenseRef-scancode-unknown-license-reference AND MIT",
                    "MIT"
                ]
            },
            "files": 22
        }
    },
    "score": {
        "total": 45,
        "declared": 30,
        "discovered": 0,
        "consistency": 0,
        "spdx": 15,
        "texts": 0
    }
}
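
In the sample above, the total equals the sum of the five component scores (30 + 15 = 45). A quick sanity check along these lines is sketched below, assuming the total is the plain sum of the components. That holds for this sample, but it is treated here as an assumption rather than a confirmed rule.

```python
# Values copied from the "score" block above.
score = {
    "total": 45,
    "declared": 30,
    "discovered": 0,
    "consistency": 0,
    "spdx": 15,
    "texts": 0,
}

components = ("declared", "discovered", "consistency", "spdx", "texts")
component_sum = sum(score[name] for name in components)
print(component_sum == score["total"])
```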

2) 'conda/conda-forge/linux-aarch64/numpy/1.16.6-py36hdc1b780_0' - In many instances 'NOASSERTION' has been replaced by 'LicenseRef-scancode-unknown-license-reference' (or by 'LicenseRef-scancode-free-unknown', as in several entries below). In some places this unknown license expression has been newly added. Below is the comparison from the integration test.

expected: {"path":"info/about.json","license":"BSD-3-Clause","hashes":{"sha1":"75bee71c98128117d0a567f2ad35cd01f75750e0","sha256":"5f961516903bac3ca1dd9111c72a858f852b6112da3fda7829bf5d825cd25b37"}}
actual:   {"path":"info/about.json","license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference","hashes":{"sha1":"75bee71c98128117d0a567f2ad35cd01f75750e0","sha256":"5f961516903bac3ca1dd9111c72a858f852b6112da3fda7829bf5d825cd25b37"}}
-------------------
expected: {"path":"info/recipe/meta.yaml","license":"BSD-3-Clause","hashes":{"sha1":"f1022538c9bd0fb683318f39954ae2a085d73a10","sha256":"3d2a25d96d805e0c5b0cab0615118d8bcb860ef92611b188534da32a301be623"}}
actual:   {"path":"info/recipe/meta.yaml","license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference","hashes":{"sha1":"f1022538c9bd0fb683318f39954ae2a085d73a10","sha256":"3d2a25d96d805e0c5b0cab0615118d8bcb860ef92611b188534da32a301be623"}}
-------------------
expected: {"path":"info/recipe/meta.yaml.template","license":"BSD-3-Clause","hashes":{"sha1":"4312867c86b5c46e98b65ed788975a530fd3236a","sha256":"8902b1e3e0205039794cd2702848b717055a0f5dbed0697249c8a4ddffc0543f"}}
actual:   {"path":"info/recipe/meta.yaml.template","license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference","hashes":{"sha1":"4312867c86b5c46e98b65ed788975a530fd3236a","sha256":"8902b1e3e0205039794cd2702848b717055a0f5dbed0697249c8a4ddffc0543f"}}
-------------------
expected: {"path":"lib/python3.6/site-packages/numpy/distutils/fcompiler/absoft.py","attributions":["Copyright Absoft Corporation","Copyright Absoft Corporation 1994-2002 Absoft Pro FORTRAN","Copyright Absoft Corporation 1994-1998 mV2 Cray Research, Inc. 1994-1996 CF90"],"hashes":{"sha1":"af8d91b136b5a80ae20f9a7245809be4cc852420","sha256":"00a6e3e6e1abf1da460cbcd12096dd5275d702d17fe64e09aa7ab04d6bf2fad4"}}
actual:   {"path":"lib/python3.6/site-packages/numpy/distutils/fcompiler/absoft.py","attributions":["Copyright Absoft Corporation","Copyright Absoft Corporation 1994-2002 Absoft Pro FORTRAN","Copyright Absoft Corporation 1994-1998 mV2 Cray Research, Inc."],"hashes":{"sha1":"af8d91b136b5a80ae20f9a7245809be4cc852420","sha256":"00a6e3e6e1abf1da460cbcd12096dd5275d702d17fe64e09aa7ab04d6bf2fad4"}}
-------------------
expected: {"path":"lib/python3.6/site-packages/numpy/f2py/f2py2e.py","license":"BSD-3-Clause AND NOASSERTION","attributions":["Copyright 1999 2011 Pearu Peterson","Copyright 1999 - 2011 Pearu Peterson"],"hashes":{"sha1":"a6c6f2bbc8cd3bed85610cf122cd6264c949dae3","sha256":"c3dcd2246ded9c23323ab81926a8598845280279c8ee853ad64619cefb0b75fa"}}
actual:   {"path":"lib/python3.6/site-packages/numpy/f2py/f2py2e.py","license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference","attributions":["Copyright 1999-2011 Pearu Peterson","Copyright 1999 - 2011 Pearu Peterson"],"hashes":{"sha1":"a6c6f2bbc8cd3bed85610cf122cd6264c949dae3","sha256":"c3dcd2246ded9c23323ab81926a8598845280279c8ee853ad64619cefb0b75fa"}}
-------------------
expected: {"path":"lib/python3.6/site-packages/numpy/f2py/setup.py","license":"BSD-3-Clause AND NOASSERTION","attributions":["Copyright 2001-2005 Pearu Peterson"],"hashes":{"sha1":"0f3d561e9548e842b8694b5fa479ebe718245ce1","sha256":"a8d088a913dca445212418e286d11711ee088a5e170d8551008fec666ef16613"}}
actual:   {"path":"lib/python3.6/site-packages/numpy/f2py/setup.py","license":"BSD-3-Clause AND LicenseRef-scancode-free-unknown","attributions":["Copyright 2001-2005 Pearu Peterson"],"hashes":{"sha1":"0f3d561e9548e842b8694b5fa479ebe718245ce1","sha256":"a8d088a913dca445212418e286d11711ee088a5e170d8551008fec666ef16613"}}
-------------------
expected: {"path":"lib/python3.6/site-packages/numpy/f2py/__pycache__/f2py2e.cpython-36.pyc","license":"BSD-3-Clause AND NOASSERTION","attributions":["Copyright 1999 2011 Pearu Peterson","Copyright 1999 - 2011 Pearu Peterson"],"hashes":{"sha1":"13f8ab8f760195b5599f66f4be8c8381f68ecad8","sha256":"50297551bfc28e1e9d91879accc23544a05b2446f2f121ee32dc30acc87a8fa0"}}
actual:   {"path":"lib/python3.6/site-packages/numpy/f2py/__pycache__/f2py2e.cpython-36.pyc","license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference","attributions":["Copyright 1999-2011 Pearu Peterson","Copyright 1999 - 2011 Pearu Peterson"],"hashes":{"sha1":"13f8ab8f760195b5599f66f4be8c8381f68ecad8","sha256":"50297551bfc28e1e9d91879accc23544a05b2446f2f121ee32dc30acc87a8fa0"}}
-------------------
expected: {"path":"lib/python3.6/site-packages/numpy/f2py/__pycache__/setup.cpython-36.pyc","license":"BSD-3-Clause AND NOASSERTION","attributions":["Copyright 2001-2005 Pearu Peterson"],"hashes":{"sha1":"f5b2d8b039f675eb7b28c52b936a39c092832f61","sha256":"0c23abb7e046eb20beab087ae9d791a957fc553c191811c94c6ada2d08121a21"}}
actual:   {"path":"lib/python3.6/site-packages/numpy/f2py/__pycache__/setup.cpython-36.pyc","license":"BSD-3-Clause AND LicenseRef-scancode-free-unknown","attributions":["Copyright 2001-2005 Pearu Peterson"],"hashes":{"sha1":"f5b2d8b039f675eb7b28c52b936a39c092832f61","sha256":"0c23abb7e046eb20beab087ae9d791a957fc553c191811c94c6ada2d08121a21"}}
-------------------
expected: {"path":"lib/python3.6/site-packages/numpy-1.16.6.dist-info/METADATA","license":"BSD-3-Clause AND NOASSERTION","hashes":{"sha1":"854d9701eb6441931a7916c8780a5e74bedd5831","sha256":"f8f6b36613e999ecc1fe61cea6ba132d66708aeb7c132c69ce587a0fd25f1b9b"}}
actual:   {"path":"lib/python3.6/site-packages/numpy-1.16.6.dist-info/METADATA","license":"BSD-3-Clause AND LicenseRef-scancode-free-unknown","hashes":{"sha1":"854d9701eb6441931a7916c8780a5e74bedd5831","sha256":"f8f6b36613e999ecc1fe61cea6ba132d66708aeb7c132c69ce587a0fd25f1b9b"}} 
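
A comparison like the one above can be summarized by separating license differences from attribution differences per file. The following is a minimal sketch, assuming each entry is a dict with 'path' and optional 'license' and 'attributions' keys as in the output above. 'diff_files' is a hypothetical helper, not part of the integration tests, and the sample entries are abbreviated from the comparison (hashes omitted).

```python
def diff_files(expected, actual):
    """Return (paths with license changes, paths with attribution changes)."""
    actual_by_path = {entry["path"]: entry for entry in actual}
    license_changed, attribution_changed = [], []
    for exp in expected:
        act = actual_by_path.get(exp["path"])
        if act is None:
            continue  # file missing from actual; not handled in this sketch
        if exp.get("license") != act.get("license"):
            license_changed.append(exp["path"])
        if exp.get("attributions") != act.get("attributions"):
            attribution_changed.append(exp["path"])
    return license_changed, attribution_changed

expected = [
    {"path": "info/about.json", "license": "BSD-3-Clause"},
    {"path": "lib/python3.6/site-packages/numpy/f2py/setup.py",
     "license": "BSD-3-Clause AND NOASSERTION",
     "attributions": ["Copyright 2001-2005 Pearu Peterson"]},
]
actual = [
    {"path": "info/about.json",
     "license": "BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference"},
    {"path": "lib/python3.6/site-packages/numpy/f2py/setup.py",
     "license": "BSD-3-Clause AND LicenseRef-scancode-free-unknown",
     "attributions": ["Copyright 2001-2005 Pearu Peterson"]},
]

license_changed, attribution_changed = diff_files(expected, actual)
print(len(license_changed), len(attribution_changed))
```

Counting the two kinds of difference separately makes it easier to tell license regressions apart from copyright-text changes such as the ones in absoft.py and f2py2e.py.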

3) pypi/pypi/-/sdbus/0.12.0 - We are discussing raising a ticket with ScanCode about its license findings for this coordinate.
4) pod/cocoapods/-/SoftButton/0.1.0 – The new code detects a license in the Readme.MD file, which the earlier version did not detect.
5) crate/cratesio/-/ratatui/0.26.0 – Test case fails due to a change in the repo namespace. Everything else works as before.
6) npm/npmjs/-/redis/0.1.0 – The declared license is now populated, a notice is generated, and the scores improved.
7) Nuget.Protocol/6.7.1 – NOASSERTION and ECL have been taken care of. Test case fails due to a change in the score.
8) deb/debian/-/mini-httpd/1.30-0.2_arm64 – Passed
9) debsrc/debian/-/mini-httpd/1.30-0.2 – Passed
10) pod/cocoapods/-/xcbeautify/0.9.1 – Passed
11) maven/mavencentral/org.apache.httpcomponents/httpcore/4.4.16 – Passed
12) maven/mavengoogle/android.arch.lifecycle/common/1.0.1 – Passed
13) go/golang/rsc.io/quote/v1.3.0 – Passed
14) composer/packagist/symfony/polyfill-mbstring/v1.28.0 – Passed
15) gem/rubygems/-/sorbet/0.5.11226 – Passed
16) git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641 – Passed

Here are the code changes related to this - https://github.com/yashkohli88/service/pull/5

In my opinion, in the 'LicenseRef-scancode-unknown-license-reference' cases the license match is triggered specifically by the 'License' keyword present in those files.

yashkohli88 commented 2 months ago

Most of the differences have occurred due to the presence of the 'License' keyword in a file. The new ScanCode reports 'LicenseRef-scancode-unknown-license-reference' whenever a license keyword is found in the file. I have observed this behavior in both of the failed scenarios above. The attached screenshots show the 'matched_text' field from the ScanCode results, which contains the text where the match was found.
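
One way to check this hypothesis is to collect the 'matched_text' of every unknown-license-reference match in a ScanCode result and see whether each one contains the word 'license'. The following is a minimal sketch: the nested layout ('files', then 'license_detections', then 'matches') is an assumption about the ScanCode v32 JSON output and may need adjusting, and 'unknown_reference_texts' is a hypothetical helper. The match fields are the ones shown in the ScanCode result quoted earlier in this thread.

```python
UNKNOWN_REF = "LicenseRef-scancode-unknown-license-reference"

def unknown_reference_texts(scan_result):
    """Collect (path, matched_text) for each unknown-license-reference match."""
    found = []
    for file_entry in scan_result.get("files", []):
        for detection in file_entry.get("license_detections", []):
            for match in detection.get("matches", []):
                if match.get("license_expression_spdx") == UNKNOWN_REF:
                    found.append((file_entry.get("path"),
                                  match.get("matched_text", "")))
    return found

# Tiny sample built from the PKG-INFO match quoted earlier in this thread.
sample = {
    "files": [{
        "path": "platformdirs-4.2.0/PKG-INFO",
        "license_detections": [{
            "matches": [{
                "license_expression_spdx": UNKNOWN_REF,
                "score": 100,
                "matched_text": "License-File: LICENSE",
            }],
        }],
    }],
}

texts = unknown_reference_texts(sample)
print(all("license" in text.lower() for _, text in texts))
```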

'pypi/pypi/-/platformdirs/4.2.0' - 'LicenseRef-scancode-unknown-license-reference' is reported in the discovered licenses.

image

'conda/conda-forge/linux-aarch64/numpy/1.16.6-py36hdc1b780_0' -

Difference 1 - "path":"info/about.json" - Expected - "license":"BSD-3-Clause" Actual - "license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference". LicenseRef-scancode-unknown-license-reference is detected because of the keyword 'License.txt'. This can be verified from the screenshot below.

image

Difference 2 - "path":"info/recipe/meta.yaml" - Expected - "license":"BSD-3-Clause" Actual - "license":"BSD-3-Clause AND LicenseRef-scancode-unknown-license-reference"

image

qtomlinson commented 1 month ago

@yashkohli88 Thanks for the detailed explanation! I have summarized the findings from adding the filtering below:

Pros:

  1. Fixed 1 out of 2 license detection differences in nuget/nuget/-/NuGet.Protocol/6.7.1.
  2. Reduced the number of license detection differences for conda/conda-forge/linux-aarch64/numpy/1.16.6-py36hdc1b780_0.
    • Prior to filtering, license detection differences were observed in 12 files.
    • After filtering was added, this number was reduced to 9.
  3. Fixed the license detection difference in git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641. The definition is now the same as in the production deployment.

Cons:

  1. Regression: license detection for the file platformdirs-4.2.0/PKG-INFO in pypi/pypi/-/platformdirs/4.2.0 now includes LicenseRef-scancode-unknown-license-reference.
qtomlinson commented 1 month ago

As per our discussion, we need to update the fixtures and track the cases with regressions in documentation in the operation repo.