clearlydefined / service

The service side of clearlydefined.io
MIT License

Add new summarizer for recent ScanCode versions #1056

Closed lumaxis closed 3 months ago

lumaxis commented 7 months ago

To unblock https://github.com/clearlydefined/crawler/issues/502, the service needs to be ready to process files put out by newer ScanCode versions.

ScanCode major versions 31 and 32 introduced pretty drastic changes to the output format, which required significant changes to our summarizing logic. Rather than add further special cases that would have complicated the existing code even more, I opted to add a separate file that exclusively handles the new format.

qtomlinson commented 5 months ago

There is a test case available at this link. In this test case, v30 ScanCode identified two occurrences of 'NOASSERTION' ('unknown-license-reference') for the following files:

However, v3 Scancode did not report any license findings for these two files. It is worth noting that '.a' files are typically object files or static libraries used in Unix-like operating systems. This raises the question of whether the license detection in v30 ScanCode is a regression. Additionally, it would be interesting to know how v32 ScanCode performs in this regard.

lumaxis commented 5 months ago

@qtomlinson I pushed the package you mentioned as an additional test case. It looks like e.g. com/sun/jna/aix-ppc/libjnidispatch.a still gets reported with unknown-license-reference. However, the overall detected license has changed slightly: https://github.com/clearlydefined/service/pull/1056/files#diff-47fcf272ac39f44f87c22747ea0503417c484186c4e0296f84f3f8e55ab8c1f7R98

Jeffrey-Luszcz commented 4 months ago

Going to the CD link to the .a files, it points us to jna-5.6.0-sources.jar, though when I download it I don't see the .a files; I only see them in the regular jna-5.6.0.jar file.

I ran strings and grep on the .a files and don't see anything that screams out a license reference to me (though I see plenty of strings, and a bunch of GNU references, but they seem API related, not license related).

Do we know what unknown-license-reference string it's finding? If not, I can try running ScanCode on it locally.

https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.6.0/

I think this is potentially a false positive, though I'd like to see scancode output for sure

In other .a files we might find true positives (the FFmpeg .a files, for example, might show this).

qtomlinson commented 4 months ago

> Do we know what unknown-license-reference string it's finding? If not, I can try running ScanCode on it locally.

Scancode output can be found here. "matched_text": "freeware/" for com/sun/jna/aix-ppc/libjnidispatch.a

Jeffrey-Luszcz commented 3 months ago

Thanks for the pointer to the scancode output. The bare 'freeware' text is coming from an include filepath from the source tree used to build the .a file

/home/0/freeware/bin/../lib/gcc/powerpcibmaix7.1.0.0/4.6.3:/home/0/freeware/bin/../lib/gcc:/home/0/freeware/bin/../lib/gcc/powerpc-ibm-aix7.1.0.0/4.6.3/../../..:/usr/lib:/lib
/opt/freeware/src/packages/BUILD/gcc-build-4.6.3/./gcc/include/unwind.h

"/opt/freeware/src/packages/BUILD" is a special historical file path used on AIX to hold open source for building purposes: https://community.ibm.com/community/user/power/discussion/purpose-of-optfreewaresrcpackagesbuild

In this case "freeware" is a misnomer; they really mean "open source". The use of "freeware" in this embedded string can be ignored, since it's not really telling us that jna.a is "freeware".

That said, "gcc-build-4.6.3/./gcc/include/unwind.h", which is a dependency of this .a file, is not seen or scanned by ScanCode and is out of scope for the license results, but it might affect the final licensing of the .a file!

The unwind.h file is likely GPL v3 w/ GCC Runtime Library Exception, similar to the one seen here: https://github.com/far-far-away-science/hab-v2/blob/e8b63d4c9d4df487bb7d2cd0d6e10f092e20581d/software/archive/gcc/include/unwind.h#L4

Final thoughts: in the end, "freeware" is a false positive. A human curation might add "GNU GPL v3 w/ GCC Runtime Library Exception" in a deep audit.
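A rough way to triage matches like this one is to check whether the matched text only ever appears inside path-like strings. The sketch below is an illustration under assumptions: `only_in_paths` and `PATH_TOKEN` are hypothetical names, not part of ScanCode or ClearlyDefined, and the input is assumed to be lines such as those produced by running `strings` on the binary.

```python
import re
from typing import List

# Hypothetical heuristic: a token counts as path-like if it contains at
# least one "/" joining runs of typical path characters.
PATH_TOKEN = re.compile(r"[\w.+-]*(?:/[\w.+-]*)+")


def only_in_paths(matched_text: str, lines: List[str]) -> bool:
    """Return True if matched_text occurs at least once and every
    occurrence lies inside a path-like token."""
    needle = matched_text.strip("/")
    if not needle:
        return False
    found = False
    for line in lines:
        path_spans = [(t.start(), t.end()) for t in PATH_TOKEN.finditer(line)]
        for hit in re.finditer(re.escape(needle), line):
            found = True
            inside = any(s <= hit.start() and hit.end() <= e for s, e in path_spans)
            if not inside:
                return False
    return found


# Path taken from the strings output quoted above.
sample = ["/opt/freeware/src/packages/BUILD/gcc-build-4.6.3/./gcc/include/unwind.h"]
print(only_in_paths("freeware/", sample))  # True
```

A match that passes this check is only a candidate false positive; a human would still need to confirm it, as was done here.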

lumaxis commented 3 months ago

Leaving a note here with this run of the integration tests with both of my crawler and service branches deployed: https://github.com/clearlydefined/operations/actions/runs/9083092159/job/24961014518

There's a couple of failures but as far as I can tell, all are expected 🙏🏼

elrayle commented 3 months ago

> There's a couple of failures but as far as I can tell, all are expected

Will https://github.com/clearlydefined/operations/pull/76 impact the failing integration tests?

qtomlinson commented 3 months ago

> Will clearlydefined/operations#76 impact the failing integration tests?

The tests were run with clearlydefined/operations#76; otherwise, the tests would fail at harvest. This is also what 'Add auto detect schema versions', which is currently in review, is trying to address.

qtomlinson commented 3 months ago

> There's a couple of failures but as far as I can tell, all are expected 🙏🏼

Thanks for the log! I have looked through the logs and summarized them as the following cases:

  1. crate/cratesio/-/ratatui/0.26.0: missing field in production (projectWebsite: 'https://ratatui.rs'). This has been fixed by my recent PR.

  2. Different scoring reported for pypi/pypi/-/platformdirs/4.2.0:

    • new scancode summarizer: licensed.toolScore.discovered: 0
    • production: licensed.toolScore.discovered: 1

    Need to look into the definition detail to see why the score is different.

  3. File licenses are different for the following 3 cases (differences extracted from the log):

    Need to confirm which one is correct.

  4. Different declared license:

    • npm/npmjs/-/redis/0.1.0
      • declared license is MIT from the new scancode summarizer, which causes a score change
      • declared license is empty in production
    • pypi/pypi/-/sdbus/0.12.0
      • new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
      • production, declared: 'GPL-2.0 AND GPL-2.0-only AND GPL-3.0-or-later AND LGPL-2.1-only'

Once we have confirmed that the new declared licenses, file licenses, and scoring are correct, we can update the comparison by uploading the fixtures.

@Jeffrey-Luszcz , could you help us verify differences in file license (point 3) and declared license (point 4)?

qtomlinson commented 3 months ago

> Different scoring reported for pypi/pypi/-/platformdirs/4.2.0

This is due to the difference in copyright detection between ScanCode v32 and ScanCode v30: the 30.3.0.json result detects copyrights in platformdirs-4.2.0/LICENSE, while the 32.3.0.json result does not.

@Jeffrey-Luszcz Is this a regression that needs to be reported to ScanCode?

Jeffrey-Luszcz commented 3 months ago

> 3 [git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641]

The deny.toml file in this result might be something we should EXCLUDE from scans since it does not represent actual license content.

deny.toml is a config file for a testing tool https://github.com/EmbarkStudios/cargo-deny and exists to create allow/deny lists for licenses used by components in the dependency list.

git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641, License is different for the file below,
expected: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}
actual: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND Unicode-DFS-2016 AND WTFPL","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}

In this case the deny.toml file contains the following license strings: "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "MIT", "Unicode-DFS-2016", "WTFPL".

The scan results are better now, but they are still missing ISC and MIT, which appear in the same deny.toml section as the other license names.
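The comparison above can be reproduced with simple set arithmetic. This is only an illustrative sketch: `expression_ids` is a hypothetical helper that naively splits a flat SPDX expression and is not a full SPDX parser (an exception name following WITH would be counted as an identifier). The license strings are the ones quoted in this comment.

```python
import re
from typing import Set

# License strings listed in deny.toml, as quoted above.
listed_in_file = {
    "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause",
    "ISC", "MIT", "Unicode-DFS-2016", "WTFPL",
}


def expression_ids(expression: str) -> Set[str]:
    """Naively split an SPDX expression into identifiers (not a real parser)."""
    tokens = re.split(r"[()\s]+", expression)
    return {t for t in tokens if t and t not in {"AND", "OR", "WITH"}}


expected = expression_ids("Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause")
actual = expression_ids(
    "Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND Unicode-DFS-2016 AND WTFPL"
)

print(sorted(actual - expected))        # ['Unicode-DFS-2016', 'WTFPL']
print(sorted(listed_in_file - actual))  # ['ISC', 'MIT']
```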

qtomlinson commented 2 months ago

3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test

  • expected: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}
  • actual: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0 AND (ECL-2.0 AND Apache-2.0)","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}

The clearlydefined/downloaded/LICENSE is the license obtained from https://licenses.nuget.org/Apache-2.0 (licenseUrl from the component manifest). The difference in licenses reported for this file lies in how CD interprets the raw ScanCode results in the new (actual) and legacy (expected) summarizers.

In both the 30.3.0.json and 32.3.0.json ScanCode results, there is a match for "ECL 2.0" with a score of around 48% for the file:

Jeffrey-Luszcz commented 2 months ago

> 3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test

> When using the v32 result, the detected_license_expression_spdx from ScanCode is directly utilized, which results in the license being reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". It appears that this detected_license_expression_spdx includes license findings below a score of 80. @Jeffrey-Luszcz @lumaxis @elrayle The question arises whether this inclusion of license findings with a lower matching score is the desired behavior.

I would consider the "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)" result to be incorrect. This license text is the Apache 2.0 text, with possibly some NuGet doc boilerplate like:

Notes
This license was released January 2004

SPDX web page
https://spdx.org/licenses/Apache-2.0.html
Notice
This license content is provided by the [SPDX project](https://spdx.dev/). For more information about licenses.nuget.org, see [our documentation](https://aka.ms/licenses.nuget.org).

Data pulled from [spdx/license-list-data](https://github.com/spdx/license-list-data) on February 9, 2023.

NuGet.Protocol/6.7.1 does NOT contain the ECL 2.0 as an option for the package, only AL 2.0. The ECL 2.0 is a modified version of the AL 2.0 with ADDITIONAL text in the Patent Clause. ScanCode should be able to differentiate between two variants of a similar license.

The three "licenses" that we are thinking about here are a pure Apache 2.0, the NuGet Apache 2.0 file with some additional info at the bottom and top, and a "pure" ECL 2.0 license text (where B represents the original patent clause, B' the modified patent clause, and D is the NuGet text talking about SPDX from the block above):

| Apache 2.0 | NuGet | ECL 2.0 |
| --- | --- | --- |
| A | A | A |
| B | B | B' |
| C | C | C |
|  | D |  |

The noise of adding "OR (AL 2.0 or ECL 2.0)" is pretty bad and not user friendly, especially for a license like the ECL 2.0, which is seen in only a handful of packages. The AL 2.0 is somewhere around the 3rd most popular license; I'd be worried if they all started getting reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)".

I wonder if ScanCode is seeing a license that it thinks might be a superset of the Apache 2.0 because of the "D" text in the NuGet license file, and is thus returning a bunch of possibilities it wouldn't if it had a purer Apache 2.0 text. It still should realize that the ECL 2.0 is a special subset or modified version of the Apache 2.0, in my opinion.

It would be worth seeing whether we start getting noisy "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)" results for things we expect to be pure Apache 2.0 due to a change in ScanCode, or if this is a special case due to the NuGet SPDX noise.
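To make the score question concrete, here is a minimal sketch of what filtering by match score could look like. It rests on assumptions: the `matches` shape (dicts with `spdx` and `score` keys) is a simplification invented for illustration and is not ScanCode's actual output schema, `summarize` is a hypothetical function rather than the summarizer in this PR, the 80 threshold is the figure mentioned in the quoted comment, and the scores of 100 are made up (only the roughly 48% ECL-2.0 score comes from this thread).

```python
from typing import Dict, List, Optional


def summarize(matches: List[Dict], min_score: float = 80.0) -> Optional[str]:
    """Join the distinct licenses whose match score meets the threshold."""
    kept: List[str] = []
    for match in matches:
        if match["score"] >= min_score and match["spdx"] not in kept:
            kept.append(match["spdx"])
    return " AND ".join(kept) if kept else None


matches = [
    {"spdx": "Apache-2.0", "score": 100.0},  # illustrative score
    {"spdx": "ECL-2.0", "score": 48.0},      # roughly the score reported above
    {"spdx": "Apache-2.0", "score": 100.0},  # illustrative score
]

print(summarize(matches))                 # Apache-2.0
print(summarize(matches, min_score=0.0))  # Apache-2.0 AND ECL-2.0
```

Whether dropping low-score matches is the right behavior is exactly the open question in this thread; the sketch only shows how much the reported expression depends on that choice.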

qtomlinson commented 1 month ago
  • pod/cocoapods/-/SoftButton/0.1.0, 1 file license is different.
    • expected: {"path":"README.md","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}
    • actual: {"path":"README.md","license":"MIT","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}

Need to confirm which one is correct.

  1. Different declared license:

  • npm/npmjs/-/redis/0.1.0
    • declared license is MIT from the new scancode summarizer and causing score change.
    • declared code is empty in production
  • pypi/pypi/-/sdbus/0.12.0
    • new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
    • fixture pre v32, declared: 'GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-or-later'

@Jeffrey-Luszcz Could you kindly verify the validity of these license differences and confirm that the new version is the correct one?

Jeffrey-Luszcz commented 1 month ago

> pypi/pypi/-/sdbus/0.12.0
> new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
> production, declared: 'GPL-2.0 AND GPL-2.0-only AND GPL-3.0-or-later AND LGPL-2.1-only'

SDBus explicitly says it's LGPL 2.1 in its README. This ScanCode license string seems overly complicated for a simple LGPL 2.1 declaration, likely due to the GPL and LGPL license texts found at the top level. I would make the case that we either do a curation for this component or discuss how to handle the LGPL-declared case where a GPL file is shipped along with the LGPL file.