anthonyharrison / sbom4python

A tool to generate a SBOM (Software Bill of Materials) for an installed Python module
Apache License 2.0

Invalid SPDX generated #6

Closed: vargenau closed this issue 1 year ago

vargenau commented 1 year ago

The SPDX file is in some cases invalid because of incorrect license identifiers.

scancode-toolkit.spdx.txt

Examples in the above scan:

PackageLicenseConcluded: Apache-2
PackageLicenseConcluded: ASL 2.0
PackageLicenseConcluded: BSD
PackageLicenseConcluded: LGPL
PackageLicenseConcluded: MIT/X

I understand the information is taken from package metadata that is not in SPDX format, but you should not output it as it is. Either you map it to a correct SPDX identifier, or you create a custom LicenseRef-.
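To illustrate the LicenseRef- option, here is a minimal sketch, not the tool's actual code. The function name `to_spdx_or_licenseref` and the tiny `KNOWN_SPDX_IDS` set are hypothetical; a real implementation would check against the full SPDX license list.

```python
import re

# Illustrative subset only (an assumption); the real SPDX license list is much longer.
KNOWN_SPDX_IDS = {"Apache-2.0", "MIT", "BSD-3-Clause", "LGPL-2.1-or-later"}


def to_spdx_or_licenseref(raw: str) -> str:
    """Return raw unchanged if it is a known SPDX ID, else a LicenseRef- identifier."""
    if raw in KNOWN_SPDX_IDS:
        return raw
    # A LicenseRef idstring may only contain letters, digits, "." and "-",
    # so every other run of characters is collapsed into a single "-".
    cleaned = re.sub(r"[^A-Za-z0-9.\-]+", "-", raw.strip()).strip("-")
    if not cleaned:
        raise ValueError("no usable license text to build a LicenseRef from")
    return f"LicenseRef-{cleaned}"
```

With this sketch, `ASL 2.0` would become `LicenseRef-ASL-2.0` and `MIT/X` would become `LicenseRef-MIT-X`, while a valid identifier such as `MIT` passes through untouched.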

anthonyharrison commented 1 year ago

@vargenau Thanks for the report. The aim was to identify any licences if they were included but, as you rightly point out, the license information is obtained from the metadata, and there are lots of issues with licenses in the metadata which need to be tidied up. Can I suggest you raise an issue with Scancode to update the licences to be correct SPDX identifiers?

Automatically mapping it to the 'correct' identifier isn't feasible. For example, what would LGPL map to: LGPL2 or LGPL2.1?

However, I could simply ignore the license if it isn't a valid SPDX ID and not include it (note that the NONE or NOASSERTION semantics do not cover invalid licences), but this seems wrong when the author has attempted to specify a license. Or I could create a custom LicenseRef as you suggest, but this seems to be hiding the issue.

I will have a think how best to proceed.

BTW, you could try using the --exclude-license option if there are lots of incorrect licenses.

vargenau commented 1 year ago

The report was produced for ScanCode, but the incorrect licenses are not from ScanCode but from dependencies. I have created some pull requests for them: https://github.com/kmike/text-unidecode/pull/12 https://github.com/pdfminer/pdfminer.six/pull/866 https://github.com/harlowja/fasteners/pull/104

I agree that LGPL cannot be automatically mapped, but Apache-2 and ASL 2.0 could.
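As a rough sketch of what I mean (the table and the name `map_alias` are hypothetical, not anything from sbom4python): only strings with a single plausible SPDX meaning get mapped, and ambiguous ones such as LGPL are deliberately left out.

```python
from typing import Optional

# Hypothetical alias table: only spellings with one plausible SPDX meaning.
# "LGPL" is intentionally absent because the version cannot be inferred.
UNAMBIGUOUS_ALIASES = {
    "apache-2": "Apache-2.0",
    "asl 2.0": "Apache-2.0",
}


def map_alias(raw: str) -> Optional[str]:
    """Return the SPDX ID for an unambiguous alias, or None if there is no safe mapping."""
    return UNAMBIGUOUS_ALIASES.get(raw.strip().lower())
```

Here `map_alias("Apache-2")` and `map_alias("ASL 2.0")` both give `Apache-2.0`, while `map_alias("LGPL")` gives `None`, so the caller has to decide what to do with it.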

anthonyharrison commented 1 year ago

@vargenau I have made a number of updates in the latest release (0.9.0) which hopefully should result in the generation of an SPDX document with valid licenses. Let me know if you have any issues.

vargenau commented 1 year ago

Hi @anthonyharrison Thank you for your quick fix! The SPDX code is now valid.

Two remarks:

In file cryptography, cryptography.spdx.txt

BSD-3-Clause or Apache-2.0 should be BSD-3-Clause OR Apache-2.0. Keywords are case-sensitive and must be in upper case.

In file chardet, chardet.spdx.txt

You guessed LGPL-2.0-or-later; it is in fact LGPL-2.1-or-later, but I do not know if it is easy to do better.
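For the first remark, the operator case could be normalised with something like the sketch below. The function name `uppercase_operators` is hypothetical, and the sketch assumes operators are always separated from identifiers by whitespace. It only touches whitespace-delimited words, so the "or" inside an identifier such as LGPL-2.0-or-later is left alone.

```python
import re

# Match and/or/with only when surrounded by whitespace, so that the "or"
# in identifiers like "LGPL-2.0-or-later" is never rewritten.
_OPERATORS = re.compile(r"(?<=\s)(and|or|with)(?=\s)", re.IGNORECASE)


def uppercase_operators(expression: str) -> str:
    """Upper-case the AND / OR / WITH operators in a license expression."""
    return _OPERATORS.sub(lambda m: m.group(1).upper(), expression)
```

For example, `uppercase_operators("BSD-3-Clause or Apache-2.0")` gives `BSD-3-Clause OR Apache-2.0`.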

anthonyharrison commented 1 year ago

@vargenau

Thanks for pointing out the error with the case of the boolean operators in the license expression. I will work on a fix for this although I note that the latest version of the cryptography module (40.0.1) appears to be correct (and the license has changed).

I was advised that LGPL is assumed to mean LGPL-2.0-or-later. Given the 'error' in this assumption for chardet, the only way to fix this is to ensure chardet specifies the correct license in its metadata.