aboutcode-org / scancode-toolkit

Multiple LicenseID in SPDX #3258

Open vargenau opened 1 year ago

vargenau commented 1 year ago

Description

SPDX standard states that "This identifier shall be unique within the SPDX document". https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/

In the attached SPDX file, some license ids are reported multiple times:

grep LicenseID phpwiki.spdx.txt | sort | uniq -c
      1 LicenseID: LicenseRef-scancode-bsd-unmodified
      1 LicenseID: LicenseRef-scancode-commercial-license
      1 LicenseID: LicenseRef-scancode-free-unknown
      1 LicenseID: LicenseRef-scancode-mysql-linking-exception-2018
      5 LicenseID: LicenseRef-scancode-other-permissive
     20 LicenseID: LicenseRef-scancode-php-2.0.2
     15 LicenseID: LicenseRef-scancode-proprietary-license
      3 LicenseID: LicenseRef-scancode-public-domain
     23 LicenseID: LicenseRef-scancode-unknown-license-reference
      3 LicenseID: LicenseRef-scancode-unknown-spdx
      1 LicenseID: LicenseRef-scancode-warranty-disclaimer
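The same count can be reproduced in Python, which is handy when the check needs to run in a test. This is a minimal sketch: it assumes every definition starts with a `LicenseID:` line, and the helper name `duplicate_license_ids` is hypothetical, not part of ScanCode.

```python
from collections import Counter


def duplicate_license_ids(tag_value_text):
    """Return {license_id: count} for every LicenseID declared more than once."""
    counts = Counter(
        line.split(":", 1)[1].strip()
        for line in tag_value_text.splitlines()
        if line.startswith("LicenseID:")
    )
    return {license_id: n for license_id, n in counts.items() if n > 1}


# Hypothetical, shortened sample; a real run would read phpwiki.spdx.txt instead.
sample = """\
LicenseID: LicenseRef-scancode-free-unknown
LicenseID: LicenseRef-scancode-unknown-spdx
LicenseID: LicenseRef-scancode-unknown-spdx
"""
print(duplicate_license_ids(sample))
# prints {'LicenseRef-scancode-unknown-spdx': 2}
```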

How To Reproduce

svn checkout https://svn.code.sf.net/p/phpwiki/code/trunk phpwiki
./scancode -c -l -i --license-text --spdx-tv phpwiki.spdx phpwiki

Resulting SPDX file:

phpwiki.spdx.txt

System configuration

./scancode --version
ScanCode version: 32.0.0rc1
ScanCode Output Format version: 3.0.0
SPDX License list version: 3.19

Ubuntu 22.10

vargenau commented 1 year ago

The validator should now flag this. See https://github.com/spdx/spdx-java-tagvalue-store/issues/42 and https://github.com/spdx/spdx-java-tagvalue-store/pull/43

pombredanne commented 1 year ago

Actually we are using an SPDX namespace for our licenses, meaning these "LicenseRef-scancode" ids are as stable as the SPDX ids themselves and should be treated the same.

vargenau commented 1 year ago

Hi Philippe,

There are in fact two cases.

For LicenseRef-scancode-php-2.0.2, the SPDX file contains the exact same text 20 times:

LicenseID: LicenseRef-scancode-php-2.0.2
LicenseName: PHP License 2.0.2
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/php-2.0.2.yml
</text>
ExtractedText: <text>// | This source file is subject to version 2.0 of the PHP license,       |
// | that is bundled with this package in the file LICENSE, and is        |
// | available at through the world-wide-web at                           |
// | http://www.php.net/license/2_02.txt.                                 |
// | If you did not receive a copy of the PHP license and are unable to   |
// | obtain it through the world-wide-web, please send a note to          |
// | license@php.net so we can mail you a copy immediately.               |</text>

It should be present only once. It is the definition of LicenseRef-scancode-php-2.0.2, so there is no need to repeat it.

For LicenseRef-scancode-unknown-spdx, you have:

LicenseID: LicenseRef-scancode-unknown-spdx
LicenseName: Unknown SPDX license detected but not recognized
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-spdx.yml
</text>
ExtractedText:  * SPDX-License-Identifier: Artistic-1.0+

and also

LicenseID: LicenseRef-scancode-unknown-spdx
LicenseName: Unknown SPDX license detected but not recognized
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-spdx.yml
</text>
ExtractedText: * Adding SPDX-License-Identifier in PHP source files

This is not correct: you have two contradictory definitions of the same LicenseID, and you cannot tell which definition relates to which file.
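The two cases can be told apart mechanically by grouping the definitions by LicenseID and comparing their extracted texts. The sketch below is an illustration only: `parse_extracted_licenses` and `classify_repeats` are hypothetical helpers, the parser handles just the `LicenseID:` and `ExtractedText:` tags, and the sample data is made up in the shape of the excerpts above.

```python
from collections import defaultdict


def parse_extracted_licenses(tag_value_text):
    """Yield (license_id, extracted_text) for each LicenseID block."""
    license_id, text_lines, in_text = None, [], False
    for line in tag_value_text.splitlines():
        if in_text:
            # Inside a multi-line <text>...</text> value of ExtractedText.
            text_lines.append(line)
            in_text = "</text>" not in line
        elif line.startswith("LicenseID:"):
            if license_id is not None:
                yield license_id, "\n".join(text_lines)
            license_id, text_lines = line.split(":", 1)[1].strip(), []
        elif line.startswith("ExtractedText:") and license_id is not None:
            value = line.split(":", 1)[1].strip()
            text_lines.append(value)
            in_text = "<text>" in value and "</text>" not in value
    if license_id is not None:
        yield license_id, "\n".join(text_lines)


def classify_repeats(tag_value_text):
    """Split repeated ids into redundant copies and conflicting definitions."""
    texts_by_id = defaultdict(set)
    counts = defaultdict(int)
    for license_id, text in parse_extracted_licenses(tag_value_text):
        texts_by_id[license_id].add(text)
        counts[license_id] += 1
    redundant = sorted(
        i for i in counts if counts[i] > 1 and len(texts_by_id[i]) == 1
    )
    conflicting = sorted(i for i in counts if len(texts_by_id[i]) > 1)
    return redundant, conflicting


# Hypothetical sample: one id repeated verbatim, one id with two different texts.
sample = """\
LicenseID: LicenseRef-scancode-public-domain
ExtractedText: <text>Public domain
notice</text>
LicenseID: LicenseRef-scancode-public-domain
ExtractedText: <text>Public domain
notice</text>
LicenseID: LicenseRef-scancode-unknown-spdx
ExtractedText:  * SPDX-License-Identifier: Artistic-1.0+
LicenseID: LicenseRef-scancode-unknown-spdx
ExtractedText: * Adding SPDX-License-Identifier in PHP source files
"""
redundant, conflicting = classify_repeats(sample)
print("redundant:", redundant)
print("conflicting:", conflicting)
```

Ids in the first list only need deduplication; ids in the second list need distinct identifiers.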

You should have something like:

# File

FileName: ./phpwiki/lib/HttpClient.php
SPDXID: SPDXRef-83
FileChecksum: SHA1: 99985858f0a2d539954e5bc6525892a6d6086ab9
LicenseConcluded: NOASSERTION
LicenseInfoInFile: LicenseRef-scancode-unknown-spdx-1
FileCopyrightText: <text>Copyright (c) 2003 Simon Willison, Incutio Limited
Copyright (c) 2004,2006-2007 Reini Urban
</text>

# File

FileName: ./phpwiki/locale/it/pgsrc/NoteDiRilascio
SPDXID: SPDXRef-636
FileChecksum: SHA1: 1d528511bfc1256c544321d1950fb06319ef0f9f
LicenseConcluded: NOASSERTION
LicenseInfoInFile: GPL-2.0-only
LicenseInfoInFile: LicenseRef-scancode-unknown-license-reference
LicenseInfoInFile: LicenseRef-scancode-unknown-spdx-2
FileCopyrightText: NONE

LicenseID: LicenseRef-scancode-unknown-spdx-1
LicenseName: Unknown SPDX license detected but not recognized
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-spdx.yml
</text>
ExtractedText:  * SPDX-License-Identifier: Artistic-1.0+

and

LicenseID: LicenseRef-scancode-unknown-spdx-2
LicenseName: Unknown SPDX license detected but not recognized
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-spdx.yml
</text>
ExtractedText: * Adding SPDX-License-Identifier in PHP source files
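One possible way to produce identifiers like these is sketched below. It is not ScanCode's implementation: `assign_unique_ids` is a hypothetical helper that keeps an id unchanged when all its extracted texts are identical and appends `-1`, `-2`, ... (in first-seen order) when they differ. The second sample text for the PHP license is a placeholder.

```python
def assign_unique_ids(definitions):
    """Map each (license_id, extracted_text) pair to a document-unique id."""
    # Collect the distinct texts seen for each id, preserving first-seen order.
    distinct = {}
    for license_id, text in definitions:
        texts = distinct.setdefault(license_id, [])
        if text not in texts:
            texts.append(text)

    unique = {}
    for license_id, texts in distinct.items():
        for n, text in enumerate(texts, start=1):
            # Only ids with several different texts get a numeric suffix.
            suffix = f"-{n}" if len(texts) > 1 else ""
            unique[(license_id, text)] = license_id + suffix
    return unique


definitions = [
    ("LicenseRef-scancode-unknown-spdx", "* SPDX-License-Identifier: Artistic-1.0+"),
    ("LicenseRef-scancode-unknown-spdx", "* Adding SPDX-License-Identifier in PHP source files"),
    ("LicenseRef-scancode-php-2.0.2", "placeholder PHP header text"),
    ("LicenseRef-scancode-php-2.0.2", "placeholder PHP header text"),
]
for (license_id, text), new_id in assign_unique_ids(definitions).items():
    print(new_id, "<-", text)
```

Each file's `LicenseInfoInFile` would then reference the id mapped from its own extracted text. A real implementation would also need to check that a suffixed id does not collide with an id that already exists.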

vargenau commented 1 year ago

@pombredanne what do you think about these two cases?

vargenau commented 7 months ago

Bug still present in scancode-toolkit 32.1.0