aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Improve quality and tracing of license detection in Debian copyright files #2390

Open pombredanne opened 3 years ago

pombredanne commented 3 years ago
  1. we should be able to recover from mostly OK but incorrect copyright files such as this one: https://metadata.ftp-master.debian.org/changelogs//main/p/pulseaudio/pulseaudio_14.2-1_copyright (this may be a ticket for the debian-inspector debut library though). See https://github.com/nexB/debut/issues/6 "Recover parsing from almost machine-readable copyright files"

  2. we should have the ability to trace the intermediate detection results (see also #2389) for each paragraph of a copyright file

  3. we could establish a mapping of declared License "ids"

  4. there is an implicit notion of primary vs. secondary licenses in a copyright file and we should leverage this: a paragraph with "Files: *" applies to the package as a whole. This may mean a system-wide model change to track primary vs. secondary licenses, or the ability to track that in a license expression. See https://github.com/nexB/debut/issues/8 "Determine the primary license from a copyright file"
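As a rough illustration of item 4, here is a minimal sketch of how the declared license of the "Files: *" paragraph could be separated from the licenses of other paragraphs. It uses only the standard library and a made-up sample file; the function names `parse_paragraphs` and `split_primary_secondary` are hypothetical and are not part of ScanCode or debut, and the parsing is far simpler than what a real copyright file needs.

```python
def parse_paragraphs(text):
    """Split a machine-readable copyright file into a list of {field: value} dicts."""
    paragraphs = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
            current, last_field = {}, None
        elif line[0] in " \t" and last_field:
            # Continuation line: append to the previous field's value.
            current[last_field] += "\n" + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        paragraphs.append(current)
    return paragraphs


def split_primary_secondary(text):
    """Return (primary, secondary) lists of declared license short names.

    Assumption for this sketch: the license of a "Files: *" paragraph is
    primary; the licenses of all other Files paragraphs are secondary.
    """
    primary, secondary = [], []
    for para in parse_paragraphs(text):
        if "Files" not in para or "License" not in para:
            continue
        license_lines = para["License"].splitlines()
        if not license_lines or not license_lines[0].strip():
            continue
        # Only the first line of the License field holds the short name.
        declared = license_lines[0].strip()
        target = primary if para["Files"].split() == ["*"] else secondary
        if declared not in target:
            target.append(declared)
    return primary, secondary


# Made-up sample, not taken from any real package.
SAMPLE = """Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/

Files: *
Copyright: 2004 Example Author
License: LGPL-2.1+

Files: debian/*
Copyright: 2010 Example Packager
License: GPL-2+
"""

primary, secondary = split_primary_secondary(SAMPLE)
print("primary:", primary)
print("secondary:", secondary)
```

Keeping the two lists separate, rather than joining everything with AND, is one possible way to avoid losing the primary vs. secondary distinction before a model change is decided.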

pombredanne commented 3 years ago

From a chat with @chinyeungli

btw, just a note that these massive license_expression values may contain irrelevant info: for example, some of the gpl-2.0 was detected because the copyright file states that the Debian packaging is under gpl-2.0, while the primary component may not contain any GPL code (for instance, https://changelogs.ubuntu.com/changelogs/pool/universe/s/signon/signon_8.59+17.10.20170606-0ubuntu1/copyright )

pombredanne commented 3 years ago

From a chat with @JonoYang based on scanning an Ubuntu-based Docker image in https://github.com/nexB/scancode.io/ that contained https://packages.ubuntu.com/bionic-updates/gcc-7

the package gcc-7-base@7.5.0-3ubuntu1~18.04 has the license expression of:

agpl-3.0 AND amd-historical AND artistic-2.0 AND bsd-new AND bsd-no-disclaimer AND bsd-no-disclaimer-unmodified AND bsd-original AND bsd-original-uc AND bsd-original-uc-1986 AND bsd-simplified AND bsla AND d-zlib AND delorie-historical AND flex-2.5 AND gfdl-1.2 AND gpl-1.0-plus AND gpl-2.0 AND gpl-2.0-plus AND gpl-3.0 AND gpl-3.0-plus AND gpl-3.0-plus WITH gcc-exception-3.1 AND hs-regexp AND intel-osl-1989 AND intel-osl-1993 AND lgpl-2.0 AND lgpl-2.0-plus AND lgpl-2.1 AND lgpl-2.1-plus AND lgpl-3.0-plus WITH cygwin-exception-lgpl-3.0-plus AND mit AND newlib-historical AND nilsson-historical AND osf-1990 AND other-copyleft AND other-permissive AND public-domain AND sunpro AND tex-exception AND uoi-ncsa AND viewflow-agpl-3.0-exception AND warranty-disclaimer AND wide-license AND wtfpl-1.0 AND x11-hanson AND x11-lucent AND zlib AND zlib-acknowledgement AND (commercial-license OR proprietary-license)

I'm not sure how the agpl-3.0 detection happened. I looked in the scanpipe/scancode.io results for the Resources associated with the package gcc-7-base and I did not find any Resources attached to this package. I downloaded the copyright file for this package from Ubuntu (http://changelogs.ubuntu.com/changelogs/pool/main/g/gcc-7/gcc-7_7.5.0-3ubuntu1~18.04/copyright), scanned it, and agpl-3.0 is not detected as a license.

pombredanne commented 3 years ago

From a chat with @mjherzog based on scanning an Ubuntu-based Docker image in https://github.com/nexB/scancode.io/

We have a problem of license "proliferation" for some packages that we need to fix, especially for Debian system packages found in a Docker scan. One example is where we have the license expression:

agpl-3.0 AND agpl-3.0-plus AND bloomberg-blpapi AND gpl-1.0-plus AND gpl-2.0 AND gpl-2.0-plus AND lgpl-2.0-plus AND lgpl-2.1 AND lgpl-2.1-plus AND lgpl-3.0 AND mit AND other-permissive AND sun-rpc AND warranty-disclaimer

... for six files from pulseaudio (www.pulseaudio.org in Homepage URL).

I researched the Debian Copyright file from https://metadata.ftp-master.debian.org/changelogs//main/p/pulseaudio/pulseaudio_5.0-13_copyright and found:

  • Overall license is lgpl-2.1-plus (also what we have in DejaCode) and most Copyright entries say: "License: LGPL-2.1+"
  • I also see bloomberg-blpapi, mit, sun-rpc and warranty-disclaimer plus one file under gpl-2.0-plus

My guess is that there may be some sort of license detection bug for the agpl and the other gpl and lgpl versions.
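One way to reduce this kind of noise is the mapping of declared License "ids" suggested in the first comment: when a paragraph declares a known short name such as "LGPL-2.1+", it could be mapped directly to a license key instead of relying only on text detection. The sketch below is illustrative only; the table `DECLARED_TO_SCANCODE` and the function `map_declared_license` are hypothetical names, and the handful of entries shown are an assumption about what such a table might contain, not an existing ScanCode data set.

```python
# Hypothetical, deliberately tiny mapping of Debian short names (lowercased)
# to ScanCode license keys. A real table would be much larger.
DECLARED_TO_SCANCODE = {
    "lgpl-2.1+": "lgpl-2.1-plus",
    "gpl-2+": "gpl-2.0-plus",
    "gpl-2": "gpl-2.0",
    "expat": "mit",
}


def map_declared_license(declared):
    """Return the mapped license key, or None when the short name is unknown.

    Returning None lets the caller fall back to regular text-based detection.
    """
    return DECLARED_TO_SCANCODE.get(declared.strip().lower())


print(map_declared_license("LGPL-2.1+"))
print(map_declared_license("Some-Unknown-License"))
```

An exact lookup like this is intentionally strict: anything not in the table stays undecided rather than being guessed.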

pombredanne commented 3 years ago

See https://github.com/nexB/scancode.io/issues/103#issuecomment-815665295 for a detailed description of the problems

pombredanne commented 3 years ago

To improve the tracing, I think we could take this simple approach:

  1. decouple license detection entirely from the processing of the rest of a copyright file's data (and copyright statements), so that we have a function that deals only with license detection and returns license matches.
  2. expose a new option in license detection such as --debian-copyright and an argument such as as_debian_copyright that would treat *copyright files as if these were Debian copyright files.

This way we can get regular license detection results from just copyright files, irrespective of whether they are in the context of a package or not.
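A minimal sketch of what the decoupled function from step 1 might look like, assuming the detector is passed in as a plain callable. Everything here is hypothetical: `ParagraphTrace`, `detect_licenses_in_copyright` and `toy_detect` are names made up for illustration, and `toy_detect` is a stand-in that only reads "License:" lines; it is not ScanCode's license detection.

```python
from dataclasses import dataclass, field


@dataclass
class ParagraphTrace:
    """Intermediate detection result kept for one paragraph."""
    index: int
    start_line: int
    text: str
    matches: list = field(default_factory=list)


def detect_licenses_in_copyright(text, detect):
    """Run `detect(paragraph_text)` on each blank-line-separated paragraph.

    Returns one ParagraphTrace per paragraph so that each match can be
    traced back to the paragraph and line where it came from.
    """
    traces = []
    buffer, start = [], None
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if start is None:
                start = number
            buffer.append(line)
        elif buffer:
            traces.append(ParagraphTrace(len(traces), start, "\n".join(buffer)))
            buffer, start = [], None
    if buffer:
        traces.append(ParagraphTrace(len(traces), start, "\n".join(buffer)))
    for trace in traces:
        trace.matches = list(detect(trace.text))
    return traces


def toy_detect(paragraph):
    """Stand-in detector: report the short name found on 'License:' lines."""
    return [
        line.split(":", 1)[1].strip()
        for line in paragraph.splitlines()
        if line.startswith("License:")
    ]


# Made-up input, not taken from any real package.
text = "Files: *\nLicense: LGPL-2.1+\n\nFiles: debian/*\nLicense: GPL-2+\n"
traces = detect_licenses_in_copyright(text, toy_detect)
for trace in traces:
    print(trace.index, trace.start_line, trace.matches)
```

Because the function only takes text and a detector, it can be called on a standalone copyright file as well as from package handling code, which is the point of step 1.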

pombredanne commented 3 years ago

@AyanSinhaMahapatra FYI ^