aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, and dependencies by "scanning code" to discover and inventory open source and third-party packages used in your code. Sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Support exporting license and copyright information in Debian copyright format to integrate with REUSE #2235

Open msohn opened 4 years ago

msohn commented 4 years ago

Short Description

To use ScanCode results in projects adopting REUSE [1], it would be nice if copyright and license information could be exported in the Debian copyright format [2], which REUSE uses for non-intrusive bulk licensing. This is useful, for example, in Go projects where the source code of dependencies is pulled into the vendor/ folder of the project's repository.

[1] https://reuse.software/faq/#bulk-license [2] https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/#disclaimer-field

Describe the Update

Export copyright and license information in the Debian copyright format, which is used by REUSE.

How This Feature will help you/your organization

Automate compliance with REUSE, especially for Go projects, which tend to keep the source code of many dependencies in the vendor/ folder. Declaring copyright and license information for such dependencies means creating the initial .reuse/dep5 file and then updating it whenever dependencies change, which is a lot of work when done manually.

We currently use a poor-man's partial solution based on askalono, which can only generate the Files: and License: fields of the Debian copyright format.

Possible Solution/Implementation Details

Without knowing any details about ScanCode's implementation, I guess this could be implemented as a post-scan plugin.

Can you help with this Feature

I am Python 3 literate, so I could try to help if you can provide hints on how to get started. I have no clue about the current scancode-toolkit implementation.

pombredanne commented 4 years ago

@msohn that would be awesome indeed. The way it would typically be done is as an output plugin. In terms of building blocks, the key elements would be:

  1. collecting matched license expressions and license texts with --license --license-text
  2. collecting copyrights with --copyright, ideally leaving out those that are part of a license text, with --filter-clues
  3. using the debut library, which knows everything about Debian files and Debian copyright files (see https://github.com/nexB/debut and https://github.com/nexB/debut/blob/master/src/debut/copyright.py#L311 ). It is already available in ScanCode, and if it needs patching we can patch it in a snap.
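To make the target output concrete, here is a minimal sketch of the last step: turning per-file license and copyright results into Debian copyright "Files" stanzas. It does not use the ScanCode or debut APIs. The input record shape (`path`, `license`, `copyrights`), the function name `render_dep5` and the `Unknown` placeholder are assumptions made for illustration only, and a real plugin would map them from the actual scan results.

```python
from collections import defaultdict

# Format URL defined by the Debian copyright format 1.0 specification.
FORMAT_URL = "https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/"


def render_dep5(files):
    """Render Debian copyright (DEP-5) text from simplified per-file records.

    Each record is a hypothetical dict, not ScanCode's real output schema:
        {"path": str, "license": str, "copyrights": [str, ...]}
    Files sharing the same license and copyright statements are grouped
    into a single "Files" stanza.
    """
    groups = defaultdict(list)
    for rec in files:
        key = (rec["license"], tuple(sorted(set(rec["copyrights"]))))
        groups[key].append(rec["path"])

    stanzas = ["Format: " + FORMAT_URL]
    for (license_expr, copyrights), paths in sorted(groups.items()):
        paths = sorted(paths)
        # Continuation lines of a multi-line field start with a space.
        lines = ["Files: " + paths[0]]
        lines.extend(" " + p for p in paths[1:])
        # "Unknown" is a placeholder chosen for this sketch only.
        holders = copyrights or ("Unknown",)
        lines.append("Copyright: " + holders[0])
        lines.extend(" " + h for h in holders[1:])
        lines.append("License: " + license_expr)
        stanzas.append("\n".join(lines))
    # Stanzas (paragraphs) are separated by one empty line.
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    sample = [
        {"path": "vendor/a/x.go", "license": "MIT", "copyrights": ["2019 Foo"]},
        {"path": "vendor/a/y.go", "license": "MIT", "copyrights": ["2019 Foo"]},
        {"path": "vendor/b/z.go", "license": "Apache-2.0", "copyrights": ["2020 Bar"]},
    ]
    print(render_dep5(sample))
```

Grouping on the exact (license, copyrights) pair keeps the sketch simple. Collapsing file lists into wildcard patterns such as `vendor/a/*` would shorten the output but needs care, because in this format a later matching stanza overrides an earlier one.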

There is also an existing tool in Debian at https://salsa.debian.org/debian/decopy by @maxyz, packaged by @margamanterola and recently updated by @jspricke and @jelmer.

@maxyz and I chatted in the past about using ScanCode to improve decopy's license detection. I am not sure how much of it we can reuse directly today, but it would be nice to do so if possible, and the intent is clearly the same. Note that its core logic for grouping by license may be similar to @JonoYang's summarycode: https://github.com/nexB/scancode-toolkit/tree/develop/src/summarycode

pombredanne commented 3 years ago

#2417 is a WIP PR to add some support for this.