aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Improve license detection for wrong SPDX license identifiers #3912

Open AyanSinhaMahapatra opened 2 months ago

AyanSinhaMahapatra commented 2 months ago

Consider the following text:

SPDX-License-Identifier: (GPL-2.0+ OR BSD)

Here BSD is not a valid SPDX license identifier, and even adding a rule is insufficient because the SPDX-License-Identifier based detection was moved to run before the hash license detection.

We should either:

  1. do the hash license detection first so we can catch these with rules, and then do the SPDX identifier based detection
  2. fall back to license detection with rules if the SPDX detection returns unknown-spdx
  3. optionally, also fall back to license detection with required phrase rules if nothing else works (this could lose the license expression info)

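A minimal sketch of option 2, to make the fallback concrete. This is not ScanCode's code: `KNOWN_SPDX_IDS`, `unknown_spdx_ids` and `choose_detection` are hypothetical names, and the identifier set is a tiny illustrative subset of the real SPDX license list.

```python
import re

# Tiny illustrative subset of SPDX identifiers (assumption: the real list is far larger).
KNOWN_SPDX_IDS = {"GPL-2.0+", "GPL-2.0-or-later", "BSD-2-Clause", "BSD-3-Clause", "MIT"}
OPERATORS = {"AND", "OR", "WITH"}


def unknown_spdx_ids(line):
    """Return the identifiers in an SPDX-License-Identifier line that are not known."""
    _, _, expression = line.partition("SPDX-License-Identifier:")
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    return [t for t in tokens if t not in OPERATORS and t not in KNOWN_SPDX_IDS]


def choose_detection(line):
    """Hypothetical option 2: prefer SPDX detection unless an identifier is unknown."""
    return "rules" if unknown_spdx_ids(line) else "spdx-id"


line = "SPDX-License-Identifier: (GPL-2.0+ OR BSD)"
print(unknown_spdx_ids(line))  # ['BSD']
print(choose_detection(line))  # rules
```

With this shape, a line whose identifiers are all known stays on the SPDX path, and only the lines that would otherwise yield unknown-spdx are handed to the rules.
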
pombredanne commented 2 months ago

Create a rule for gpl-2.0-plus AND bsd-new with this text:

SPDX-License-Identifier: (GPL-2.0+ OR BSD)

and set its relevance to 99.

That's the approach for BSDs: the rule will be picked over the SPDX detection, or at least it should be.

pombredanne commented 2 months ago

Here are examples https://github.com/search?q="SPDX-License-Identifier%3A+(GPL-2.0%2B+OR+BSD)"&type=code and

https://github.com/BPI-SINOVOIP/BPI-R2PRO-BSP/blob/938b4b14d8ee8e332a6cf04111a11d9a95156a6d/kernel/include/dt-bindings/reset/amlogic%2Cmeson-axg-reset.h#L9

pombredanne commented 2 months ago

I pushed a fix in https://github.com/aboutcode-org/scancode-toolkit/pull/3905/commits/c581828c12c5b692f9b0c080f4da07b9e014285f

The default sort order of LicenseMatch was based on the "matcher" string, hence "1-spdx-id" would always beat a "2-aho" match. Now we have a new "matcher_order" integer attribute that is used to sort instead, so the hash and aho matches always take precedence over SPDX.
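
The ordering problem can be reproduced with a small stand-in. This is only a sketch: the `Match` class below is hypothetical and not the real LicenseMatch, and the integer values given to `matcher_order` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for a license match; not the real LicenseMatch class."""
    matcher: str
    matcher_order: int


# The order values are assumptions: lower sorts first, so aho ranks ahead of SPDX.
spdx = Match(matcher="1-spdx-id", matcher_order=2)
aho = Match(matcher="2-aho", matcher_order=1)

# Old behavior: sorting on the matcher string puts "1-spdx-id" before "2-aho".
by_name = sorted([aho, spdx], key=lambda m: m.matcher)

# New behavior: sorting on the integer attribute puts the aho match first.
by_order = sorted([spdx, aho], key=lambda m: m.matcher_order)

print(by_name[0].matcher)   # 1-spdx-id
print(by_order[0].matcher)  # 2-aho
```

Decoupling the sort key from the display string means the matcher names can stay as they are while the precedence is set explicitly.
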