aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

False positive AGPL detection from a mere URL #2877

Open pombredanne opened 2 years ago

pombredanne commented 2 years ago

We are detecting an AGPL license with the rule agpl-3.0-plus_152.RULE on this text: http://www.ghostscript.com ... for instance in https://github.com/ReactiveX/rxjs/blob/6.x/README.md

This is noisy.

There are two ways out:

  1. remove these short URL rules and related rules, since they are not enough on their own to be a license detection, or
  2. treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection

AyanSinhaMahapatra commented 2 years ago

@pombredanne I have a couple of possible cases here which could be clues, as opposed to detections. What do you think?

That is, they would have is_clue set to True in their .yml files and would be reported in license_clues, not in license_detections.

Cases where is_clue = True:

  1. urls to github/other code repo licenses: example: http://github.com/dotnetcore/Util/blob/master/LICENSE or https://devshed.codeplex.com/license
  2. links to websites such as ghostscript, as above
  3. unknown references like http://licenses.nuget.org/
  4. links to github repos (and not licenses) like https://github.com/micahlmartin/OAuth2Provider
  5. links to github licenses which are unknown like: https://raw.github.com/markwoodhall/MFlow/master/license.txt
  6. references/words which are known to be a license, like: mupdf or ghostscript or Affero
  7. words that are license names but generic (?): like beerware or borceux
  8. hash values: example: md5=1a6d268fd218675ffea8be556788b780" is lgpl-2.1
  9. abbreviations of licenses (could tags like gpl be also included in this?): like PSFL
  10. other unknown license references which are references to files/websites/packages and could not be resolved successfully

Cases where is_clue is False: i.e. these are valid detections

  1. link to license texts or specific licenses https://spdx.org/licenses/bsd-2-clause
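
To make the proposed split concrete, here is a minimal sketch of how matches could be routed into license_clues or license_detections based on an is_clue flag. The `Match` class, the `split_matches` helper and the rule name `hypothetical_bsd_url.RULE` are illustrative assumptions, not ScanCode's actual classes or rules; only is_clue, license_clues, license_detections and agpl-3.0-plus_152.RULE come from this thread.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for a license match; not ScanCode's real class."""
    rule_identifier: str
    matched_text: str
    is_clue: bool  # assumed to be copied from the rule's .yml data


def split_matches(matches):
    """Report clues separately from detections (hypothetical helper)."""
    report = {"license_detections": [], "license_clues": []}
    for match in matches:
        key = "license_clues" if match.is_clue else "license_detections"
        report[key].append(match.rule_identifier)
    return report


matches = [
    # A bare website URL: proposed above as a mere clue.
    Match("agpl-3.0-plus_152.RULE", "http://www.ghostscript.com", is_clue=True),
    # A link to a specific license text: proposed above as a valid detection.
    Match("hypothetical_bsd_url.RULE", "https://spdx.org/licenses/bsd-2-clause", is_clue=False),
]
report = split_matches(matches)
```

With this shape, a consumer that only cares about firm results can ignore the license_clues list entirely.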

Also attaching a csv file with a subset of the rules (is_license_reference = True and relevance < 100): clues_possible.csv

pombredanne commented 2 years ago

This makes 100% sense... we have to tread lightly though...

  1. urls to github/other code repo licenses: example: http://github.com/dotnetcore/Util/blob/master/LICENSE or https://devshed.codeplex.com/license ==> IMHO several are bona fide detection rules, not mere clues.
  2. links to websites like ghostscript like above ==> this is the case for mere clues
  3. unknown references like http://licenses.nuget.org/ ==> for this bare URL, likely yes, but https://licenses.nuget.org/(LGPL-2.0-only WITH FLTK-exception OR Apache-2.0+) would need to be detected, possibly with a new matcher or by extending the matcher for SPDX license identifiers, and would in all cases not be a mere clue in a rule
  4. links to github repos (and not licenses) like https://github.com/micahlmartin/OAuth2Provider ==> agreed
  5. links to github licenses which are unknown like: https://raw.github.com/markwoodhall/MFlow/master/license.txt ==> it depends. In many cases these are well known repos with stable licensing... another possibility could be to have a step to fetch things at the URL end and detect that instead ... but that is out of scope for this issue ;)
  6. references/words which are known to be a license, like: mupdf or ghostscript or Affero ==> agreed
  7. words that are license names but generic (?): like beerware or borceux ==> beerware is surely a proper rule and not a mere clue, borceux would be a clue alright ... so it really depends
  8. hash values: example: md5=1a6d268fd218675ffea8be556788b780" is lgpl-2.1 ==> this is borderline and could be a proper rule rather than a clue... some thinking needed
  9. abbreviations of licenses (could tags like gpl be also included in this?): like PSFL ==> agreed. For the GPL one I think we would need to have a special post-matching processing, possibly looking at case and mixed case... which BTW would mean that is_clue is an attribute of a license rule alright BUT could be overridden in a license match, and therefore licensematch should also have one IMHO
  10. other unknown license references which are references to files/websites/packages and could not be resolved successfully ==> agreed
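
The override idea in item 9 could look roughly like the sketch below: the rule carries a default is_clue, and a match may override it after a post-matching step. Everything here is an assumption for illustration: the `Rule` and `LicenseMatch` classes are simplified stand-ins, `is_clue_override` and `refine_by_case` are made-up names, and the case policy (all-uppercase abbreviation counts as a detection, anything else as a clue) is just one possible reading of "looking at case and mixed case".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Simplified, hypothetical rule: is_clue is the default from the .yml."""
    identifier: str
    is_clue: bool = False


@dataclass
class LicenseMatch:
    """Simplified, hypothetical match that can override its rule's is_clue."""
    rule: Rule
    matched_text: str
    is_clue_override: Optional[bool] = None

    @property
    def is_clue(self) -> bool:
        # The match-level value wins when set; otherwise use the rule default.
        if self.is_clue_override is not None:
            return self.is_clue_override
        return self.rule.is_clue


def refine_by_case(match: LicenseMatch) -> LicenseMatch:
    """Assumed policy: 'GPL' stays a detection, 'gpl' or 'Gpl' becomes a clue."""
    text = match.matched_text.strip()
    match.is_clue_override = not text.isupper()
    return match


gpl_rule = Rule("hypothetical_gpl_abbreviation.RULE", is_clue=False)
upper = refine_by_case(LicenseMatch(gpl_rule, "GPL"))
lower = refine_by_case(LicenseMatch(gpl_rule, "gpl"))
```

Keeping the override on the match rather than mutating the rule means the same rule can yield a detection in one file and a clue in another.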

rspier commented 2 years ago

treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection

This seems like a recipe for noise in the output, or possibly a reason to need more granular levels of clue (strong clue / weak clue). But I would probably lean towards "if it isn't actually a useful signal, it's not interesting". What are you going to do with the clues once you have them?

One of the challenges with these heuristics is context, or the lack thereof.

I had a case a few weeks ago where https://github.com/svaarala/duktape/blob/master/website/index/index.html got scanned.

It contains:

Similar engines
There are multiple Javascript engines targeting similar use cases as Duktape, at least:

[Espruino](https://github.com/espruino/Espruino) (MPL v2.0)
[JerryScript](http://jerryscript.net/) (Apache License v2.0)
[MuJS](http://mujs.com/) (Affero GPL)
[quad-wheel](https://code.google.com/p/quad-wheel/) (MIT License)
[QuickJS](https://bellard.org/quickjs/) (MIT License)
[tiny-js](https://github.com/gfwilliams/tiny-js) (MIT license)
[v7](https://github.com/cesanta/v7) (GPL v2.0)

Triggering off a license name results in false positives for Duktape, even though this section is actually talking about other products.
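
A toy illustration of the problem, assuming a purely name-based trigger: the `LICENSE_NAMES` table and `naive_name_hits` function are hypothetical and far simpler than real ScanCode rules, but they show how a matcher with no notion of context attributes other projects' licenses to the scanned file.

```python
# Hypothetical name-to-key table, for illustration only.
LICENSE_NAMES = {
    "Affero GPL": "agpl",
    "Apache License": "apache",
    "MIT License": "mit",
}

# Shortened excerpt of the Duktape page quoted above.
SNIPPET = """\
There are multiple Javascript engines targeting similar use cases as Duktape, at least:
[MuJS](http://mujs.com/) (Affero GPL)
[QuickJS](https://bellard.org/quickjs/) (MIT License)
"""


def naive_name_hits(text):
    """Return license keys whose name appears anywhere in the text."""
    lowered = text.lower()
    return sorted(
        key for name, key in LICENSE_NAMES.items() if name.lower() in lowered
    )


# Both hits describe MuJS and QuickJS, not Duktape itself.
hits = naive_name_hits(SNIPPET)
```

Nothing in the returned keys records that each name sits next to a link to a different project, which is the missing context described above.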

This particular example is more complicated/subtle than most of the other examples in this bug, so might be a distraction, but it's still interesting.