False positive AGPL detection from a mere URL

pombredanne commented 2 years ago

We are detecting an AGPL license with agpl-3.0-plus_152.RULE on this text: http://www.ghostscript.com ... for instance in https://github.com/ReactiveX/rxjs/blob/6.x/README.md

This is noisy.

There are two ways out:

remove these short URL rules and related rules, since they are not enough on their own to be a license detection, or
treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection

AyanSinhaMahapatra commented 2 years ago

@pombredanne I have a couple of possible cases here which could be clues, as opposed to detections. What do you think?

That is, they would have is_clue set to True in their .yml files and would be reported in license_clues, not in license_detections.

Cases where is_clue = True:

urls to github/other code repo licenses: example: http://github.com/dotnetcore/Util/blob/master/LICENSE or https://devshed.codeplex.com/license
links to websites like ghostscript like above
unknown references like http://licenses.nuget.org/
links to github repos (and not licenses) like https://github.com/micahlmartin/OAuth2Provider
links to github licenses which are unknown like: https://raw.github.com/markwoodhall/MFlow/master/license.txt
references/words which are known to be a license, like: mupdf or ghostscript or Affero
words that are license names but generic (?): like beerware or borceux
hash values: example: md5=1a6d268fd218675ffea8be556788b780" is lgpl-2.1
abbreviations of licenses (could tags like gpl be also included in this?): like PSFL
other unknown license references which are references to files/websites/packages and could not be resolved successfully

Cases where is_clue is False: i.e. these are valid detections

links to license texts or specific licenses, like https://spdx.org/licenses/bsd-2-clause
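
The split proposed above could look roughly like the sketch below. It is only an illustration under stated assumptions: `Rule` and `split_matches` are hypothetical names and not ScanCode's actual data model or API, and the second rule identifier is made up for the example. Only the `is_clue` flag and the `license_detections` / `license_clues` output keys come from the proposal itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Hypothetical stand-in for the data in a rule's .yml file.
    identifier: str
    license_expression: str
    is_clue: bool = False


def split_matches(matched_rules):
    """Partition matched rules into detections and clues using is_clue."""
    results = {"license_detections": [], "license_clues": []}
    for rule in matched_rules:
        key = "license_clues" if rule.is_clue else "license_detections"
        results[key].append(rule.identifier)
    return results


matched = [
    # A bare website URL: too weak to be a detection on its own.
    Rule("agpl-3.0-plus_152.RULE", "agpl-3.0-plus", is_clue=True),
    # A link to a specific license text (hypothetical rule name).
    Rule("bsd-simplified_spdx_url.RULE", "bsd-simplified", is_clue=False),
]
results = split_matches(matched)
```

With this shape, the ghostscript URL rule would still be surfaced, but under `license_clues` rather than as a detection.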

Also attaching a csv file with a subset of the rules (is_license_reference = True and relevance < 100): clues_possible.csv

pombredanne commented 2 years ago

This makes 100% sense... we have to tread lightly though.

urls to github/other code repo licenses: example: http://github.com/dotnetcore/Util/blob/master/LICENSE or https://devshed.codeplex.com/license ==> IMHO several are bona fide detection rules, not mere clues.
links to websites like ghostscript like above ==> this is the case for mere clues
unknown references like http://licenses.nuget.org/ ==> for this bare URL, likely yes, but https://licenses.nuget.org/(LGPL-2.0-only WITH FLTK-exception OR Apache-2.0+) would need to be detected, possibly with a new matcher or by extending the matcher for SPDX license identifiers, and would in all cases not be a mere clue in a rule
links to github repos (and not licenses) like https://github.com/micahlmartin/OAuth2Provider ==> agreed
links to github licenses which are unknown like: https://raw.github.com/markwoodhall/MFlow/master/license.txt ==> it depends. In many cases these are well-known repos with stable licensing... another possibility could be to have a step to fetch things at the URL end and detect that instead ... but that is out of scope for this issue ;)
references/words which are known to be a license, like: mupdf or ghostscript or Affero ==> agreed
words that are license names but generic (?): like beerware or borceux ==> beerware is surely a proper rule and not a mere clue, borceux would be a clue alright ... so it really depends
hash values: example: md5=1a6d268fd218675ffea8be556788b780" is lgpl-2.1 ==> this is borderline and could be a proper rule rather than a clue... some thinking needed
abbreviations of licenses (could tags like gpl be also included in this?): like PSFL ==> agreed. For the GPL one I think we would need to have a special post-matching processing, possibly looking at case and mixed case... which BTW would mean that is_clue is an attribute of a license rule alright BUT could be overridden in a license match, and therefore licensematch should also have one IMHO
other unknown license references which are references to files/websites/packages and could not be resolved successfully ==> agreed
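
To make the last point about tags concrete, here is a rough sketch of is_clue living on the rule while a match can override it after matching. Everything here is an assumption for illustration: `Rule`, `LicenseMatch`, `refine_short_tag` and the `gpl_tag.RULE` name are hypothetical, and the casing heuristic (all-uppercase tag is a detection, anything else stays a clue) is invented just to show the mechanism, not a proposed policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    # Hypothetical stand-in for a license rule.
    identifier: str
    is_clue: bool = False


@dataclass
class LicenseMatch:
    rule: Rule
    matched_text: str
    # None means "inherit from the rule"; True/False overrides the rule.
    is_clue_override: Optional[bool] = None

    @property
    def is_clue(self) -> bool:
        if self.is_clue_override is not None:
            return self.is_clue_override
        return self.rule.is_clue


def refine_short_tag(match: LicenseMatch) -> LicenseMatch:
    """Post-matching step for short tags such as 'gpl'."""
    text = match.matched_text.strip()
    # Assumed heuristic: "GPL" in all caps is treated as a detection,
    # any other casing is kept as a clue.
    match.is_clue_override = not text.isupper()
    return match


gpl_tag_rule = Rule("gpl_tag.RULE", is_clue=True)  # hypothetical rule name
upper = refine_short_tag(LicenseMatch(gpl_tag_rule, "GPL"))
lower = refine_short_tag(LicenseMatch(gpl_tag_rule, "gpl"))
plain = LicenseMatch(gpl_tag_rule, "Gpl")  # not post-processed: inherits the rule value
```

Keeping the override optional means rules that need no post-processing behave exactly as their .yml data says.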

rspier commented 2 years ago

treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection

This seems like a recipe for noise in the output, or possibly a need for more granular levels of clue (strong clue / weak clue). But I would probably lean towards "if it isn't actually useful signal, it's not interesting". What are you going to do with the clues once you have them?

One of the challenges with these heuristics is context, or the lack thereof.

I had a case a few weeks ago where https://github.com/svaarala/duktape/blob/master/website/index/index.html got scanned.

It contains:

Similar engines
There are multiple Javascript engines targeting similar use cases as Duktape, at least:

[Espruino](https://github.com/espruino/Espruino) (MPL v2.0)
[JerryScript](http://jerryscript.net/) (Apache License v2.0)
[MuJS](http://mujs.com/) (Affero GPL)
[quad-wheel](https://code.google.com/p/quad-wheel/) (MIT License)
[QuickJS](https://bellard.org/quickjs/) (MIT License)
[tiny-js](https://github.com/gfwilliams/tiny-js) (MIT license)
[v7](https://github.com/cesanta/v7) (GPL v2.0)

Triggering off the license names results in false positives for Duktape, even though this section is actually talking about other products.

This particular example is more complicated/subtle than most of the other examples in this bug, so might be a distraction, but it's still interesting.

aboutcode-org / scancode-toolkit

False positive AGPL detection from a mere URL #2877