Open pombredanne opened 2 years ago
@pombredanne I have a couple of possible cases here which could be clues, as opposed to detections. What do you think?
i.e. they will have is_clue as True in their .yml files and will be reported in license_clues, and not in license_detections.
Cases where is_clue = True:
unknown
like: https://raw.github.com/markwoodhall/MFlow/master/license.txt
md5=1a6d268fd218675ffea8be556788b780"
is lgpl-2.1
gpl
be also included in this?): like PSFL
Cases where is_clue is False: i.e. these are valid detections
Also attaching a csv file with a subset of the rules (is_license_reference = True and relevance < 100): clues_possible.csv
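To make the proposal concrete, here is a minimal sketch of the split described above: matched rules flagged with is_clue go to license_clues, everything else goes to license_detections. The Rule class, the split_matches helper and the rule identifiers are hypothetical stand-ins for illustration, not the actual scancode data model.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # Hypothetical, simplified stand-in for a license rule and its .yml data.
    identifier: str
    license_expression: str
    is_clue: bool = False


def split_matches(matched_rules):
    """Return (detections, clues) from an iterable of matched rules."""
    detections = [r for r in matched_rules if not r.is_clue]
    clues = [r for r in matched_rules if r.is_clue]
    return detections, clues


# Illustrative rule identifiers only; they are not taken from the rule set.
matched = [
    Rule("mit_1.RULE", "mit"),
    Rule("unknown-url_1.RULE", "unknown", is_clue=True),
]
license_detections, license_clues = split_matches(matched)
```

With this shape, a scan result would keep both lists, so a weak match is still visible but never counted as a detection.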
This makes 100% sense... we have to tread lightly though.
https://licenses.nuget.org/(LGPL-2.0-only WITH FLTK-exception OR Apache-2.0+) would need to be detected, possibly with a new matcher or by extending the matcher for SPDX license identifiers, and would in all cases not be a mere clue in a rule.
unknown like: https://raw.github.com/markwoodhall/MFlow/master/license.txt
it depends. In many cases these are well-known repos with stable licensing... Another possibility could be to have a step to fetch things at the URL end and detect that instead... but that is out of scope for this issue ;)
md5=1a6d268fd218675ffea8be556788b780" is lgpl-2.1
this is borderline and could be a proper rule rather than a clue... some thinking needed.
gpl
be also included in this?): like PSFL
agreed. For the GPL one I think we would need to have special post-matching processing, possibly looking at case and mixed case... which BTW would mean that is_clue is an attribute of a license rule alright BUT could be overridden in a license match, and therefore licensematch should also have one IMHO.
treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection
This seems like a recipe for noise in the output. Or possibly the need for more granular levels of clue. (strong clue / weak clue). But I would probably lean towards "if it isn't actually useful signal, it's not interesting". What are you going to do with the clues once you have them?
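For reference, the idea raised earlier in the thread (is_clue as a rule attribute that a license match can override, plus a post-matching step that looks at letter case) might look roughly like the sketch below. All names here are hypothetical, and the case policy (an all-upper-case short name keeps the rule's setting, anything else is demoted to a clue) is an assumption chosen for illustration; the thread does not settle on one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    # Hypothetical, simplified license rule.
    identifier: str
    is_clue: bool = False


@dataclass
class Match:
    # Hypothetical, simplified license match.
    rule: Rule
    matched_text: str
    # None means "inherit from the rule"; True or False overrides it.
    is_clue_override: Optional[bool] = None

    @property
    def is_clue(self) -> bool:
        if self.is_clue_override is not None:
            return self.is_clue_override
        return self.rule.is_clue


def demote_non_uppercase_short_names(match, short_names=("GPL",)):
    """Post-matching step: mark a bare short name as a clue unless it is all upper case."""
    text = match.matched_text.strip()
    if text.upper() in short_names and text != text.upper():
        match.is_clue_override = True
    return match


gpl_rule = Rule("gpl_bare_word.RULE")  # illustrative identifier only
upper = demote_non_uppercase_short_names(Match(gpl_rule, "GPL"))
lower = demote_non_uppercase_short_names(Match(gpl_rule, "gpl"))
mixed = demote_non_uppercase_short_names(Match(gpl_rule, "Gpl"))
```

Keeping the override on the match leaves the rule data untouched, so the same rule can yield a detection in one file and only a clue in another.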
One of the challenges with these heuristics is context, or the lack thereof.
I had a case a few weeks ago where https://github.com/svaarala/duktape/blob/master/website/index/index.html got scanned.
It contains:
Similar engines
There are multiple Javascript engines targeting similar use cases as Duktape, at least:
[Espruino](https://github.com/espruino/Espruino) (MPL v2.0)
[JerryScript](http://jerryscript.net/) (Apache License v2.0)
[MuJS](http://mujs.com/) (Affero GPL)
[quad-wheel](https://code.google.com/p/quad-wheel/) (MIT License)
[QuickJS](https://bellard.org/quickjs/) (MIT License)
[tiny-js](https://github.com/gfwilliams/tiny-js) (MIT license)
[v7](https://github.com/cesanta/v7) (GPL v2.0)
Triggering off license name results in false positives for Duktape, even though this section is actually talking about other products.
This particular example is more complicated/subtle than most of the other examples in this bug, so might be a distraction, but it's still interesting.
We are detecting an AGPL with agpl-3.0-plus_152.RULE and this text: http://www.ghostscript.com
... for instance from https://github.com/ReactiveX/rxjs/blob/6.x/README.md
This is noisy.
There are two ways out: