Open xu1119 opened 3 years ago
@xu1119 Thanks! @AyanSinhaMahapatra would your new plugin be able to spot this?
@pombredanne No (and yes). It definitely should have, though, so this was a good find.
In most of the false positives I got before, the common factor was that rule_length was 1: the text was matched to a very simple rule containing just the name of the license, such as gpl. This one, however, was matched to gpl-1.0_15.RULE, whose text is gpl 1.
So the preliminary step for separating out probable false positives is to select matches with "is_license_tag" == true and "rule_length" == 1 (as here), and then run them through a classifier to decide more accurately.
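A minimal sketch of that preliminary filter, to show why this case slips through. It assumes each match is available as a plain dict carrying the "is_license_tag" and "rule_length" fields quoted above; the helper name, the gpl_tag.RULE identifier, and the rule_length value of 2 for gpl-1.0_15.RULE are illustrative assumptions, not taken from the plugin.

```python
def passes_prelim_filter(matched_rule):
    """Return True if a match would be sent on to the classifier.

    Mirrors the condition described above:
    "is_license_tag" == true and "rule_length" == 1.
    """
    return (
        matched_rule.get("is_license_tag") is True
        and matched_rule.get("rule_length") == 1
    )


# Hypothetical matches; only the gpl-1.0_15.RULE identifier comes from this issue.
matches = [
    {"identifier": "gpl_tag.RULE", "is_license_tag": True, "rule_length": 1},
    # Assumed length: the rule text "gpl 1" is longer than a single word.
    {"identifier": "gpl-1.0_15.RULE", "is_license_tag": True, "rule_length": 2},
    {"identifier": "mit_notice.RULE", "is_license_tag": False, "rule_length": 1},
]

candidates = [m["identifier"] for m in matches if passes_prelim_filter(m)]
```

Under these assumptions gpl-1.0_15.RULE never reaches the classifier, which matches the behaviour reported here.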
We definitely need to put a more explicit step in place: go through all the scancode license_tag rules and see which ones could be matched as a false_positive, and then either raise the "rule_length" criterion so that these cases are analyzed correctly too, or maintain a set of rules that can generate potential false positives. I'm adding a ticket now and doing the same.
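The two options could look roughly like this. This is only a sketch under assumptions: the set name, the function name, and the max_rule_length parameter are hypothetical, and the single entry in the set is simply the rule from this issue.

```python
# Hypothetical curated set of license_tag rules known to produce false positives.
FALSE_POSITIVE_PRONE_RULES = frozenset({"gpl-1.0_15.RULE"})


def should_classify(matched_rule, max_rule_length=1,
                    prone_rules=FALSE_POSITIVE_PRONE_RULES):
    """Decide whether a match should be passed to the classifier.

    A license_tag match qualifies if its rule is short enough
    (option 1: raise max_rule_length) or if its rule identifier is in
    the curated set (option 2).
    """
    if not matched_rule.get("is_license_tag"):
        return False
    rule_length = matched_rule.get("rule_length")
    is_short = rule_length is not None and rule_length <= max_rule_length
    is_listed = matched_rule.get("identifier") in prone_rules
    return is_short or is_listed


# Assumed rule_length for illustration only.
gpl_1_match = {
    "identifier": "gpl-1.0_15.RULE",
    "is_license_tag": True,
    "rule_length": 2,
}
```

Raising max_rule_length widens the net for every tag rule, while the explicit set only affects the rules that have been reviewed, so the two can be combined.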
The sentence classifier step, i.e. the false_positive vs license_tag NLP classifier, does detect this correctly, so that part works. The preliminary step of only taking out matches with "rule_length" == 1 was based on the assumption that false positives are generated only by these rules, so that we don't have to pass all the license_tag matches through the classifier. But there are clearly exceptions to this assumption, like this case, and we should be able to detect them.
Thanks @xu1119
Also @pombredanne, there's a ticket open for an extra heuristic you suggested, at nexB/scancode-results-analyzer#29. Implementing it (without the single-word condition, to make things more explicit as discussed above) would also detect this case, since the match has "start_line": 1606.
When I scan this file with the latest scancode, I get the following license. The file is from https://github.com/zyq8709/DexHunter/blob/master/dalvik/vm/compiler/codegen/x86/AnalysisO1.cpp
Description
Source code wrongly detected as gpl-1.0
How To Reproduce
scancode -li --license-text --json-pp - AnalysisO1.cpp
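To pick the relevant matches out of the JSON that this command prints, something like the sketch below may help. It assumes the output has a top-level "files" list whose entries carry a "licenses" list with "start_line" and a nested "matched_rule" object; that layout is an assumption and may differ between scancode versions. The sample document is hand-written for illustration, not real scan output, and the rule_length value in it is assumed.

```python
import json


def license_tag_matches(scan):
    """Yield (path, rule identifier, start_line) for each license-tag match."""
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses", []):
            rule = match.get("matched_rule", {})
            if rule.get("is_license_tag"):
                yield (
                    scanned_file.get("path"),
                    rule.get("identifier"),
                    match.get("start_line"),
                )


# Hand-written sample in the assumed layout (not actual scancode output).
sample = json.loads("""
{
  "files": [
    {
      "path": "AnalysisO1.cpp",
      "licenses": [
        {
          "key": "gpl-1.0",
          "start_line": 1606,
          "matched_rule": {
            "identifier": "gpl-1.0_15.RULE",
            "is_license_tag": true,
            "rule_length": 2
          }
        }
      ]
    }
  ]
}
""")

found = list(license_tag_matches(sample))
```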
System configuration