Closed richardfontana closed 7 years ago
@richardfontana thanks!
License translations and non-English licenses are interesting cases: because they are reasonably rare at scale, they introduce a subtle bias that often leads to false positive detections (even though at very low scores) of legalese in that language. Also, because the words used in these licenses are much less frequent than the words used in English licenses, they tend to be given more prominence by the detection engine machinery.
Here, for instance, we have two CeCILL rules that are matched but with a very low "coverage", i.e. very few words of the original rule text were matched:
"matched_rule": {
"identifier": "cecill-2.0_2.RULE",
[...]
"matched_length": 16,
"match_coverage": 0.46,
"rule_relevance": 100
},
and:
"matched_rule": {
"identifier": "cecill-2.0_2.RULE",
[...]
"matched_length": 5,
"match_coverage": 0.14,
These are very few words BUT interestingly enough there are two cases:
So a resolution is going to comprise all of these:
set a minimum number of words to be matched (e.g. a minimum coverage) for these CeCILL rules, say at least 10% of the words, and eventually set the same on the other non-English licenses.
add a new rule specifically for this translation of the GFDL, either in the raw docbook variant or in the plain cleaned text variant (the latter is likely better). This will take care of getting a proper, high coverage match to this docbook file and handle the false positive detection of the GPL too.
add a new set of "frequent words" for French. Eventually we will track a list of these common words for each and every language and will split this list from the English frequent words. I will need to find per-language lists of these "stop-word"-like words later.
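The first item above, a minimum coverage for selected rules, can be sketched roughly as follows. This is only an illustration and not ScanCode's actual implementation: the function name, the `rule_prefixes` parameter and the third example match are hypothetical, and it assumes that `match_coverage` is expressed as a percentage, as the values in the output above suggest.

```python
# Assumed threshold: "at least 10% of the words" of the rule must be matched.
MIN_COVERAGE_PERCENT = 10.0


def filter_low_coverage(matches, min_coverage=MIN_COVERAGE_PERCENT,
                        rule_prefixes=("cecill",)):
    """Drop matches to the targeted rules whose coverage is below the minimum.

    `matches` is a list of dicts shaped like the JSON fragments above.
    Rules whose identifier does not start with one of `rule_prefixes`
    are kept regardless of their coverage.
    """
    kept = []
    for match in matches:
        rule = match["matched_rule"]
        is_targeted = rule["identifier"].startswith(rule_prefixes)
        if is_targeted and rule["match_coverage"] < min_coverage:
            # Too few words of the rule text were matched: likely spurious.
            continue
        kept.append(match)
    return kept


example_matches = [
    {"matched_rule": {"identifier": "cecill-2.0_2.RULE",
                      "matched_length": 16, "match_coverage": 0.46}},
    {"matched_rule": {"identifier": "cecill-2.0_2.RULE",
                      "matched_length": 5, "match_coverage": 0.14}},
    # Hypothetical high-coverage match, only here to show what is kept.
    {"matched_rule": {"identifier": "hypothetical-gfdl-fr.RULE",
                      "matched_length": 3000, "match_coverage": 98.0}},
]

remaining = filter_low_coverage(example_matches)
print([m["matched_rule"]["identifier"] for m in remaining])
```

With a threshold like this, both CeCILL matches shown earlier would be discarded, while a high coverage match to a dedicated rule would survive.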
Note also that, as part of #139, we will add a language attribute so we can know which language was matched for a license or license rule. This will help, for instance, with the addition of all the CC license translations as part of #514.
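To make the idea of per-language frequent words a bit more concrete, here is a minimal sketch of lists keyed by a language code. It is an assumption-laden illustration, not how ScanCode handles frequent words internally: the `FREQUENT_WORDS` table, its tiny word lists and the `significant_tokens` helper are all hypothetical, and this sketch simply drops frequent words where a real engine might only give them less weight.

```python
# Hypothetical, deliberately tiny lists; real ones would be much longer.
FREQUENT_WORDS = {
    "en": {"the", "of", "and", "to", "a", "in"},
    "fr": {"le", "la", "les", "de", "des", "du", "et", "un", "une"},
}


def significant_tokens(text, language):
    """Return the lowercased, whitespace-split tokens of `text` that are
    not in the frequent-word list for `language`.

    An unknown language falls back to an empty list of frequent words,
    so every token is kept.
    """
    frequent = FREQUENT_WORDS.get(language, set())
    return [tok for tok in text.lower().split() if tok not in frequent]


print(significant_tokens("La licence de documentation libre", "fr"))
print(significant_tokens("La licence de documentation libre", "en"))
```

Keeping the lists separate per language means that French function words such as "la" or "de" stop looking like rare, distinctive tokens just because they are absent from the English list.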
BTW, the spurious GPL match is because of the FSF address:
"matched_text": "51 Franklin [Street], Fifth Floor</[street]>, \n <[city]>Boston"
So I am adding a new GFDL 1.1 detection rule with the plain text version of the docbook file. I am using @jgm's excellent pandoc for the conversion:
$ pandoc -f docbook -t plain -o index.txt index.docbook
FWIW, these GNOME licenses are a treasure trove of quality markup: the docbook texts have hyperlinks to the essential conditions of the license. Quite a find, and great work that could have many reuses, e.g. helping with the legal review of a text.
@richardfontana The latest code in the develop branch fixes your bug and is ready for your review. I am tracking other related foreign-language license improvements in #139.
Thanks++ for this report: please continue sending any oddities you find our way. Any other feedback for improvements is welcome too, of course.
BTW, do you run ScanCode on your desktop or on a server?
@pombredanne I have been running ScanCode on my laptop.
Closing this as it was fixed.
In gnome-desktop 3.14.2 the file desktop-docs/fdl/fr/index.docbook contains what I believe is a French translation of the GNU FDL. ScanCode detects GPL and CeCILL in this text. index.docbook-scancode.txt