Spurious detection of CeCILL and GPL in French translation of GFDL

richardfontana commented 7 years ago

In gnome-desktop 3.14.2, the file desktop-docs/fdl/fr/index.docbook contains what I believe is a French translation of the GNU FDL. ScanCode detects GPL and CeCILL in this text. Attachment: index.docbook-scancode.txt

pombredanne commented 7 years ago

@richardfontana thanks!

pombredanne commented 7 years ago

License translations and non-English licenses are interesting cases: because they are reasonably rare at scale, they introduce a subtle bias that often leads to false positive detections (even though at very low scores) of legalese in that language. Also, because the words used in these licenses are much less frequent than the words used in English licenses, they tend to be given more prominence by the detection engine machinery.

Here, for instance, we have two CeCILL rules that are matched but with a very low "coverage", i.e. very few words of the original rule text were matched:

          "matched_rule": {
            "identifier": "cecill-2.0_2.RULE",
[...]
            "matched_length": 16,
            "match_coverage": 0.46,
            "rule_relevance": 100
          },

and:

          "matched_rule": {
            "identifier": "cecill-2.0_2.RULE",
[...]
            "matched_length": 5,
            "match_coverage": 0.14,

These are very few words BUT interestingly enough there are two cases:

some matched words are very frequent words in French
or the matched words are French legalese

So a resolution is going to comprise all of these:

set a minimum number of words to be matched (i.e. a minimum coverage) for these CeCILL rules, say at least 10% of the words. Eventually, set the same on the other non-English licenses.
add a new rule specifically for this translation of the GFDL, either in the raw docbook variant or in the plain cleaned text variant (the latter is likely better). This will give a proper, high-coverage match for this docbook file and handle the false positive detection of the GPL too.
add a new set of "frequent words" for French. Eventually we will track a list of these common words for each and every language and split these lists from the English frequent words. I will need to find per-language lists of these "stop-word"-like words later.
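
To make the first and third items concrete, here is a minimal sketch. It is not ScanCode's actual implementation: it assumes matches are plain dicts shaped like the JSON excerpts above, that match_coverage is expressed as a percentage, and it uses a hypothetical MIN_COVERAGE table plus a tiny, purely illustrative list of frequent French words.

```python
# Hypothetical per-rule minimum coverage thresholds.
# Assumption: "match_coverage" is a percentage of the rule's words.
MIN_COVERAGE = {"cecill-2.0_2.RULE": 10.0}

# Tiny illustrative sample of frequent French words; a real list would be
# much longer and would come from a per-language resource.
FRENCH_FREQUENT_WORDS = frozenset(
    {"le", "la", "les", "de", "des", "du", "et", "ou", "un", "une", "en", "à"}
)


def passes_minimum_coverage(match, min_coverage=MIN_COVERAGE):
    """Return True if the match reaches the minimum coverage set for its rule."""
    rule = match["matched_rule"]
    # Rules without a configured threshold are always kept.
    threshold = min_coverage.get(rule["identifier"], 0.0)
    return rule["match_coverage"] >= threshold


def significant_words(words, frequent=FRENCH_FREQUENT_WORDS):
    """Drop frequent ("stop-word"-like) words so they cannot drive a match."""
    return [word for word in words if word.lower() not in frequent]


# The two low-coverage CeCILL matches reported above.
matches = [
    {"matched_rule": {"identifier": "cecill-2.0_2.RULE",
                      "matched_length": 16, "match_coverage": 0.46}},
    {"matched_rule": {"identifier": "cecill-2.0_2.RULE",
                      "matched_length": 5, "match_coverage": 0.14}},
]

# Keep only the matches that reach their rule's minimum coverage.
kept = [match for match in matches if passes_minimum_coverage(match)]
```

Under these assumptions, both CeCILL matches fall below the 10% threshold and are discarded, while rules without a configured threshold are left untouched.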

Note also that, as part of #139, we will add a language attribute so we can know which language was matched for a license or license rule. This will help, for instance, with the addition of all the CC license translations as part of #514.

pombredanne commented 7 years ago

BTW, the spurious GPL match is because of the FSF address: "matched_text": "51 Franklin [Street], Fifth Floor</[street]>, \n <[city]>Boston"

pombredanne commented 7 years ago

So I am adding a new GFDL 1.1 detection rule with the plain text version of the docbook file. I am using @jgm's excellent pandoc for the conversion: $ pandoc -f docbook -t plain -o index.txt index.docbook
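
For scripting the same pandoc conversion over several docbook files, a small Python wrapper could look like the sketch below. The function names pandoc_command and docbook_to_plain are hypothetical helpers, not part of ScanCode or pandoc; the only assumption about pandoc is the command-line invocation quoted above.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the pandoc invocation quoted above as an argument list."""
    return ["pandoc", "-f", "docbook", "-t", "plain", "-o", target, source]


def docbook_to_plain(source="index.docbook", target="index.txt"):
    """Convert a docbook file to plain text by shelling out to pandoc."""
    # Fail early with a clear message if pandoc is not available.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(source, target), check=True)
```

Calling docbook_to_plain() with its defaults runs the same conversion as the command line above.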

pombredanne commented 7 years ago

FWIW, these GNOME licenses are a treasure trove of quality markup: the docbook texts have hyperlinks to the essential conditions of the license. Quite a find, and great work that could have many reuses, e.g. helping with the legal review of a text.

pombredanne commented 7 years ago

@richardfontana The latest code in the develop branch fixes your bug and is ready for your review. I am tracking other related foreign-language license improvements in #139. Thanks++ for this report: please keep sending any oddities you find our way. Any other feedback for improvements is welcome too, of course. BTW, do you run ScanCode on your desktop or on a server?

richardfontana commented 7 years ago

@pombredanne I have been running ScanCode on my laptop.

richardfontana commented 7 years ago

Closing this as it was fixed.

aboutcode-org / scancode-toolkit

Spurious detection of CeCILL and GPL in French translation of GFDL #553