codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

Filter homographs (or homoglyphs) #2007

Open arm-in opened 3 years ago

arm-in commented 3 years ago

Certain letters in Latin, Greek, Cyrillic, etc. look the same but have different code points in Unicode.

The codespell dictionaries cover British and American English, so ASCII should be enough. Arguably, some words of French or Spanish origin might have accents. https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin https://en.wikipedia.org/wiki/List_of_English_words_of_Spanish_origin

To my knowledge, no English word contains Cyrillic or Greek letters. Instead of adding lots of words with bogus letters to the dictionary, there should be a filter in codespell to flag or correct them.

Explanation: https://en.wikipedia.org/wiki/IDN_homograph_attack

Unicode table: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024&number=128&names=-

Codespell example: see #2001

DimitriPapadopoulos commented 2 years ago

Python module homoglyphs may help generate the list of homoglyphs.

Because the above module is not maintained anymore, creating the list of homoglyphs at build time seems good enough, and it is faster and simpler than generating it at run time.
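A rough sketch of what build-time generation could look like, in Python. Everything here is an assumption made for illustration: the input is taken to be semicolon-separated lines of hexadecimal code points in the style of the Unicode consortium's confusables data, the sample lines are made up in that layout, and `parse_confusables` and `render_module` are hypothetical helpers, not part of codespell.

```python
# Made-up sample lines: "source ; target ; type  # comment".
# The real data file is NOT bundled here; its exact layout is an assumption.
SAMPLE = """\
# illustrative excerpt only
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A looks like LATIN SMALL LETTER A
03BF ;\t006F ;\tMA\t# GREEK SMALL LETTER OMICRON looks like LATIN SMALL LETTER O
0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES looks like LATIN SMALL LETTER C
FB01 ;\t0066 0069 ;\tMA\t# ligature mapping to two letters, skipped below
"""


def parse_confusables(text):
    """Return {non-ASCII character: ASCII letter} for one-to-one mappings."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        # Keep only single characters that imitate a single ASCII letter.
        if (len(source) == 1 and len(target) == 1
                and not source.isascii()
                and target.isascii() and target.isalpha()):
            table[source] = target
    return table


def render_module(table):
    """Render the table as Python source that a build step could write out."""
    lines = ["HOMOGLYPHS = {"]
    for source, target in sorted(table.items()):
        lines.append(f"    {source!r}: {target!r},  # U+{ord(source):04X}")
    lines.append("}")
    return "\n".join(lines) + "\n"


table = parse_confusables(SAMPLE)
generated_source = render_module(table)
```

The generated source would be committed or packaged as a static table, so nothing has to be downloaded or parsed at run time.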

arm-in commented 2 years ago

Good catch! This may save quite some effort.

DimitriPapadopoulos commented 2 years ago

Raising an alert on words that contain both ASCII and non-Latin characters might be a good idea. I cannot imagine a common use for such words, except software precisely dealing with homoglyphs.

On the other hand, raising an alert on words with both ASCII and non-ASCII Latin characters might be a bad idea. I bet many source files out there contain words written with both ASCII and non-ASCII Latin characters (accented characters), at the very least author names: Günter, Zoë, Édouard, Tomáš, etc.
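To make the distinction concrete, here is a minimal Python sketch. It assumes the script of a character can be approximated by the first word of its Unicode name (`LATIN`, `CYRILLIC`, `GREEK`, ...), which is a heuristic and not a real script lookup. The helper names `script_of` and `mixes_ascii_and_non_latin` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Rough script label: first word of the Unicode character name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name in the database
        return "UNKNOWN"


def mixes_ascii_and_non_latin(word):
    """True if the word has ASCII letters and non-Latin, non-ASCII letters."""
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in word)
    has_non_latin_letter = any(
        ch.isalpha() and not ch.isascii() and script_of(ch) != "LATIN"
        for ch in word
    )
    return has_ascii_letter and has_non_latin_letter


# Accented Latin names are left alone.
names_flagged = [mixes_ascii_and_non_latin(n)
                 for n in ("Günter", "Zoë", "Édouard", "Tomáš")]

# "password" with CYRILLIC SMALL LETTER A (U+0430) in place of "a".
spoof_flagged = mixes_ascii_and_non_latin("p\u0430ssword")
```

With this heuristic the author names are not flagged, while the word with the Cyrillic letter is.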

Here is another maintained Python module, with information on the origin of the list of homoglyphs (the Unicode consortium itself):

DimitriPapadopoulos commented 2 years ago

Thinking about it again, we wouldn't be able to "fix" non-Latin characters that are not in the list of homoglyphs. We just need to look for words that contain both ASCII characters and one of the homoglyphs.
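A small Python sketch of that narrower check, under stated assumptions: `HOMOGLYPHS` is a tiny hand-picked table for illustration (a real one would be generated from a proper list), and `suggest_fix` is a hypothetical helper, not an existing codespell function.

```python
# Hand-picked illustrative entries only, not a complete homoglyph list.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def suggest_fix(word):
    """Return the ASCII spelling if the word mixes ASCII letters with
    known homoglyphs, otherwise None."""
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in word)
    has_homoglyph = any(ch in HOMOGLYPHS for ch in word)
    if not (has_ascii_letter and has_homoglyph):
        return None
    # Replace only characters in the table; everything else is kept as-is.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)


fixed = suggest_fix("p\u0430ssword")            # Cyrillic "a" inside an ASCII word
untouched = suggest_fix("\u0441\u043e\u0440")   # purely Cyrillic, no ASCII letters
```

Words written entirely in another script, and words with accented Latin letters such as Zoë, return `None` and are left untouched.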