Open arm-in opened 3 years ago
Python module homoglyphs may help generate the list of homoglyphs.
Because the above module is no longer maintained, generating the list of homoglyphs at build time seems good enough, and is faster and simpler than doing it at run time.
Good catch! This may save quite some effort.
Raising an alert on words that contain both ASCII and non-Latin characters might be a good idea. I cannot imagine a common use for such words, except software precisely dealing with homoglyphs.
On the other hand, raising an alert on words with both ASCII and non-ASCII Latin characters might be a bad idea. I bet many source files out there contain words written with both ASCII and non-ASCII Latin characters (accented characters), at the very least author names: Günter, Zoë, Édouard, Tomáš, etc.
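A rough sketch of such a check, in case it helps the discussion. It assumes that "Latin" can be approximated by a Unicode character name starting with `LATIN`, which is a heuristic rather than a proper script lookup, and the function names are made up for illustration; nothing here is existing codespell code.

```python
import unicodedata


def is_latin_or_non_letter(ch: str) -> bool:
    """True for non-letters and for letters whose Unicode name starts with LATIN."""
    if not ch.isalpha():
        return True
    # Heuristic: accented letters such as U+00FC are named "LATIN SMALL LETTER ...".
    return unicodedata.name(ch, "").startswith("LATIN")


def mixes_ascii_and_non_latin(word: str) -> bool:
    """True if the word has at least one ASCII letter and one non-Latin letter."""
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in word)
    has_non_latin = any(not is_latin_or_non_letter(ch) for ch in word)
    return has_ascii_letter and has_non_latin


# Accented author names pass; a Cyrillic letter inside an ASCII word is flagged.
for word in ("G\u00fcnter", "Tom\u00e1\u0161", "p\u0430ypal"):
    print(word, mixes_ascii_and_non_latin(word))
```

With this heuristic, words written entirely in a non-Latin script are not flagged either, since they contain no ASCII letter.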
Here is another maintained Python module, with information on the origin of the list of homoglyphs (the Unicode consortium itself):
Thinking about it again, we wouldn't be able to "fix" non-Latin characters that are not in the list of homoglyphs. We just need to look for words that contain both ASCII characters and one of the homoglyphs.
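Something along these lines might be enough. The table below is a tiny hand-written sample, not a generated list, and `HOMOGLYPHS` and `suggest_fix` are hypothetical names, not part of codespell.

```python
from typing import Optional

# Tiny illustrative table; a real one would be generated from a full homoglyph list.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def suggest_fix(word: str) -> Optional[str]:
    """Return the ASCII spelling if the word mixes ASCII letters with known
    homoglyphs, otherwise None."""
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in word)
    has_homoglyph = any(ch in HOMOGLYPHS for ch in word)
    if not (has_ascii_letter and has_homoglyph):
        return None
    # Characters outside the table are left untouched.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)


print(suggest_fix("p\u0430ssword"))  # Cyrillic "а" replaced by ASCII "a"
print(suggest_fix("G\u00fcnter"))    # no homoglyph from the table, so None
```

Accented Latin letters are simply not in the table, so names with accents are left alone.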
There exist certain letters in Latin, Greek, Cyrillic, etc. that look the same, but have a different representation in Unicode.
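To make that concrete, here is a small snippet that prints the code point and Unicode name of three capital letters that usually render alike. The helper name `describe` is only for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX UNICODE NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin A, Cyrillic A and Greek Alpha: same shape, three different code points.
for ch in ("\u0041", "\u0410", "\u0391"):
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+0410 CYRILLIC CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
```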
The dictionaries of codespell are either British English or American English, so ASCII should be enough. Arguably, some words of French or Spanish origin might have accents: https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin https://en.wikipedia.org/wiki/List_of_English_words_of_Spanish_origin
To my knowledge, no English word contains Cyrillic or Greek letters. Instead of adding lots of words with bogus letters to the dictionary, there should be a filter in codespell to indicate / correct those.
Explanation: https://en.wikipedia.org/wiki/IDN_homograph_attack
Unicode table: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024&number=128&names=-
Codespell example: see #2001