Closed SimonGreenhill closed 7 years ago
They won't solve all problems, but we should include them in the pre-screening step of our normalization. Although I tend to encode more sounds as an "alias" than to capture them by normalization (e.g., "th" is an alias for "t+superscript_h"), I think normalizing those ugly look-alike characters is a must: the capital glottal stop, the one you show there (which is also ugly and caused me some pain in the past), etc.
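To make the distinction between pre-screening and aliases concrete, here is a minimal sketch in Python. The table contents and the names `LOOKALIKES`, `ALIASES`, `prescreen` and `resolve` are made up for illustration and are not the project's actual list or API; the only assumption is that look-alikes are fixed character by character first and that aliases are looked up afterwards on the whole segment.

```python
import unicodedata

# Hypothetical look-alike table: characters that render (almost) identically
# to the intended IPA symbol. Illustrative entries only.
LOOKALIKES = {
    "\u0241": "\u0294",  # LATIN CAPITAL LETTER GLOTTAL STOP -> LATIN LETTER GLOTTAL STOP
    ":": "\u02d0",       # ASCII colon -> MODIFIER LETTER TRIANGULAR COLON (length mark)
    "g": "\u0261",       # ASCII g -> LATIN SMALL LETTER SCRIPT G
}

# Hypothetical alias table: whole segments that stand for another segment.
ALIASES = {
    "th": "t\u02b0",     # "th" as an alias for t + MODIFIER LETTER SMALL H
}


def prescreen(segment):
    """Apply Unicode NFD, then replace look-alike characters one by one."""
    segment = unicodedata.normalize("NFD", segment)
    return segment.translate(str.maketrans(LOOKALIKES))


def resolve(segment):
    """Pre-screen a segment, then map it through the alias table if listed."""
    cleaned = prescreen(segment)
    return ALIASES.get(cleaned, cleaned)
```

With these tables, `resolve("th")` goes through the alias lookup, while a capital glottal stop is already repaired in `prescreen` before any alias is consulted.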
I also think it is a good idea to start from lingpy's collection, which we have not done so far; it also lists some of the problematic characters.
So this will go into the initial normalization, where we should think about changing the format, as it is just two columns for now:
Just for reference, we have now moved this to clpn/multicode.
With "multicode" and the new normalization procedure, we can easily plug this in. For now, I think the current list we have is enough, but this should be considered for later versions.
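As a rough idea of what plugging this in might involve, the sketch below reads data in the layout of the Unicode confusables.txt file discussed in this thread: one mapping per line, fields separated by semicolons, a source code point in hex, one or more target code points in hex, and `#` starting a comment. The function name `parse_confusables` and the sample lines are assumptions for illustration; the samples only mimic the layout and are not copied from the 9.0.0 file, and nothing here is part of "multicode".

```python
def parse_confusables(lines):
    """Build a {source_char: target_string} dict from confusables-style lines."""
    mapping = {}
    for line in lines:
        # Drop a leading byte-order mark, trailing comments and whitespace.
        line = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a mapping line
        source = chr(int(fields[0], 16))
        # The target may be a sequence of several code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


# Sample lines in the same layout (illustrative, not taken from the real file).
SAMPLE = [
    "# comment line",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
    "006D ;\t0072 006E ;\tMA\t# m -> r, n",
    "",
]

table = parse_confusables(SAMPLE)
```

The resulting dict could then be filtered down to the characters that actually occur in our data before being merged into the existing two-column list, since most of the file concerns scripts we never see.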
Could the Unicode specification for 'confusables' help us resolve some matches?
http://www.unicode.org/Public/security/9.0.0/confusables.txt
e.g. some I could see happening are: