cldf-clts / clts-legacy

Cross-Linguistic Transcription Systems
Apache License 2.0

Unicode 'confusables' list? #6

Closed SimonGreenhill closed 7 years ago

SimonGreenhill commented 7 years ago

Could the Unicode specification for 'confusables' help us resolve some matches?

http://www.unicode.org/Public/security/9.0.0/confusables.txt

For example, some confusions I could see happening are:

02A4 ;  0064 021D ; MA  # ( ʤ → dȝ ) LATIN SMALL LETTER DEZH DIGRAPH → LATIN SMALL LETTER D, LATIN SMALL LETTER YOGH    # →dʒ→
04D9 ;  01DD ;  MA  # ( ә → ǝ ) CYRILLIC SMALL LETTER SCHWA → LATIN SMALL LETTER TURNED E   # 

LinguList commented 7 years ago

They won't solve all problems, but we should include them in the pre-screening step of our normalization process. Although I tend to encode more sounds as an "alias" than to capture them by normalization (e.g., "th" is an alias for "t+superscript_h"), I think normalizing those ugly look-alike characters is a must: things like the capital glottal stop, or the one you show there, which is also ugly (and has caused me some pain in the past).
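A minimal sketch of what such a pre-screening step might look like, assuming the `source ; target ; type # comment` layout of the two lines quoted above. The function names `parse_confusables` and `prescreen` are hypothetical and not part of clts or lingpy; the sample data only abbreviates the comments of the two quoted entries.

```python
# Two entries in the layout of the confusables.txt excerpt quoted in this issue.
SAMPLE = (
    "02A4 ;\t0064 021D ;\tMA\t# ( ʤ → dȝ ) DEZH DIGRAPH → D, YOGH\n"
    "04D9 ;\t01DD ;\tMA\t# ( ә → ǝ ) CYRILLIC SCHWA → LATIN TURNED E\n"
)


def parse_confusables(text):
    """Build a {source character: target string} mapping (hypothetical helper)."""
    mapping = {}
    for line in text.splitlines():
        # Drop the trailing comment, then skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue
        # Each field is a space-separated list of hexadecimal code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def prescreen(segment, mapping):
    """Replace every character that has an entry in the mapping."""
    return "".join(mapping.get(char, char) for char in segment)


mapping = parse_confusables(SAMPLE)
# The Cyrillic schwa (U+04D9) is replaced by the Latin turned e (U+01DD).
normalized = prescreen("\u04d9", mapping)
```

Note that the first entry maps ʤ to "d" plus yogh (U+021D) rather than to "d" plus ezh, so a list like this would probably need to be curated before it is used for transcription data.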

I also think it is a good idea to start from lingpy's collection, which we have not done so far, but which also lists some of the problematic characters.

So this will go into the initial normalization, where we should think about changing the format, as it is just two columns for now:

LinguList commented 7 years ago

Just for reference, we have now moved this to clpn/multicode.

LinguList commented 7 years ago

With "multicode" and the new normalization procedure, we can easily plug this in. For now, I think the current list we have is enough, but this should be considered for later versions.