Open Yorwba opened 5 years ago
And GitHub has a character limit on the issue text, so here's the second part:
[ ] ß → ss í → i̇́ İ → i̇ ẞ → ss Affects: Danish [dan], Kölsch [ksh], Swabian [swg], Low German (Low Saxon) [nds], Esperanto [epo], Crimean Tatar [crh], Venetian [vec], Galician [glg], Unknown Language, Latin [lat], Hungarian [hun], Arabic [ara], Hindi [hin], Bavarian [bar], French [fra], Hebrew [heb], Italian [ita], Ottoman Turkish [ota], Czech [ces], Mandarin Chinese [cmn], Japanese [jpn], Dutch [nld], Turkish [tur], Ido [ido], Slovenian [slv], Talossan [tzl], Finnish [fin], Berber [ber], Afrikaans [afr], English [eng], Spanish [spa], Lithuanian [lit], Talysh [tly], Zaza [zza], Russian [rus], Kabyle [kab], Interlingua [ina], Polish [pol], Portuguese [por], Toki Pona [toki], German [deu], Basque [eus], Tatar [tat]
[ ] ᾄ → ἄι ᾐ → ἠι ᾔ → ἤι ᾕ → ἥι ᾖ → ἦι ᾗ → ἧι ᾧ → ὧι ᾳ → αι ᾷ → αι ᾷ → ᾶι ῂ → ὴι ῃ → ηι ῄ → ήι ῇ → ῆι ῞ → ̔́ ῳ → ωι ῴ → ώι ῷ → ῶι Affects: Ancient Greek [grc], Greek [ell]
[ ] ﷺ → صلىاللهعليهوسلم Affects: Turkish [tur]
[ ] № → no ™ → tm Affects: Russian [rus], Kazakh [kaz], Belarusian [bel], Bulgarian [bul], Meadow Mari [mhr], French [fra], Tatar [tat], English [eng], Spanish [spa]
[ ] ¼ → 14 ½ → 12 ⅓ → 13 Affects: Danish [dan], English [eng], German [deu]
[ ] Mή → mη Ẹ̀ → ẹ̀ Άι → αϊ Άσ → ας Έί → εϊ Έι → εϊ Βή → βη Ζή → ζη Λή → λη Μή → μη Μῆ → μη Νή → νη Πή → πη Ρή → ρη Σή → ση Τὴ → τη Χή → χη Ψή → ψη άι → αϊ άσ → ας άυ → αϋ έι → εϊ έσ → ες έυ → εϋ ήµ → ημ ήι → ηϊ ίσ → ις όι → οϊ ύι → υϊ ώι → ωϊ ώσ → ως ᾷς → ᾶις ῃς → ηις ῇς → ῆις Affects: Ancient Greek [grc], Greek [ell]
[x] Ά → ά · → έ Έ → ή Ή → ί Ό → ό Ύ → ώ Affects: Greek [ell]
[x] J → i U → v W → v j → i u → v w → v á → a é → e í → i ó → o Ā → a ā → a Ē → e ē → e ĕ → e Ī → i ī → i ĭ → i ō → o Ū → v ū → v Affects: Latin [lat]
[x] ł → Ń Affects: Polish [pol], Navajo [nav], Lower Sorbian [dsb], Esperanto [epo], Mandarin Chinese [cmn], Upper Sorbian [hsb], German [deu], English [eng], Belarusian [bel], Kashubian [csb], Dutch [nld], Danish [dan], Bavarian [bar], Hungarian [hun], Spanish [spa], Berber [ber], Slovak [slk], Portuguese [por], Italian [ita], Indonesian [ind]
[x] ם → מ ף → פ Affects: Hebrew [heb], Yiddish [yid]
[x] ņ → Ň Affects: Esperanto [epo], Lithuanian [lit], Latvian [lvs], English [eng], French [fra], Livonian [liv], Unknown Language, Portuguese [por], Italian [ita]
[x] H → ' h → ' Affects: Lojban [jbo]
[x] ĺ → Ļ Affects: Spanish [spa], Slovak [slk], Danish [dan], Hungarian [hun]
[x] ľ → Ŀ Affects: Czech [ces], Slovak [slk], Veps [vep], Romani [rom]
[x] Â → a â → a î → ı û → u Affects: Turkish [tur]
[x] Ơ → ơ Affects: Vietnamese [vie]
[x] ך → כ Affects: Hebrew [heb], Yiddish [yid], Old Aramaic [oar]
[x] È → è Affects: Yoruba [yor]
[x] ń → Ņ Affects: Polish [pol], Wolof [wol], Lower Sorbian [dsb], Esperanto [epo], Upper Sorbian [hsb], Yoruba [yor], German [deu], English [eng], Belarusian [bel], Hungarian [hun], Spanish [spa], Berber [ber], Slovak [slk]
[x] ץ → צ Affects: English [eng], Hebrew [heb], Yiddish [yid]
[x] ן → נ Affects: English [eng], Yiddish [yid], Old Aramaic [oar], Hebrew [heb], Ladino [lad]
[x] ļ → Ľ Affects: Unknown Language, Lithuanian [lit], Livonian [liv], Latvian [lvs]
[x] ň → ʼn Affects: Czech [ces], Slovak [slk], Turkmen [tuk], Romani [rom]
[ ] I → i Affects: Azerbaijani [aze]
[ ] İ → ı Affects: Ottoman Turkish [ota], Zaza [zza], Talysh [tly], Tatar [tat], English [eng], Azerbaijani [aze], Ido [ido], Dutch [nld], Venetian [vec], Crimean Tatar [crh]
[x] ՛ ՜ ՝ ՞ ՟ ։ Affects: Armenian [hye]
[ ] 〈 〉 【 】 〔 〕 〜 Affects: Japanese [jpn]
[ ] 「 」 Affects: Mandarin Chinese [cmn], Cantonese [yue], Japanese [jpn], Ainu [ain], Korean [kor], Literary Chinese [lzh], Russian [rus], Shanghainese [wuu]
[ ] 』 Affects: Ancient Greek [grc], Mandarin Chinese [cmn], Cantonese [yue], Japanese [jpn], Literary Chinese [lzh]
[ ] (IDEOGRAPHIC SPACE) Affects: Mandarin Chinese [cmn], German [deu], English [eng], Japanese [jpn], Turkish [tur], Ainu [ain], Literary Chinese [lzh]
[ ] _ Affects: Polish [pol], Finnish [fin], Uyghur [uig], English [eng], Japanese [jpn], Dutch [nld], Spanish [spa], Portuguese [por], Russian [rus], Bulgarian [bul], Esperanto [epo], Macedonian [mkd], Swedish [swe], Tatar [tat], Hungarian [hun], Italian [ita], Arabic [ara], Mandarin Chinese [cmn], German [deu], French [fra], Czech [ces], Berber [ber], Georgian [kat], Serbian [srp], Kabyle [kab], Belarusian [bel], Basque [eus], Turkish [tur]
[ ] 『 Affects: Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn], Cantonese [yue]
[ ] 《 》 Affects: Literary Chinese [lzh], Mandarin Chinese [cmn], Cantonese [yue], Shanghainese [wuu]
[ ] ・ Affects: English [eng], Japanese [jpn]
[ ] 。 Affects: Hakka Chinese [hak], Xiang Chinese [hsn], Bulgarian [bul], Chinese (Jin) [cjy], Mandarin Chinese [cmn], Cantonese [yue], Japanese [jpn], Sumerian [sux], Gan Chinese [gan], Irish [gle], Ainu [ain], Korean [kor], Literary Chinese [lzh], Lojban [jbo], Min Nan Chinese [nan], Chavacano [cbk], Italian [ita], Shanghainese [wuu]
[ ] $ Affects: Polish [pol], Finnish [fin], Marathi [mar], Lingua Franca Nova [lfn], Bengali [ben], CycL [cycl], English [eng], Japanese [jpn], Hindi [hin], Ilocano [ilo], Dutch [nld], Danish [dan], Spanish [spa], Portuguese [por], Russian [rus], Turkmen [tuk], Maltese [mlt], Esperanto [epo], Ukrainian [ukr], Tagalog [tgl], Hebrew [heb], Italian [ita], Catalan [cat], Greek [ell], German [deu], French [fra], Romanian [ron], Berber [ber], Interlingua [ina], Estonian [est], Georgian [kat], Kabyle [kab], Belarusian [bel], Turkish [tur], Indonesian [ind]
[ ] ၌ ၍ ၏ Affects: Burmese [mya]
[x] ' Affects: Lojban [jbo]
[ ] 、 Affects: Mandarin Chinese [cmn], Cantonese [yue], Japanese [jpn], Ainu [ain], Spanish [spa], Literary Chinese [lzh], Italian [ita], Shanghainese [wuu]
[ ] · Affects: Greek [ell]
And the third:
[ ] ؠ ً ٌ ٍ َ ُ ِ ّ ْ ٓ ٔ ٕ ٖ ٗ ٘ ٚ ٛ ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ٰ ۜ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ Affects: Algerian Arabic [arq], Egyptian Arabic [arz], Unknown Language, Swedish [swe], Iraqi Arabic [acm], Hungarian [hun], Persian [pes], Arabic [ara], Ottoman Turkish [ota], German [deu], North Levantine Arabic [apc], Tatar [tat], Gulf Arabic [afb], English [eng], Urdu [urd], Kashmiri [kas], Punjabi (Western) [pnb]
[ ] à ã è ê ë ì ï ð ñ ò ô ù ü ý Affects: Latin [lat], Turkish [tur]
[x] ἂ ἃ ἅ ἣ ἳ ἴ ἵ ἷ ὓ ὖ ὗ ὢ ὣ ὥ ᾱ ῑ ῡ ῥ Affects: Ancient Greek [grc]
[x] ҃ ꙗ Affects: Old East Slavic [orv]
[x] 𒀀 𒀉 𒀊 𒀕 𒀖 𒀜 𒀝 𒀠 𒀪 𒀭 𒀲 𒀳 𒀴 𒀸 𒀾 𒁀 𒁄 𒁇 𒁉 𒁍 𒁕 𒁮 𒁯 𒁲 𒁳 𒁶 𒁹 𒁺 𒁻 𒁾 𒂊 𒂍 𒂗 𒂠 𒂦 𒂵 𒂷 𒂼 𒃮 𒃲 𒃶 𒃸 𒃻 𒄀 𒄄 𒄑 𒄘 𒄠 𒄢 𒄦 𒄨 𒄩 𒄭 𒄯 𒄰 𒄴 𒄷 𒄾 𒄿 𒅁 𒅅 𒅆 𒅇 𒅍 𒅎 𒅔 𒅗 𒅘 𒅥 𒅴 𒆕 𒆗 𒆜 𒆟 𒆠 𒆪 𒆬 𒆳 𒆷 𒇇 𒇉 𒇯 𒇳 𒇴 𒇷 𒇻 𒇽 𒈜 𒈝 𒈠 𒈣 𒈤 𒈧 𒈨 𒈪 𒈫 𒈬 𒈭 𒈾 𒉆 𒉈 𒉌 𒉘 𒉡 𒉪 𒉺 𒉽 𒉿 𒊏 𒊑 𒊒 𒊕 𒊩 𒊬 𒊭 𒊮 𒊷 𒋀 𒋃 𒋗 𒋛 𒋢 𒋤 𒋧 𒋫 𒋺 𒋻 𒋼 𒋾 𒌀 𒌅 𒌆 𒌇 𒌈 𒌉 𒌋 𒌌 𒌍 𒌒 𒌓 𒌝 𒌤 𒌦 𒌨 𒌶 𒌷 𒍂 𒍇 𒍜 𒍝 𒍠 𒍢 𒍣 𒍪 𒍼 𒐈 𒐊 𒐋 𒐼 𒑂 𒑄 𒑆 𒑏 Affects: Unknown Language, Sumerian [sux]
[ ] ֑ ֔ ֖ ֗ ֘ ֙ ֝ ֡ ֣ ֤ ֥ ֨ ֪ ְ ֱ ֲ ֳ ִ ֵ ֶ ַ ָ ֹ ֻ ּ ֽ ֿ ׁ ׂ ׇ ﬞ Affects: Unknown Language, Old Aramaic [oar], Hebrew [heb], Jewish Babylonian Aramaic [tmr], Yiddish [yid], English [eng], Ladino [lad]
[ ] ៗ ៝ Affects: Central Mnong [cmo], Khmer [khm]
[x] ຽ ໆ Affects: Lao [lao]
[x] ʻ ʼ ʿ ˀ ˈ ˌ ː Affects: English [eng], Tongan [ton], Spanish [spa], Russian [rus], Esperanto [epo], Ngeq [ngt], Hawaiian [haw], Ukrainian [ukr], Hebrew [heb], Italian [ita], Samoan [smo], Cayuga [cay], Breton [bre], Tahitian [tah], German [deu], French [fra], Uzbek [uzb], Navajo [nav], Ancient Greek [grc], Kabyle [kab], Niuean [niu], Belarusian [bel]
[x] ൺ ൻ ർ ൽ ൾ Affects: Malayalam [mal]
[x] ᠠ ᠨ ᠩ ᠪ ᠮ ᠰ ᡝ ᡠ ᡤ ᡥ ᡩ ᡳ ᡵ Affects: Manchu [mnc]
[x] 𑣁 𑣂 𑣅 𑣈 𑣋 𑣌 𑣓 𑣖 𑣗 𑣘 𑣙 𑣜 Affects: Ho [hoc]
[x] 𐰀 𐰃 𐰆 𐰇 𐰉 𐰋 𐰍 𐰓 𐰕 𐰖 𐰘 𐰚 𐰞 𐰢 𐰣 𐰲 𐰸 𐰺 𐰼 𐰾 𐱃 𐱅 Affects: Old Turkish [otk]
[x] 𐌰 𐌱 𐌲 𐌳 𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍂 𐍃 𐍄 𐍅 𐍆 𐍈 𐍉 Affects: Gothic [got]
[x] ꦁ ꦂ ꦃ ꦏ ꦒ ꦔ ꦕ ꦗ ꦚ ꦠ ꦡ ꦢ ꦣ ꦤ ꦥ ꦧ ꦩ ꦪ ꦫ ꦭ ꦮ ꦰ ꦱ ꦲ ꦴ ꦶ ꦸ ꦺ ꦼ ꧀ Affects: Javanese [jav]
[x] ꀁ ꀃ ꀐ ꀕ ꁧ ꂘ ꂯ ꃀ ꆍ ꆏ ꆹ ꇩ ꇬ ꇿ ꈍ ꉡ ꉬ ꊿ ꋋ ꋙ ꋠ ꌕ ꍏ ꏃ ꐥ ꑋ ꑍ ꑬ Affects: Unknown Language
[x] ㇰ ㇱ ㇷ ㇻ ㇼ ㇽ ㇾ ㇿ Affects: Ainu [ain]
[ ] ︎ Affects: Japanese [jpn]
[ ] ᜀ ᜃ ᜄ ᜅ ᜆ ᜇ ᜈ ᜉ ᜊ ᜋ ᜌ ᜎ ᜏ ᜐ ᜑ ᜒ ᜓ ᜔ Affects: Tagalog [tgl]
[ ] 𓀀 𓀁 𓀻 𓁐 𓂋 𓂜 𓂸 𓂻 𓃀 𓄖 𓄤 𓄿 𓅓 𓅨 𓅱 𓆑 𓆓 𓆼 𓇋 𓇌 𓇛 𓈖 𓈗 𓈞 𓊄 𓊪 𓋴 𓌻 𓍿 𓎛 𓎟 𓎡 𓎢 𓏌 𓏏 𓏥 𓏫 𓏭 𓏲 𓏶 Affects: Unknown Language
[x] (SOFT HYPHEN) ́ Affects: Russian [rus], Latin [lat]
[x] (SOFT HYPHEN) ְ ֱ ֲ ֳ ִ ֵ ֶ ַ ָ ֹ ֺ ֻ ּ ֽ ־ ֿ ׀ ׁ ׂ ׃ ׄ ׅ ׇ Affects: All Languages [all]
The amount of work you've put into researching this is truly stunning. Are you planning to implement any changes regarding these suggestions?
It's about time I started fixing issues rather than just piling them up.
Fortunately, this one only affects a well-delineated part of the code base, so I can work on it without having to figure out how the rest of it fits together. Well, except for everything involving multiple codepoints, which will require Unicode normalization (either NFC or NFKC) to happen at some point.
It's great that you're planning to do this work yourself. If you were going to ask someone else to do it, you would probably have to break it up and/or scale it down.
@Yorwba What’s the status of this issue? Are there some remaining mappings that you want to implement? Is the status of checkboxes relevant?
By the way, Andreas already Unicode-normalized the sentences, but I think we still need to normalize the query text that is sent to Manticore when searching.
The checkboxes reflect where I've created a PR or decided that the current behavior probably doesn't need to be changed.
I'd been working my way up the list of "Other Unsearchable Characters", because those were mostly scripts that were missing entirely. Once I hit the missing characters from the Arabic script, things got a bit complicated. I asked a few speakers of affected languages which behavior they'd prefer and got feedback regarding Arabic, Persian and Ottoman Turkish. In the case of Arabic vowel marks, the ideal behavior would be that a word with vowel marks matches one without, but not another word with different vowel marks. But that's not a transitive relation, so it can't be implemented with a simple index lookup. Also, the set of characters that are considered equivalent is different across different languages. Parts of this would probably best be handled by a stemmer. Since we now have stemming for Arabic, the situation should have improved a bit, but I need to check. See also issues #1595 (Arabic) and #1880 (Ottoman Turkish).
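To make the non-transitivity concrete, here is a minimal Python sketch. It assumes that "vowel marks" can be approximated by the Unicode general category Mn (nonspacing marks) and that a fully unmarked word should match any marked variant. The function names are hypothetical and not part of the Tatoeba code base.

```python
import unicodedata

def strip_marks(word: str) -> str:
    # Drop nonspacing marks (category Mn), which includes fatha, damma and kasra.
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

def ideal_match(a: str, b: str) -> bool:
    # Identical words match; otherwise the bare letters must agree and
    # at least one side must carry no marks at all.
    if a == b:
        return True
    if strip_marks(a) != strip_marks(b):
        return False
    return a == strip_marks(a) or b == strip_marks(b)

bare = "\u0643\u062a\u0628"                      # the three letters without vowel marks
kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"  # same letters, marked fatha-fatha-fatha
kutiba = "\u0643\u064f\u062a\u0650\u0628\u064e"  # same letters, marked damma-kasra-fatha

print(ideal_match(kataba, bare))    # True
print(ideal_match(bare, kutiba))    # True
print(ideal_match(kataba, kutiba))  # False
```

The first word matches the bare form and the bare form matches the third word, yet the first and third do not match each other. A single normalized index key cannot reproduce this, because equality of keys is always transitive.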
Unicode normalization to NFC would take care of all duplicate encodings in one fell swoop, but it seems like the cleaning function hasn't been applied to sentences already in the database. E.g. the Khách in https://tatoeba.org/eng/sentences/show/7027190 still has an á composed of two codepoints.
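As an illustration of the duplicate-encoding problem, the following Python sketch uses the standard unicodedata module to compose a decomposed á and to flag strings that are not yet in NFC. The helper name and the sample ids are made up for this example; this is not the site's actual cleaning function.

```python
import unicodedata

def not_in_nfc(sentences):
    # Return the ids of sentences whose text changes under NFC normalization.
    return [sid for sid, text in sentences.items()
            if unicodedata.normalize("NFC", text) != text]

decomposed = "Kha\u0301ch"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 6 5
print(composed == "Kh\u00e1ch")        # True

sample = {1: decomposed, 2: composed}  # hypothetical sentence ids
print(not_in_nfc(sample))              # [1]
```

A check like this could be run over an export to see how many existing sentences would still change under NFC.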
Then there's the issue of near duplicates that only become identical under NFKC. Unicode considers those compatibility equivalents rather than canonical equivalents: they may have a different appearance and are sometimes used differently. Some of them also normalize to multiple codepoints, e.g. the Dutch ĳ (one codepoint) vs. ij (two codepoints). I don't think we can store sentences in NFKC, because that would erase too much information, but maybe a normalization step could be inserted somewhere in the Manticore pipeline. (Probably not that easy.)
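A rough Python sketch of that idea, assuming the stored sentence stays in NFC and only a derived search key is passed through NFKC. The search_key helper is hypothetical, and how such a step would hook into Manticore is not shown here.

```python
import unicodedata

def search_key(text: str) -> str:
    # Compatibility-normalize a copy of the text for indexing or querying.
    return unicodedata.normalize("NFKC", text)

ligature = "\u0133s"   # Dutch "ijs" written with the single-codepoint ligature U+0133
plain = "ijs"          # the same word written with two separate letters

print(unicodedata.normalize("NFC", ligature) == plain)  # False: NFC keeps the ligature
print(search_key(ligature) == search_key(plain))        # True

# NFKC is lossy, which is why it is unsuitable for the stored text:
print(search_key("x\u00b2"))  # x2
```

Applying the same function to both the indexed text and the query would make the two spellings find each other while the displayed sentence keeps its original characters.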
Some of the other parts aren't that technically difficult, but I'm not sure what the best option is. E.g. punctuation is less problematic for languages that use the ngram_chars mechanism, which is more robust to the presence of additional characters. Maybe some people actually want to be able to search for punctuation?
I do plan to eventually get this issue fully cleaned up, but I don't have a specific timeline planned.
I just started working with the data and stumbled upon this Unicode normalization problem. Along the way I wrote a simple script that detects duplicate sentences. I hope it is helpful.
#!/usr/bin/env bash
set -Eeuo pipefail # https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/#:~:text=set%20%2Du,is%20often%20highly%20desirable%20behavior.
set -x # print all commands
shopt -s expand_aliases
export LC_ALL=en_US.UTF-8
# https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
# https://www.effectiveperlprogramming.com/2011/09/normalize-your-perl-source/
alias nfd="perl -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"
# Normalize different unicode space characters to the same space
# https://stackoverflow.com/a/43640405
alias normalize_spaces="perl -CSDA -plE 's/[^\\S\\t]/ /g'"
function normalize_unicode() {
cat - | normalize_spaces | nfd
}
OUT="out"
TRANS_OUT="$OUT/translations"
mkdir -p "$TRANS_OUT"
# https://tatoeba.org (Multilingual collaborative sentence translation database)
# https://tatoeba.org/eng/downloads
(cd "$TRANS_OUT"; wget --no-verbose --show-progress --timestamping "https://downloads.tatoeba.org/exports/sentences.tar.bz2")
(cd "$TRANS_OUT"; wget --no-verbose --show-progress --timestamping "https://downloads.tatoeba.org/exports/links.tar.bz2")
[ -s "$TRANS_OUT/sentences.tsv" ] || (tar xOjf "$TRANS_OUT/sentences.tar.bz2" sentences.csv | normalize_unicode > "$TRANS_OUT/sentences.tsv")
[ -s "$TRANS_OUT/links.tsv" ] || (tar xOjf "$TRANS_OUT/links.tar.bz2" links.csv | normalize_unicode > "$TRANS_OUT/links.tsv")
SQLITEDB="$TRANS_OUT/translations.sqlite"
if [ ! -s "$SQLITEDB" ]; then
# some sentences referenced by links might be invalid. That's ok, because some sentences were deduplicated, for example https://tatoeba.org/eng/sentences/show/3094
rm -f "$SQLITEDB"
cat << EOF | sqlite3 -batch "$SQLITEDB"
.bail on
PRAGMA foreign_keys = ON;
SELECT "importing all sentences...";
CREATE TABLE sentences(
sentenceid INTEGER NOT NULL PRIMARY KEY,
lang TEXT NOT NULL,
sentence TEXT NOT NULL
);
CREATE INDEX sentences_sentence_lang ON sentences (lang);
CREATE INDEX sentences_sentence_sentence ON sentences (sentence);
.mode ascii
.separator "\t" "\n"
.import '$TRANS_OUT/sentences.tsv' sentences
SELECT "importing all links...";
CREATE TABLE links(
sentenceid INTEGER NOT NULL,
translationid INTEGER NOT NULL,
PRIMARY KEY (sentenceid, translationid),
FOREIGN KEY (sentenceid)
REFERENCES sentences (sentenceid)
ON UPDATE CASCADE
ON DELETE CASCADE,
FOREIGN KEY (translationid)
REFERENCES sentences (sentenceid)
ON UPDATE CASCADE
ON DELETE CASCADE
);
.mode ascii
.separator "\t" "\n"
.import '$TRANS_OUT/links.tsv' links
.headers off
SELECT "vacuum...";
VACUUM;
SELECT "checking database integrity...";
PRAGMA integrity_check;
EOF
fi
# translate single sentence
# select * from sentences s JOIN links l ON s.sentenceid = l.sentenceid JOIN sentences s2 ON s2.sentenceid = l.translationid where s.sentence='D''accord.' and s2.lang = 'deu' limit 10;
# finds duplicate sentences
sqlite3 -batch "$SQLITEDB" "select sentence, GROUP_CONCAT(sentenceid) from sentences GROUP BY sentence,lang HAVING COUNT(sentenceid) > 1 LIMIT 50"
Wall thread: https://tatoeba.org/eng/wall/show_message/33106#message_33106
Alan suggested I create a GitHub ticket and add more information, so here it is. I used spoilers to hide most of the gruesome details by default and added checkboxes to each group of characters so we have some chance of keeping track of the progress that will hopefully be made.
I'd like to apologize to anyone who receives an email notification about this in a client that doesn't support the spoiler tags.
EDIT: Since GitHub doesn't like it when people post huge amounts of text in the issue description, I had to abbreviate a bit. (ex:6674905,uses:16) refers to a character appearing in 16 different sentences, one of which is 6674905.
Duplicate Encodings a.k.a. Unicode NFC
[x]
ά → ά έ → έ ή → ή ί → ί ό → ό ύ → ύ ώ → ώ Affects: Ancient Greek [grc]
[x]
不 → 不 粒 → 粒 行 → 行 Affects: Literary Chinese [lzh], Cantonese [yue]
Duplicate Encodings (multiple codepoints)
[x]
à → à á → á â → â ã → ã ä → ä ả → ả å → å ạ → ạ ć → ć ĉ → ĉ ç → ç è → è é → é ê → ê ẹ → ẹ ę → ę ĝ → ĝ ḥ → ḥ ì → ì í → í ỉ → ỉ ị → ị ĵ → ĵ ň → ň ò → ò ó → ó õ → õ ö → ö ỏ → ỏ ọ → ọ ǫ → ǫ ṛ → ṛ ŝ → ŝ ṣ → ṣ ş → ş ṭ → ṭ ù → ù ú → ú ũ → ũ ŭ → ŭ ü → ü ủ → ủ ụ → ụ ý → ý ẓ → ẓ ầ → ầ ấ → ấ ẫ → ẫ ậ → ậ ề → ề ế → ế ễ → ễ ệ → ệ ố → ố ỗ → ỗ ổ → ổ ằ → ằ ắ → ắ ẳ → ẳ ặ → ặ ờ → ờ ớ → ớ ở → ở ợ → ợ ừ → ừ ứ → ứ ữ → ữ ử → ử ự → ự Affects: Finnish [fin], Interlingue [ile], Spanish [spa], Turkmen [tuk], Russian [rus], Esperanto [epo], Swedish [swe], Yoruba [yor], Tatar [tat], Shuswap [shs], Hungarian [hun], Italian [ita], Lingala [lin], Cayuga [cay], French [fra], Vietnamese [vie], Berber [ber], Navajo [nav], Serbian [srp], Kabyle [kab], Turkish [tur]
[x]
й → й Affects: Bashkir [bak]
[x]
آ → آ أ → أ ؤ → ؤ Affects: Arabic [ara], Urdu [urd], Persian [pes]
[x]
ऱ → ऱ क़ → क़ ख़ → ख़ ग़ → ग़ ज़ → ज़ ड़ → ड़ ढ़ → ढ़ फ़ → फ़ Affects: Marathi [mar], Hindi [hin], Garhwali [gbm]
[x]
ড় → ড় ঢ় → ঢ় য় → য় Affects: Bengali [ben], Assamese [asm]
[x]
ਸ਼ → ਸ਼ ਖ਼ → ਖ਼ ਗ਼ → ਗ਼ ਜ਼ → ਜ਼ ਫ਼ → ਫ਼ Affects: Punjabi (Eastern) [pan]
[x]
ோ → ோ Affects: Tamil [tam]
[x]
ೀ → ೀ ೊ → ೊ ೋ → ೋ ೇ → ೇ Affects: Kannada [kan]
[x]
ോ → ോ Affects: Malayalam [mal]
[x]
יִ → יִ ײַ → ײַ שׂ → שׂ אַ → אַ אָ → אָ וּ → וּ כּ → כּ פּ → פּ תּ → תּ בֿ → בֿ כֿ → כֿ פֿ → פֿ Affects: Hebrew [heb], Yiddish [yid]
[x]
ָֹ → ָֹ ְּ → ְּ ֳּ → ֳּ ִּ → ִּ ֵּ → ֵּ ֶּ → ֶּ ַּ → ַּ ָּ → ָּ ֹּ → ֹּ ֻּ → ֻּ ְׁ → ְׁ ִׁ → ִׁ ֶׁ → ֶׁ ַׁ → ַׁ ָׁ → ָׁ ֹׁ → ֹׁ ֻׁ → ֻׁ ְׂ → ְׂ ִׂ → ִׂ ֵׂ → ֵׂ ָׂ → ָׂ ֹׂ → ֹׂ َّ → َّ ُّ → ُّ ِّ → ِّ ़् → ़् ့် → ့် Affects: Arabic [ara], Persian [pes], North Levantine Arabic [apc], Hindi [hin], Yiddish [yid], Hebrew [heb], Algerian Arabic [arq], Burmese [mya]
Near Duplicates a.k.a. Unicode NFKC
[x]
ª → a º → o Affects: Finnish [fin], Esperanto [epo], Lingua Franca Nova [lfn], German [deu], English [eng], Japanese [jpn], French [fra], Italian [ita], Turkish [tur], Danish [dan], Ukrainian [ukr], Spanish [spa], Interlingua [ina], Portuguese [por], Russian [rus]
[ ]
⁰ → 0 ⁸ → 8 ⁿ → n Affects: Danish [dan], Russian [rus], Portuguese [por], French [fra], German [deu], Finnish [fin], Esperanto [epo], Ukrainian [ukr], Japanese [jpn], English [eng], Choctaw [cho]
[ ]
₁ → 1 ₂ → 2 ₃ → 3 ₄ → 4 ₈ → 8 ₙ → n Affects: Danish [dan], Thai [tha], Esperanto [epo], Macedonian [mkd], Hungarian [hun], French [fra], Turkish [tur], Italian [ita], Czech [ces], Japanese [jpn], Dutch [nld], Finnish [fin], English [eng], Marathi [mar], Spanish [spa], Russian [rus], Kabyle [kab], Interlingua [ina], Portuguese [por], Welsh [cym], German [deu], Basque [eus], Ukrainian [ukr], Vietnamese [vie]
[ ]
① → 1 ② → 2 Affects: Japanese [jpn]
[ ]
𝑎 → a 𝑏 → b 𝑐 → c 𝑒 → e 𝑖 → i 𝑘 → k 𝑚 → m 𝑛 → n 𝑟 → r 𝑥 → x 𝑦 → y 𝘨 → g 𝜀 → ε 𝜋 → π Affects: Spanish [spa], Esperanto [epo], Russian [rus], German [deu]
[ ]
ℎ → h Affects: German [deu]
[ ]
ℵ → א Affects: German [deu]
[ ]
ʰ → h ʷ → w ⵯ → ⵡ Affects: Kabyle [kab], Waray [war], Berber [ber], English [eng], Khmer [khm], Ngeq [ngt]
[ ]
ſ → s Affects: Middle French [frm]
[ ]
ﮐ → ک ﺋ → ئ ﺎ → ا ﺣ → ح ﺹ → ص ﻊ → ع ﻋ → ع ﻞ → ل ﻠ → ل ﻣ → م ﻪ → ه Affects: Ottoman Turkish [ota]
[ ]
⺟ → 母 ⼀ → 一 ⾯ → 面 ⾷ → 食 Affects: Min Nan Chinese [nan]
[ ]
µ → μ Affects: Greek [ell]
Near Duplicates (multiple codepoints)
[ ]
ĳ → ij և → եւ ﬁ → fi ﻹ → لإ ﻻ → لا ﻼ → لا Affects: Arabic [ara], Armenian [hye], Ottoman Turkish [ota], Dutch [nld], Irish [gle]
[ ]
㌔ → キロ ㌘ → グラム Affects: Japanese [jpn]
[ ]
ำ → ํา Affects: Thai [tha]
[ ]
ໜ → ຫນ ໝ → ຫມ Affects: Lao [lao]
Case Alternatives a.k.a. fixed points under iterative application of Unicode NFKC, uppercasing and lowercasing using ICU
[ ]
H → h I → ı J → j U → u W → w Á → á Â → â Ä → ä Å → å É → é Ú → ú Ā → ā Č → č Ē → ē Ġ → ġ Ĥ → ĥ Ī → ī İ → i ı → i Ĵ → ĵ Ļ → ļ Ľ → ľ Ł → ł Ņ → ņ Ŝ → ŝ Ū → ū ℂ → c ℃ → c ℕ → n ℝ → r Ꞌ → ꞌ 𝐴 → a 𝐵 → b 𝐾 → k 𝑁 → n 𝑋 → x Affects: Polish [pol], Finnish [fin], Ottoman Turkish [ota], English [eng], Japanese [jpn], Kashmiri [kas], Ido [ido], Dutch [nld], Danish [dan], Spanish [spa], Lojban [jbo], Portuguese [por], Russian [rus], Turkmen [tuk], Bashkir [bak], Esperanto [epo], Old East Slavic [orv], Latvian [lvs], Croatian [hrv], Talysh [tly], Latin [lat], Tatar [tat], Hungarian [hun], Unknown Language, Italian [ita], Lower Sorbian [dsb], Greek [ell], Chamorro [cha], Zaza [zza], German [deu], French [fra], Kashubian [csb], Czech [ces], Berber [ber], Slovak [slk], Navajo [nav], Upper Sorbian [hsb], Azerbaijani [aze], Turkish [tur], Crimean Tatar [crh], Chuvash [chv]
[x]
Ԑ → ԑ Affects: Kabyle [kab]
[x]
¨ → ̈ ´ → ́ ˙ → ̇ ˚ → ̊ Affects: Finnish [fin], Guarani [grn], Low German (Low Saxon) [nds], English [eng], Dutch [nld], Spanish [spa], Portuguese [por], Esperanto [epo], Old Tupi [tpw], Ukrainian [ukr], Italian [ita], Catalan [cat], Greek [ell], Mandarin Chinese [cmn], German [deu], French [fra], Czech [ces], Berber [ber], Slovak [slk], Ancient Greek [grc], Turkish [tur], Occitan [oci]
[x]
𑢩 → 𑣉 𑢮 → 𑣎 𑢯 → 𑣏 Affects: Ho [hoc]
[x]
ͅ → ι ΄ → ́ Ά → α Έ → ε Ή → ή Ί → ι Ό → ο Ύ → υ Ώ → ω ΐ → ϊ ά → α έ → ε ί → ι ς → σ ό → ο ύ → υ ώ → ω ἀ → α ἁ → α ἄ → α Ἀ → ἀ Ἄ → α Ἄ → ἄ Ἆ → ἆ ἐ → ε ἔ → ε ἕ → ε Ἐ → ἐ Ἑ → ἑ Ἓ → ε Ἓ → ἓ Ἔ → ε Ἔ → ἔ ἠ → η ἡ → η ἦ → ή Ἡ → ἡ Ἢ → ἢ Ἥ → ἥ Ἦ → ἦ ἰ → ι ἱ → ι ἶ → ι Ἰ → ἰ Ἱ → ἱ ὁ → ο ὅ → ο Ὀ → ὀ Ὁ → ο Ὁ → ὁ Ὃ → ὃ Ὄ → ὄ Ὅ → ὅ ὐ → υ ὔ → υ Ὑ → ὑ Ὕ → ὕ ὠ → ω ὡ → ω Ὡ → ὡ Ὤ → ὤ Ὦ → ὦ ὰ → α ὲ → ε έ → ε ὴ → ή ὶ → ι ί → ι ὸ → ο ὺ → υ ὼ → ω ώ → ω ᾶ → α ᾽ → ̓ ᾿ → ̓ ῆ → ή ῖ → ι ῦ → υ ῶ → ω ῾ → ̔ Affects: Ancient Greek [grc], Greek [ell], Portuguese [por]
[x]
Ա → ա Բ → բ Գ → գ Դ → դ Ե → ե Զ → զ Է → է Ը → ը Թ → թ Ժ → ժ Ի → ի Լ → լ Խ → խ Ծ → ծ Կ → կ Հ → հ Ձ → ձ Ղ → ղ Ճ → ճ Մ → մ Յ → յ Ն → ն Շ → շ Ո → ո Չ → չ Պ → պ Ջ → ջ Ս → ս Վ → վ Տ → տ Ց → ց Ւ → ւ Փ → փ Ք → ք Օ → օ Ֆ → ֆ Affects: Armenian [hye]
[x]
Ꭰ → ꭰ Ꭱ → ꭱ Ꭴ → ꭴ Ꭶ → ꭶ Ꭷ → ꭷ Ꭸ → ꭸ Ꭹ → ꭹ Ꭺ → ꭺ Ꭼ → ꭼ Ꭽ → ꭽ Ꭿ → ꭿ Ꮂ → ꮂ Ꮃ → ꮃ Ꮅ → ꮅ Ꮆ → ꮆ Ꮈ → ꮈ Ꮎ → ꮎ Ꮑ → ꮑ Ꮒ → ꮒ Ꮓ → ꮓ Ꮕ → ꮕ Ꮖ → ꮖ Ꮗ → ꮗ Ꮙ → ꮙ Ꮛ → ꮛ Ꮜ → ꮜ Ꮝ → ꮝ Ꮟ → ꮟ Ꮡ → ꮡ Ꮢ → ꮢ Ꮣ → ꮣ Ꮤ → ꮤ Ꮥ → ꮥ Ꮧ → ꮧ Ꮨ → ꮨ Ꮩ → ꮩ Ꮪ → ꮪ Ꮭ → ꮭ Ꮰ → ꮰ Ꮱ → ꮱ Ꮲ → ꮲ Ꮳ → ꮳ Ꮵ → ꮵ Ꮷ → ꮷ Ꮸ → ꮸ Ꮹ → ꮹ Ꮺ → ꮺ Ꮻ → ꮻ Ꮼ → ꮼ Ꮿ → ꮿ Ᏸ → ᏸ Ᏹ → ᏹ Ᏺ → ᏺ Ᏼ → ᏼ Affects: Cherokee [chr]
[ ]
゜ → ゚ Affects: Japanese [jpn]
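The "Case Alternatives" groups above are described as fixed points of repeatedly applying NFKC, uppercasing and lowercasing. A minimal Python sketch of such an iteration follows. It uses Python's built-in str.upper and str.lower instead of ICU, and the order of the three steps is an assumption, so its results may differ from the lists in this issue.

```python
import unicodedata

def fold(text: str) -> str:
    # Apply NFKC, uppercasing and lowercasing until nothing changes any more.
    while True:
        folded = unicodedata.normalize("NFKC", text).upper().lower()
        if folded == text:
            return text
        text = folded

print(fold("\u00df"))               # ss   (input: sharp s)
print(fold("\u017f"))               # s    (input: long s)
print(fold("\u0130") == "i\u0307")  # True (capital I with dot above becomes i + U+0307)
```

Two characters whose folded forms are equal would be treated as alternatives of each other by a search that indexes the folded form.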