PrinsFrank / standards

A collection of standards as PHP Enums: ISO3166, ISO4217, ISO639...
MIT License

Non-ASCII characters in source code #135

Closed: szepeviktor closed this issue 9 months ago

szepeviktor commented 9 months ago

There is an open-mid back unrounded vowel (ʌ, U+028C) in the source code: https://github.com/PrinsFrank/standards/blob/f353d2953d086b423f66d6134fc98fc99978d8d3/src/Scripts/ScriptName.php#L148

There are also these characters from the IPA Extensions block (https://www.unicode.org/charts/PDF/U0250.pdf):
https://github.com/PrinsFrank/standards/blob/f353d2953d086b423f66d6134fc98fc99978d8d3/src/Language/LanguageAlpha3Extensive.php#L1727
https://github.com/PrinsFrank/standards/blob/f353d2953d086b423f66d6134fc98fc99978d8d3/src/Scripts/ScriptAlias.php#L100

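Find them all. This lists every line that contains a character outside printable ASCII, unless that character sits inside a single-quoted string: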

LC_ALL=C.UTF-8 git grep --perl-regexp --line-number -I -e '[^ -~]' --and --not -e "'.*[^ -~].*'"
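
For anyone who wants the same check outside git, here is a minimal Python sketch of the per-line logic in that command. It reuses the two patterns from the command; the function name `flag_lines` and the sample enum cases are made up for illustration and are not taken from the repository. Unlike `git grep -I`, it does not skip binary files, and like the original pattern it also flags tabs, because a tab falls outside the space-to-tilde range.

```python
import re

# Any character outside printable ASCII (space through tilde), as in the grep pattern.
NON_ASCII = re.compile(r"[^ -~]")
# A single-quoted string containing such a character (the case the command excludes).
QUOTED_NON_ASCII = re.compile(r"'.*[^ -~].*'")


def flag_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs matching the first pattern but not the second."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if NON_ASCII.search(line) and not QUOTED_NON_ASCII.search(line):
            hits.append((number, line))
    return hits


# Hypothetical sample: only the first line has the character outside the quotes.
sample = "case Naci_Gʌba = 'x';\ncase Latin = 'Latn';\ncase Foo = 'gʌba';\n"
print(flag_lines(sample))
```

Only the first sample line is reported: the second is pure ASCII, and the third has the vowel inside the quoted value.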
szepeviktor commented 9 months ago

Oh no! There is already a transliteration step in NameNormalizer, and it fails on these characters: https://github.com/PrinsFrank/standards/blob/f353d2953d086b423f66d6134fc98fc99978d8d3/dev/DataTarget/NameNormalizer.php#L13

szepeviktor commented 9 months ago

It is not a PHP bug: the ICU transliteration demo at https://icu4c-demos.unicode.org/icu-bin/translit also leaves naci gʌba unchanged.
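
One plausible reason, sketched here with Python's standard `unicodedata` module as an analogy: accent-stripping approaches rely on Unicode decomposition, and ʌ (U+028C) has no decomposition mapping, so there is nothing to strip. This is not ICU and not the repository's NameNormalizer, whose transform rules may differ; `strip_marks` is a hypothetical helper for illustration only.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (a common ASCII-folding shortcut)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The accent on "é" becomes a separate combining mark and is removed.
print(strip_marks("café"))
# U+028C has no decomposition, so the string passes through untouched.
print(strip_marks("naci gʌba"))
# An empty decomposition field confirms there is no base letter to fall back to.
print(repr(unicodedata.decomposition("ʌ")))
```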

szepeviktor commented 9 months ago

There is a proper way to convert IPA characters into ASCII: X-SAMPA. https://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm

szepeviktor commented 9 months ago

Unicode code point | IPA character | X-SAMPA, or fallback transliteration where the X-SAMPA form is not plain alphanumeric

U+0250|ɐ|6
U+0251|ɑ|A
U+0252|ɒ|Q
U+0253|ɓ|b
U+0254|ɔ|O
U+0255|ɕ|s
U+0256|ɖ|d
U+0257|ɗ|d
U+0258|ɘ|N/A
U+0259|ə|N/A
U+025A|ɚ|N/A
U+025B|ɛ|E
U+025C|ɜ|3
U+025D|ɝ|N/A
U+025E|ɞ|3
U+025F|ɟ|J
U+0260|ɠ|g
U+0261|ɡ|g
U+0262|ɢ|G
U+0263|ɣ|G
U+0264|ɤ|7
U+0265|ɥ|H
U+0266|ɦ|h
U+0267|ɧ|x
U+0268|ɨ|1
U+0269|ɩ|N/A
U+026A|ɪ|I
U+026B|ɫ|5
U+026C|ɬ|K
U+026D|ɭ|l
U+026E|ɮ|K
U+026F|ɯ|M
U+0270|ɰ|M
U+0271|ɱ|F
U+0272|ɲ|J
U+0273|ɳ|n
U+0274|ɴ|N
U+0275|ɵ|8
U+0276|ɶ|OE
U+0277|ɷ|N/A
U+0278|ɸ|p
U+0279|ɹ|r
U+027A|ɺ|l
U+027B|ɻ|r
U+027C|ɼ|r
U+027D|ɽ|r
U+027E|ɾ|4
U+027F|ɿ|N/A
U+0280|ʀ|R
U+0281|ʁ|R
U+0282|ʂ|s
U+0283|ʃ|S
U+0284|ʄ|J
U+0285|ʅ|N/A
U+0286|ʆ|N/A
U+0287|ʇ|N/A
U+0288|ʈ|t
U+0289|ʉ|u
U+028A|ʊ|U
U+028B|ʋ|P
U+028C|ʌ|V
U+028D|ʍ|W
U+028E|ʎ|L
U+028F|ʏ|Y
U+0290|ʐ|z
U+0291|ʑ|z
U+0292|ʒ|Z
U+0293|ʓ|N/A
U+0294|ʔ|N/A
U+0295|ʕ|N/A
U+0296|ʖ|N/A
U+0297|ʗ|N/A
U+0298|ʘ|O
U+0299|ʙ|B
U+029A|ʚ|N/A
U+029B|ʛ|G
U+029C|ʜ|H
U+029D|ʝ|j
U+029E|ʞ|N/A
U+029F|ʟ|L
U+02A0|ʠ|q
U+02A1|ʡ|N/A
U+02A2|ʢ|N/A
U+02A3|ʣ|dz
U+02A4|ʤ|N/A
U+02A5|ʥ|dz
U+02A6|ʦ|ts
U+02A7|ʧ|N/A
U+02A8|ʨ|N/A
U+02A9|ʩ|N/A
U+02AA|ʪ|ls
U+02AB|ʫ|lz
U+02AC|ʬ|N/A
U+02AD|ʭ|N/A
U+02AE|ʮ|N/A
U+02AF|ʯ|N/A
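
A minimal sketch of how the table could be applied as an explicit lookup, written in Python for illustration. The mapping holds only a handful of rows copied from the table; the names `IPA_TO_ASCII`, `to_ascii`, and `unmapped` are hypothetical. Rows marked N/A are left out on purpose, so those characters stay in the output and can be reported rather than silently dropped.

```python
# A few rows copied from the table above; "N/A" rows are intentionally omitted.
IPA_TO_ASCII = {
    "ɐ": "6",
    "ɑ": "A",
    "ɔ": "O",
    "ɛ": "E",
    "ʃ": "S",
    "ʌ": "V",
    "ʒ": "Z",
    "ʣ": "dz",
    "ʦ": "ts",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
TABLE = str.maketrans(IPA_TO_ASCII)


def to_ascii(name: str) -> str:
    """Replace mapped IPA characters; anything unmapped is left as-is."""
    return name.translate(TABLE)


def unmapped(name: str) -> list[str]:
    """List IPA Extensions characters (U+0250..U+02AF) that survive the mapping."""
    return [ch for ch in to_ascii(name) if 0x0250 <= ord(ch) <= 0x02AF]


print(to_ascii("naci gʌba"))  # naci gVba
print(unmapped("ʔa"))         # ['ʔ'] because U+0294 is an N/A row
```

Keeping the N/A rows out of the mapping makes the gaps visible: a caller can decide per character whether to pick a manual fallback or reject the name.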