avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Faulty transliteration of half-width Katakana with dakuten and handakuten #51

Closed Khris777 closed 5 years ago

Khris777 commented 5 years ago

Example code to test Japanese characters in Hiragana, full-width Katakana, and half-width Katakana, including dakuten and handakuten:

import unidecode as ud

hiragana    = "はひほへほ ばびぶぼべ ぱぴぷぺぽ"
katakana_fw = "ハヒフヘホ バビブベボ パピプペポ"
katakana_hw = "ﾊﾋﾌﾍﾎ ﾊﾞﾋﾞﾌﾞﾍﾞﾎﾞ ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ"

print(ud.unidecode(hiragana))
print(ud.unidecode(katakana_fw))
print(ud.unidecode(katakana_hw))

The results for Hiragana and full-width Katakana are correct, but the result for half-width Katakana is not:

hahihoheho babibubobe papipupepo
hahihuheho babibubebo papipupepo
hahihuheho ha:hi:hu:he:ho: ha;hi;hu;he;ho;

For half-width Katakana, it returns "ha:" instead of the correct transliteration "ba", and "ha;" instead of "pa".

avian2 commented 5 years ago

Hi

If I understand correctly, the problem you are reporting is specifically with these transliterations:

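>>> from unidecode import unidecode
>>> unidecode('\uff8a')          # HALFWIDTH KATAKANA LETTER HA
'ha'  # ok
>>> unidecode('\uff8a\uff9e')    # HA + halfwidth voiced sound mark (dakuten)
'ha:'  # should be 'ba'
>>> unidecode('\uff8a\uff9f')    # HA + halfwidth semi-voiced sound mark (handakuten)
'ha;'  # should be 'pa'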

Unfortunately, this will not be fixed in Unidecode, since it would require transliterating U+FF8A differently depending on context (in other words, whether it appears together with U+FF9E or U+FF9F). As it says in the README, context-sensitive transliteration is outside of the scope of this library.

Please either do these specific replacements in your own code or try another library that does language-specific transliterations, such as Unihandecode.
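For the first option, here is a minimal sketch of doing the replacements in your own code. It is not part of Unidecode, and the helper name `fold_halfwidth_katakana` is hypothetical. It assumes that Unicode NFKC normalization is acceptable for the half-width Katakana in your text: NFKC maps each half-width letter to its full-width form and composes a following dakuten or handakuten into a single character. Only runs in the half-width Katakana range are normalized, so the rest of the string is left untouched.

```python
import re
import unicodedata

# Halfwidth Katakana letters (U+FF66..U+FF9D) plus the halfwidth voiced
# (U+FF9E, dakuten) and semi-voiced (U+FF9F, handakuten) sound marks.
_HALFWIDTH_KATAKANA = re.compile("[\uff66-\uff9f]+")


def fold_halfwidth_katakana(text):
    """Replace runs of half-width Katakana with composed full-width forms.

    Hypothetical helper: NFKC is applied only to the matched runs, so a
    letter followed by a sound mark (e.g. U+FF8A U+FF9E) becomes one
    precomposed character (U+30D0).
    """
    return _HALFWIDTH_KATAKANA.sub(
        lambda match: unicodedata.normalize("NFKC", match.group()),
        text,
    )


sample = "\uff8a \uff8a\uff9e \uff8a\uff9f"   # half-width ha, ba, pa
folded = fold_halfwidth_katakana(sample)
print(folded)  # ハ バ パ
```

Passing `folded` to `unidecode()` instead of the raw string should then take the full-width path, which the report above shows already yields "ba" and "pa". A letter that has no precomposed voiced form keeps a separate combining mark after normalization, so unusual input may still need special handling.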