avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Some Minor Bugs of Transliteration #16

Closed by BLKSerene 5 years ago

BLKSerene commented 7 years ago
  1. 呆 is transliterated as Ai, but it should be Dai (its pronunciation in Chinese).

  2. I found that unidecode transliterates every character to its Chinese pronunciation, even though the text is actually written in Japanese kanji.

  3. When Chinese characters are transliterated to their pronunciations, a trailing whitespace is always appended to the output. 阿傍 -> "A Bang " (there is a trailing whitespace at the end).

  4. When the text mixes Chinese characters with Japanese katakana/hiragana, the whitespace after the kana is left out. 阿呆の足下使い currently gives "A Ai noZu Xia Shi i", while the expected output is "A Ai no Zu Xia Shi i". (And these are Japanese characters, not Chinese characters.)
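
Items 3 and 4 can be worked around on the caller's side. The following is a minimal sketch, not part of Unidecode: the helper names `pad_script_boundaries` and `tidy` are hypothetical, and the ideograph range covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). It assumes the behavior reported in this issue, namely that each ideograph is transliterated with a trailing space while kana are not.

```python
import re

# Zero-width position where a non-space, non-ideograph character (for
# example a kana) is directly followed by a CJK ideograph.
# Assumption: only the basic CJK Unified Ideographs block is considered.
_BOUNDARY = re.compile(r"(?<=[^\s\u4e00-\u9fff])(?=[\u4e00-\u9fff])")


def pad_script_boundaries(text):
    """Insert a space before an ideograph that follows a non-ideograph.

    Run this on the source text BEFORE transliteration, so that the
    transliterated kana and the following ideograph do not fuse (item 4).
    """
    return _BOUNDARY.sub(" ", text)


def tidy(transliterated):
    """Collapse whitespace runs and drop leading/trailing whitespace.

    Run this on the transliteration output to remove the trailing
    space described in item 3.
    """
    return " ".join(transliterated.split())
```

With Unidecode installed, the two helpers would be combined as `tidy(unidecode(pad_script_boundaries(text)))`. This only addresses spacing; it does not change which reading is chosen for a character (items 1 and 2).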

avian2 commented 7 years ago

It is unlikely these problems will be fixed. Please read this document (specifically the sections "When you don't like what Unidecode does" and "Caveats") for some of the reasons why this is a hard problem to solve with the approach Unidecode takes to transliteration:

http://search.cpan.org/~sburke/Text-Unidecode-1.30/lib/Text/Unidecode.pm

You might want to use https://github.com/miurahr/unihandecode instead.

BLKSerene commented 5 years ago

OK, I got it. Thanks anyway!