avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

y transformed to u? #22

Closed PonteIneptique closed 6 years ago

PonteIneptique commented 6 years ago

Hi there! I hope this is really a bug and not just me misunderstanding what's going on... Anyway.

Context

Reproduction of the bug

from unidecode import unidecode
print(unidecode("y"))  # prints y
print(unidecode("ў"))  # prints u

Expected

print(unidecode("ў")) should print y as well

I'll try to see if I can understand the reason why it works this way in the codebase but at least the issue is open...

PonteIneptique commented 6 years ago

The same happens with unidecode_expect_nonascii

avian2 commented 6 years ago

"ў" is Cyrillic letter "u" (U+045E CYRILLIC SMALL LETTER SHORT U).

From Wikipedia:

It is normally romanised as "u", but in Kazakh, it is romanised as "w".

I believe Unidecode output is correct in this case.
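For anyone who wants to check which character a string actually contains, a minimal sketch using only the standard library `unicodedata` module follows. The helper name `describe` is made up for illustration and is not part of Unidecode.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    # "<unnamed>" is a fallback for characters that have no name in the database.
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

print(describe("y"))       # [('U+0079', 'LATIN SMALL LETTER Y')]
print(describe("\u045e"))  # [('U+045E', 'CYRILLIC SMALL LETTER SHORT U')]
```

The two characters look alike in many fonts, but the names show that they belong to different scripts.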

PonteIneptique commented 6 years ago

Well, it's certainly interesting... This is generally used in literature studies to mark short vowels (in this case y). Sorry for opening a bug then :)

avian2 commented 6 years ago

If it is supposed to be a Latin letter "y" with a breve diacritic in your text ("y̆"), then I think it should be encoded as U+0079 U+0306 (Latin letter y, followed by a combining breve character). Unicode doesn't provide a canonical composition in this case (a single code point, as it does for the Cyrillic U+045E), but the decomposed form should still be valid. In that case Unidecode will correctly transliterate "y̆" into "y".

So in short, I believe this is a problem with the input you are giving Unidecode, not Unidecode itself.
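The difference between the two encodings can be checked with the standard library alone. This is a rough sketch, not a description of how Unidecode works internally: `strip_marks` is a hypothetical helper that only decomposes the text and drops combining marks, and the constant names are likewise made up for illustration.

```python
import unicodedata

LATIN_Y_BREVE = "y\u0306"    # U+0079 followed by U+0306 COMBINING BREVE
CYRILLIC_SHORT_U = "\u045e"  # single precomposed code point

def strip_marks(text):
    """Decompose text (NFD) and drop combining marks; a stdlib stand-in, not Unidecode."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# NFC has no precomposed "y with breve" to compose into, so two code points remain.
print(len(unicodedata.normalize("NFC", LATIN_Y_BREVE)))  # 2

# Dropping the breve leaves the Latin base letter.
print(strip_marks(LATIN_Y_BREVE))  # y

# U+045E decomposes to a Cyrillic base letter, not a Latin one.
print(unicodedata.name(strip_marks(CYRILLIC_SHORT_U)))  # CYRILLIC SMALL LETTER U
```

If a corpus is known to use U+045E as a stand-in for "y̆", replacing it with "y\u0306" before calling Unidecode would be one possible workaround, assuming the text contains no genuine Cyrillic.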