Closed PonteIneptique closed 6 years ago
The same happens with unidecode_expect_nonascii
"ў" is Cyrillic letter "u" (U+045E CYRILLIC SMALL LETTER SHORT U).
From Wikipedia:
It is normally romanised as "u", but in Kazakh, it is romanised as "w".
I believe Unidecode output is correct in this case.
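For anyone wanting to check which character a string actually contains, here is a minimal sketch using only Python's standard `unicodedata` module. The helper name `describe` is made up for illustration and is not part of Unidecode.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The character from this report, written as an escape to avoid ambiguity.
cyrillic_short_u = "\u045e"

for code_point, name in describe(cyrillic_short_u):
    print(code_point, name)
```

If the name starts with CYRILLIC, the input is a Cyrillic letter, however much it looks like a Latin one.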
Well, it's certainly interesting... This is generally used in literature studies to mark short vowels (in this case y). Sorry for opening a bug then :)
If it is supposed to be a Latin letter "y" with a breve diacritic in your text ("y̆"), then I think it should be encoded as U+0079 U+0306 (Latin letter y, followed by a combining breve character). Unicode doesn't provide a canonical composition in this case (a single code point, like it does for the Cyrillic U+045E), but the decomposed form should still be valid. In that case Unidecode will correctly transliterate "y̆" into "y".
So in short, I believe this is a problem with the input you are giving Unidecode, not Unidecode itself.
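The difference between the two encodings can be sketched with the standard library alone. This is only an illustration of the Unicode side: `strip_marks` is a hypothetical helper that drops combining marks after NFD decomposition. It is not how Unidecode works internally.

```python
import unicodedata

latin_y_breve = "y\u0306"    # U+0079 LATIN SMALL LETTER Y + U+0306 COMBINING BREVE
cyrillic_short_u = "\u045e"  # U+045E CYRILLIC SMALL LETTER SHORT U

def strip_marks(text):
    """Decompose text (NFD) and drop combining marks, keeping base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# No precomposed Latin "y with breve" exists, so NFC leaves two code points.
print(len(unicodedata.normalize("NFC", latin_y_breve)))

# The base letter of the Latin sequence is a plain Latin "y".
print(strip_marks(latin_y_breve))

# The Cyrillic letter decomposes to Cyrillic "у" (U+0443) plus the breve,
# so its base letter is Cyrillic, not Latin.
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", cyrillic_short_u)])
```

Text that means a Latin y with a breve therefore needs to be stored as the two-code-point sequence. The single Cyrillic code point looks the same but is a different letter.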
Hi there! I hope this is really a bug and not me being unsure of what's going on... Anyway.
Context
unidecode==0.4.21
Reproduction of the bug
Expected
print(unidecode("ў"))
should print y as well.
I'll try to see if I can understand the reason why it works this way in the codebase, but at least the issue is open...