avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Difference in NFC vs NFD decoding #62

Closed (jakepoz closed this issue 3 years ago)

jakepoz commented 3 years ago

Unidecode gives different results depending on whether the string is normalized to NFC (composed form) or NFD (decomposed form, where a single letter may be represented by a base letter followed by a separate Unicode combining accent mark).

unidecode.unidecode(unicodedata.normalize('NFC', 'Ѐ')) == 'Ie'
unidecode.unidecode(unicodedata.normalize('NFD', 'Ѐ')) == 'E'

Is this expected behavior?
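
For readers following along, the two normalization forms can be inspected with the standard library alone. This is a minimal sketch, assuming the character in the example above is U+0400; the `codepoints` helper is a hypothetical name introduced only for illustration.

```python
import unicodedata

# Assumption: the character in the report is U+0400,
# CYRILLIC CAPITAL LETTER IE WITH GRAVE.
ch = "\u0400"

nfc = unicodedata.normalize("NFC", ch)
nfd = unicodedata.normalize("NFD", ch)


def codepoints(s):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]


print(codepoints(nfc))  # ['U+0400']            one precomposed character
print(codepoints(nfd))  # ['U+0415', 'U+0300']  base letter + combining grave
```

In NFD form the transliterator sees the unaccented letter U+0415 and a combining mark as two separate characters, which is why the two calls can take different paths through its tables.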

avian2 commented 3 years ago

In general, Unidecode doesn't guarantee that different Unicode normalizations of a string will produce the same output.
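
Since no such guarantee exists, a caller who needs consistent output could normalize to one fixed form before transliterating. The sketch below is only an illustration, not something the maintainer prescribes: `transliterate_normalized` is a hypothetical helper, and `toy_transliterate` is a stand-in for `unidecode.unidecode` whose three-entry table is an assumption that mirrors the values reported in this issue.

```python
import unicodedata
from typing import Callable


def transliterate_normalized(
    text: str, transliterate: Callable[[str], str], form: str = "NFC"
) -> str:
    """Hypothetical helper: normalize first, so canonically equivalent
    inputs reach the transliterator as the same sequence of code points."""
    return transliterate(unicodedata.normalize(form, text))


# Toy stand-in for unidecode.unidecode (assumed values, taken from the
# outputs reported in this issue; not the library's real tables).
TOY_TABLE = {"\u0400": "Ie", "\u0415": "E", "\u0300": ""}


def toy_transliterate(text: str) -> str:
    return "".join(TOY_TABLE.get(c, c) for c in text)


composed = "\u0400"          # NFC form
decomposed = "\u0415\u0300"  # NFD form of the same character

# Without normalization the toy outputs differ, as in the report.
print(toy_transliterate(composed), toy_transliterate(decomposed))

# With a fixed normalization form they agree.
print(
    transliterate_normalized(composed, toy_transliterate)
    == transliterate_normalized(decomposed, toy_transliterate)
)
```

With the real library, the same helper would presumably be called as `transliterate_normalized(s, unidecode.unidecode)`. Which form to pick is a design choice: NFC favors the table entries for precomposed characters, while NFD favors the entries for base letters.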

In your specific case, it might be that U+0400 (Cyrillic capital letter IE with grave) should have been changed from "Ie" to "E" in this old commit, as was done for the unaccented variant: https://github.com/avian2/unidecode/commit/8c0dbddfe532668eebb0a746ebe7cccd28907abb

Unfortunately I don't have any context for that change. The email address of the committer is invalid, so I can't ask them.

jakepoz commented 3 years ago

Got it. If that's not guaranteed in general, then we won't depend on it!