Difference in NFC vs NFD decoding

jakepoz commented 3 years ago

Unidecode gives different results depending on whether the string is normalized to NFC (composed form) or NFD (decomposed form, where a single letter may be represented by the base letter followed by a separate Unicode combining accent mark).

unidecode.unidecode(unicodedata.normalize('NFC', 'Ѐ')) == 'Ie'
unidecode.unidecode(unicodedata.normalize('NFD', 'Ѐ')) == 'E'
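
For context, here is a minimal sketch that uses only the standard library to show what each normalization form hands to Unidecode. The helper name `code_points` is made up for illustration. NFC keeps the single code point U+0400, while NFD splits it into U+0415 followed by the combining grave accent U+0300.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Describe each character as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# U+0400 is the character from the report above.
composed = unicodedata.normalize("NFC", "\u0400")
decomposed = unicodedata.normalize("NFD", "\u0400")

print(code_points(composed))    # one code point
print(code_points(decomposed))  # base letter plus combining accent
```

Since a character-by-character transliteration would see a different sequence of code points in each case, the two results can differ whenever the table entry for the precomposed letter does not match the entry for its base letter.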

Is this expected behavior?

avian2 commented 3 years ago

In general, Unidecode doesn't guarantee that different Unicode normalizations of a string will produce the same output.
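
If consistent output matters to a caller, one possible workaround is to normalize every string to a single form before transliterating it. The sketch below is an assumption about how a caller might do that, not something Unidecode provides: `transliterate_normalized` is a hypothetical helper, and the transliteration function is passed in as a parameter so the block runs without third-party packages.

```python
import unicodedata
from typing import Callable


def transliterate_normalized(
    text: str,
    transliterate: Callable[[str], str],
    form: str = "NFC",
) -> str:
    """Normalize text to one Unicode form, then transliterate it.

    Hypothetical helper: canonically equivalent inputs reach the
    transliterator as the same code point sequence.
    """
    return transliterate(unicodedata.normalize(form, text))


def show_hex(text: str) -> str:
    """Stand-in transliterator for the demo: lists code points in hex."""
    return "-".join(f"{ord(ch):04X}" for ch in text)


# Composed and decomposed spellings of the same letter give the same result.
print(transliterate_normalized("\u00e9", show_hex))
print(transliterate_normalized("e\u0301", show_hex))
```

With Unidecode installed, the call would presumably look like `transliterate_normalized(s, unidecode.unidecode)`. This only makes equivalent inputs produce identical output; it does not decide whether the NFC or the NFD result is the better transliteration.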

In your specific case, it might be that U+0400 (CYRILLIC CAPITAL LETTER IE WITH GRAVE) should have been changed from "Ie" to "E" in this old commit, as was done for the unaccented variant (U+0415): https://github.com/avian2/unidecode/commit/8c0dbddfe532668eebb0a746ebe7cccd28907abb

Unfortunately I don't have any context for that change. The email address of the committer is invalid, so I can't ask them.

jakepoz commented 3 years ago

Got it, if that's not a guarantee in general, then we won't depend on it!
