avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Normalizing fancy characters #72

Open GokulNC opened 2 years ago

GokulNC commented 2 years ago

Thanks for making the library. It has been really helpful for cleaning social media text.

Here are some cases where the transliteration/conversion was not correct (Version 1.3.2):

>>> from unidecode import unidecode
>>> unidecode("ᕼᗩᑭᑭIᗴᗴ")
'hpokikiIgaga'
>>> unidecode("🇦🇷🇮")
''
>>> unidecode("ωεłł")
'oell'
>>> unidecode("RᗅIPႮ")
'RghoIPP'
>>> unidecode("ғʀᴇᴇ")
"g'REE"
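
To see which script each of these characters comes from, one option is to look up its Unicode name with the standard-library `unicodedata` module. This is only a sketch for inspection: `script_guess` and `describe` are hypothetical helper names, not part of unidecode, and taking the first word of the character name is a rough heuristic rather than a real script lookup.

```python
import unicodedata


def script_guess(ch):
    """Rough heuristic: first word of the Unicode character name, e.g. 'GREEK'."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def describe(text):
    """Return (character, Unicode name) pairs for every code point in text."""
    return [(ch, unicodedata.name(ch, "<unnamed>")) for ch in text]


# Print the code point and name of each character in one of the inputs above.
for ch, name in describe("ωεłł"):
    print(f"U+{ord(ch):04X} {name}")
```

Running the same loop over the other inputs shows where the unexpected transliterations come from: the output reflects the character's own script, not the Latin letter it resembles.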

I will update this issue with more examples as I come across them. Thanks!


Edit:

It looks like most of these issues arise because the characters belong to other scripts rather than being fancy variants of Latin letters. So, is there some way to do appearance-based conversion rather than approximate phonetic conversion?
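
To make the request concrete, here is a minimal sketch of what appearance-based conversion could look like as a pre-processing step. Everything in it is an assumption for illustration: the `LOOKALIKES` table is hand-built from the examples in this issue, the chosen ASCII targets are my own reading of the intended words, and `normalize_lookalikes` is a hypothetical helper, not something unidecode provides.

```python
# Hypothetical look-alike table, hand-built from the examples in this issue.
# Each key is a single code point; each value is the ASCII letter it resembles.
LOOKALIKES = {
    "ᕼ": "H", "ᗩ": "A", "ᑭ": "P", "ᗴ": "E", "ᗅ": "A",
    "Ⴎ": "U",
    "ω": "w", "ε": "e", "ł": "l",
    "ғ": "f", "ʀ": "r", "ᴇ": "e",
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(LOOKALIKES)


def normalize_lookalikes(text):
    """Replace known look-alike characters; leave everything else untouched."""
    return text.translate(_TABLE)


print(normalize_lookalikes("ᕼᗩᑭᑭIᗴᗴ"))
print(normalize_lookalikes("ωεłł"))
print(normalize_lookalikes("ғʀᴇᴇ"))
```

Characters that are not in the table pass through unchanged, so the result could still be handed to `unidecode` afterwards for ordinary transliteration. A table like this would have to be much larger to be useful in practice.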