avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Some Chinese names have unnecessary spaces at the end when transliterating #64

Open taraskuzyk opened 3 years ago

taraskuzyk commented 3 years ago

When trying to transliterate
"马云" I receive
"Ma Yun " (notice the space at the end) instead of
"Ma Yun"

Here's the code you can use to replicate this issue:

import unittest
import unidecode

class TestStrings(unittest.TestCase):
    def test_replace_non_ascii_letters_with_chinese_name(self):
        # assertEqual replaces the deprecated assertEquals alias
        self.assertEqual(unidecode.unidecode("马云"), "Ma Yun")

if __name__ == "__main__":
    unittest.main()

The test fails with the following error:

AssertionError: 'Ma Yun ' != 'Ma Yun'
- Ma Yun 
?       -
+ Ma Yun

Run on Python 3.8.5

EDIT:

Google Translate seems to handle this without the trailing space, but perhaps Google Translate's transliteration is the faulty one. Chinese speakers are welcome to correct me.

avian2 commented 3 years ago

The technical reason the transliteration of each character ends with a space is that otherwise you would not get spaces between characters: in your example you would get "MaYun". Unidecode just does a simple mapping from each Unicode character to an ASCII sequence and doesn't know which character appears last in your name. Hence the last character leaves a trailing space.

I don't speak Chinese, but the original author of Unidecode thought it was better to have spaces, so I will leave it like that.
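
The per-character mapping described above can be illustrated with a minimal sketch. This is not Unidecode's actual code: `TOY_TABLE` and `toy_transliterate` are hypothetical names, and the two-entry table is a stand-in for the library's real lookup tables. It only shows why a context-free mapping leaves a trailing space, and how a caller could strip it.

```python
# Hypothetical two-entry stand-in for Unidecode's much larger lookup tables.
# Each entry carries its own trailing space so that adjacent characters
# come out separated ("Ma Yun " rather than "MaYun").
TOY_TABLE = {
    "马": "Ma ",
    "云": "Yun ",
}


def toy_transliterate(text: str) -> str:
    """Map each character independently; ASCII passes through unchanged."""
    return "".join(
        ch if ord(ch) < 128 else TOY_TABLE.get(ch, "")
        for ch in text
    )


raw = toy_transliterate("马云")
print(repr(raw))          # 'Ma Yun '  (the last character still adds its space)
print(repr(raw.strip()))  # 'Ma Yun'   (caller-side cleanup)
```

With the real library, the same caller-side cleanup would presumably be `unidecode.unidecode("马云").strip()`, assuming surrounding whitespace in the input does not need to be preserved.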