avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Language parameter? #15

Closed bittlingmayer closed 7 years ago

bittlingmayer commented 7 years ago

Major scripts like Latin, Cyrillic and Arabic are used to write many languages. Right now, the lib is implicitly unidecoding all Cyrillic as if it were Russian specifically.

For example:
unidecode('Халид Бешлић, Цеца, Жељко Јоксимовић') # sr

actual: 'Khalid Beshlitsh, Tsetsa, Zheljko Joksimovitsh'

expected: 'Halid Beslic, Ceca, Zeljko Joksimovic'

That's because there should be an implicit initial conversion to Latin ('Halid Bešlić, Ceca, Željko Joksimović'). Passing that to unidecode works as expected.
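The initial conversion described above could be sketched as a pre-processing step that runs before unidecode. This is only an illustration: the table SR_CYR_TO_LAT and the function serbian_to_latin are hypothetical names, not part of Unidecode, and the mapping assumes the standard one-to-one correspondence between the Serbian Cyrillic and Latin alphabets.

```python
# Hypothetical pre-processing step; NOT part of Unidecode.
# Maps the 30 lowercase Serbian Cyrillic letters to Serbian Latin.
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def serbian_to_latin(text):
    """Convert Serbian Cyrillic letters to Serbian Latin, keeping other characters."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = SR_CYR_TO_LAT.get(lower)
        if latin is None:
            # Not a Serbian Cyrillic letter (punctuation, Latin, etc.): keep as-is.
            out.append(ch)
        elif ch != lower:
            # Uppercase input: capitalize the first Latin letter only ("Љ" -> "Lj").
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


latin = serbian_to_latin("Халид Бешлић, Цеца, Жељко Јоксимовић")
print(latin)
```

The resulting string still contains diacritics; it is this intermediate Latin form that would then be passed to unidecode.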

The same is true for Latin. For example:
unidecode('Kadıköy') # tr
unidecode('Schönheimer') # de

actual:
'Kadikoy'
'Schonheimer'

expected:
'Kadikoy'
'Schoenheimer'

The most straightforward fix is an optional parameter lang.
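A rough sketch of what such a parameter might look like, written as a wrapper rather than a change to the library. Everything here is an assumption for illustration: transliterate, LANG_OVERRIDES and strip_marks are hypothetical names, the override tables are deliberately tiny, and strip_marks is only a standard-library stand-in for the context-free step (in practice one would presumably pass unidecode itself as the fallback).

```python
import unicodedata

# Hypothetical per-language overrides, applied before the context-free step.
# These tables are illustrative and far from complete.
LANG_OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
    "tr": {"ı": "i", "İ": "I"},
}


def strip_marks(text):
    """Stand-in fallback: decompose with NFKD and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate(text, lang=None, fallback=strip_marks):
    """Apply language-specific replacements, then a context-free fallback."""
    overrides = LANG_OVERRIDES.get(lang, {})
    replaced = "".join(overrides.get(ch, ch) for ch in text)
    return fallback(replaced)


print(transliterate("Schönheimer", lang="de"))
print(transliterate("Schönheimer"))
print(transliterate("Kadıköy", lang="tr"))
```

With no lang given, the wrapper falls through to the context-free behaviour, so existing callers would see no change.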

avian2 commented 7 years ago

Sorry, language-dependent transliteration is out of scope for Unidecode. There are other projects that do that. For example, you might want to use Unihandecode instead, which does implement a lang parameter.

I understand that this is a very common problem, but it will not be fixed in this library. Please see the arguments in the following links for why Unidecode will not go beyond context-free character replacements:

https://www.tablix.org/~avian/blog/archives/2013/09/python_unidecode_release_0_04_14/

http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

bittlingmayer commented 7 years ago

Fair enough, I understand this is a very hard task to do well and impossible to do perfectly (and thank you for the links to enjoyable reading and to unihandecode).

I see the current behaviour for Cyrillic -- favouring the Russian interpretation over others -- as unfortunately similar to the undesirable behaviour with ö and ü -- favouring the German interpretation.

So in that sense, it's inconsistent with your stated principle. On the other hand, to change it now would be inconsistent with the principle of not breaking existing code.

avian2 commented 7 years ago

As a native Slovenian speaker, I find the Serbian transliteration more understandable as well. Alas, I'm an engineer, not a linguist. Hence I trust Sean Burke, who originally wrote Unidecode and is way more qualified to decide on such things. After the whole German umlaut issue I'm very hesitant to accept changes like that.

You might want to raise the issue with him. If he changes the Cyrillic tables in his Perl module, I would be happy to accept the same change in Python Unidecode.