barseghyanartur / transliterate

Bi-directional transliterator for Python. Transliterates (unicode) strings according to the rules specified in the language packs.
https://pypi.python.org/pypi/transliterate

Greek transliteration is non deterministic #47

Open hoschwenk opened 5 years ago

hoschwenk commented 5 years ago

Transliteration of Greek is non-deterministic! Running translit('Δεν του μίλησα ξανά.', 'el', reversed=True) several times gives either "den toy milisa xana." or "den tou milisa xana." Maybe both are correct, but the tool should always output the same one. Otherwise results are not reproducible, e.g. when used in a machine translation system.

This happens when python3 is started several times, not when the function is called in a loop within a single process.

barseghyanartur commented 4 years ago

@hoschwenk

Thanks for bringing this up.

There have been numerous attempts and PRs to bring correctness to Greek transliteration.

I'm all open for correctness and thus willing to accept a valid PR.

I think that, back in the day, I used this Wikipedia article as a valid and trustworthy source of information on the topic.

Could you please double-check your findings against the mentioned Wikipedia article and let me know if the current interpretation in transliterate isn't correct?

Thank you!

akosiaris commented 2 years ago

I am unable to reproduce this on master (9333f24) with Python 3.9.2:

for i in `seq 1 10000` ; do python3 foo.py ; done | sort | uniq -c | sort -rn
   10000 Den toy milisa xana.

with foo.py containing

import transliterate

print(transliterate.translit('Δεν του μίλησα ξανά.', 'el', reversed=True))
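The shell loop above can also be sketched in Python, which makes it easy to vary the string hash seed per run. This is only an illustration: the helper name `count_outputs` and its parameters are hypothetical, and varying `PYTHONHASHSEED` would only be expected to change dict ordering on interpreters older than CPython 3.6.

```python
import collections
import os
import subprocess
import sys


def count_outputs(cmd, runs=20, vary_hash_seed=True):
    """Run `cmd` in a fresh process `runs` times and tally distinct outputs."""
    counts = collections.Counter()
    for i in range(runs):
        env = dict(os.environ)
        if vary_hash_seed:
            # Use a different, fixed hash seed for each fresh interpreter.
            env["PYTHONHASHSEED"] = str(i + 1)
        result = subprocess.run(
            cmd, env=env, capture_output=True, text=True, check=True
        )
        counts[result.stdout.strip()] += 1
    return counts
```

Calling `count_outputs([sys.executable, "foo.py"])` next to the foo.py shown above should give a single entry if the output is deterministic, and more than one if it is not.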

This isn't easy to reproduce right now (which isn't surprising, 3 years have passed since 2019).

Judging from the report, I would say that we can no longer reproduce this because standard dictionary objects now preserve insertion order: this started as an implementation detail in CPython 3.6 and became part of the language specification in Python 3.7. Given the following stanza in the pre_processor_mapping of the Greek language pack

    u"Ou": u"Ου",
    u"ou": u"ου",
    u"Oy": u"Ου",
    u"oy": u"ου",

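it makes sense that the dictionaries are initialized with different orders on subsequent executions in Python versions before 3.6. String hash randomization, enabled by default since Python 3.3, changes the hash seed on every interpreter start, which would also explain why the result varies between runs but not inside a loop.

I'd say that this explains the inconsistent behavior. It also means that by now it has become extremely rare and will only show up when using older, unsupported Python versions.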
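To illustrate the order dependence, here is a minimal sketch that assumes the reverse table is built by simply swapping keys and values. That assumption has not been checked against the library source, and `invert` is a hypothetical helper. When two Latin keys share a Greek value, whichever key is visited last wins.

```python
def invert(mapping):
    """Swap keys and values; when values repeat, the last key visited wins."""
    inverted = {}
    for latin, greek in mapping.items():
        inverted[greek] = latin
    return inverted


# Same four entries as the stanza above, in two different iteration orders.
stanza = {"Ou": "Ου", "ou": "ου", "Oy": "Ου", "oy": "ου"}
reordered = {"Oy": "Ου", "oy": "ου", "Ou": "Ου", "ou": "ου"}

print(invert(stanza)["ου"])     # oy
print(invert(reordered)["ου"])  # ou
```

With insertion-ordered dicts the first variant always wins the same way, which matches the stable "toy" output seen on Python 3.9.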

However, the transliteration in the example above is just wrong.

I am not sure where the second mapping comes from, but it should not be there. Both ISO 843 [1] (the international ratification of ELOT 743 v1, with a couple of minor differences) and ELOT 743 version 2 type 1 [2] (the Greek cross-ratification of ISO 843, adopting those minor differences) specifically set an exception for the double vowel ου, which needs to be transliterated as ou and vice versa. There is no mapping exception to or from oy. So while oy would be transliterated to ου per the general rules, the inverse would never hold in a transliteration context (transcription, which favors pronunciation, is a different story). It's important to note that neither the UN nor the ALA-LC (Library of Congress) treats ου differently from ISO 843/ELOT 743 v2 (which isn't the case for some other mappings).
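A quick way to spot this kind of conflict is to look for Greek targets that are reachable from more than one Latin key. The sketch below uses a hypothetical helper, `ambiguous_targets`, on a copy of the four entries quoted earlier; it is not part of transliterate.

```python
from collections import defaultdict


def ambiguous_targets(mapping):
    """Return {value: [keys]} for every value reached from more than one key."""
    sources = defaultdict(list)
    for key, value in mapping.items():
        sources[value].append(key)
    return {value: keys for value, keys in sources.items() if len(keys) > 1}


stanza = {"Ou": "Ου", "ou": "ου", "Oy": "Ου", "oy": "ου"}
# Drop the oy/Oy entries, as proposed, and check again.
fixed = {k: v for k, v in stanza.items() if k.lower() != "oy"}

print(ambiguous_targets(stanza))  # both Greek targets have two Latin sources
print(ambiguous_targets(fixed))   # {}
```

Once the oy entries are gone, every Greek target has exactly one Latin source, so the reverse direction no longer depends on iteration order.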

@barseghyanartur I'll submit a PR to remove the oy mapping so that it conforms to the two standards (and also to UN and ALA-LC). Let me know if you disagree. Incidentally, that would also resolve this specific issue on older Python versions.

[1] https://en.wikipedia.org/wiki/ISO_843
[2] https://sete.gr/files/Media/Egkyklioi/040707Latin-Greek.pdf