avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Digraphs and trigraphs #14

Closed: bittlingmayer closed this issue 7 years ago

bittlingmayer commented 7 years ago

Truly arbitrary digraphs and trigraphs are less common in non-Latin scripts, but still happen.

unidecode("Բովանդակություն")

actual:   'Bovandakowt`yown'
expected: 'Bovandakut`yun'

The current behaviour is not necessarily wrong, but a bit undesirable because it breaks the roundtrip expectation ('Zulu' == unidecode(transliterate('Zulu', lang='hy'))).

There may be other cases mentioned in https://en.wikipedia.org/wiki/Digraph_(orthography)#Examples that are not covered. Some of them are working as intended, though, because for our purposes a logical digraph like дж is not a truly arbitrary digraph: it is still two chars in Latin and even three in ASCII, so mapping each char without context still works.
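For instance, дж already comes out right with a purely per-character mapping (the outputs shown are what I'd expect from the standard Unidecode Cyrillic tables):

from unidecode import unidecode

# Each Cyrillic letter has its own Latin expansion, so no context is needed.
print(unidecode("дж"))    # 'dzh'  (д -> 'd', ж -> 'zh')
print(unidecode("джаз"))  # 'dzhaz'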

avian2 commented 7 years ago

I don't know anything about the transliterate() function you are using.

If I understand correctly, you are suggesting that Unidecode should recognize language-specific digraphs (pairs of Unicode characters) and transliterate them differently from individual characters? If so, I'm afraid my answer to this is the same as #15.

bittlingmayer commented 7 years ago

(transliterate() was just a hypothetical.)

In this case it would not be language-specific but script-specific and thus universally applicable, without any language parameter, i.e. unlike #15. But it would be context-dependent, with the context being simply the preceding character, so by a simple longest-match principle it's still just a table.

'ու' is just how 'u' is written in Armenian script; it's the only way to write it, and it's a single entry in the dictionary, whereas standalone ւ is no longer an entry at all.
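To sketch the longest-match table idea (just an illustration, not a proposed patch; the uppercase entries are my own guesses):

from unidecode import unidecode

# Hypothetical script-level digraph table, tried with longest match before
# falling back to Unidecode's ordinary per-character mapping.
DIGRAPHS = {"ու": "u", "Ու": "U", "ՈՒ": "U"}

def transliterate_with_digraphs(text):
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:            # longest match: try the two-char entry first
            out.append(DIGRAPHS[pair])
            i += 2
        else:                           # otherwise one character at a time, as today
            out.append(unidecode(text[i]))
            i += 1
    return "".join(out)

print(transliterate_with_digraphs("Բովանդակություն"))  # 'Bovandakut`yun'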

(I am trying but failing to think of analogies for transliteration between other scripts.)

After consulting with some experts, I still believe that the current implementation is strange. 'u' is always correct for modern Eastern Armenian (e.g. Wikipedia) and it's what Google Translate uses; and for the antiquated orthographies where 'u' is scientific but not quite reflective of the pronunciation (e.g. պատուական), 'ow' is even less correct.

As it happens, Google Translate's choice is to transliterate 'ու' as 'u' but otherwise to transliterate 'ւ' as 'w'. I find conflicting information on the ISO standard, but all other schemes, including the classic ones (which would be my only area of concern), use 'u' as I suggest.

I don't wish to push one way or another, just to present you the evidence so you can apply your principles. Does this one also simply correspond to the Perl implementation?

avian2 commented 7 years ago

Thanks for researching this and sharing your findings. Unfortunately I don't feel confident in making a decision here, especially given my past experience with such changes. I checked, and the Perl library (version 1.30, latest release at the moment) transliterates ւ as w.

The fact is that I'm not prepared to invest the time, nor do I have the linguistic knowledge, to develop Python Unidecode independently of the Perl version (which would be required if context-dependent transliteration were implemented).

bittlingmayer commented 7 years ago

Understood.

(For others hitting this issue, for my purposes I will just replace 'ու' with 'u' before passing the string to unidecode.)
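Roughly like this (handling of the uppercase forms is my own guess; adjust as needed):

from unidecode import unidecode

def armenian_unidecode(text):
    # Pre-replace the digraph so unidecode never sees the standalone ւ it contains.
    text = text.replace("ու", "u").replace("Ու", "U").replace("ՈՒ", "U")
    return unidecode(text)

print(armenian_unidecode("Բովանդակություն"))  # 'Bovandakut`yun'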