avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Improve Hebrew and Yiddish #68

Closed alonbl closed 2 years ago

eyaler commented 3 years ago

thanks! you beat me to it. i gave you edit permission, and replied to some of your comments. perhaps you can add a column to make it easier to see which ones you took/ignored/changed? I think the main open issue is the caps. afaiu, the reason you suggested eg SH to be capitalized is to differentiate SH (ש) from s+h (סה). when letters retain their regular meaning as OY=o+y there is no reason for caps. unidecode does not seem to have a policy requiring all substitutions more than 1 char length to be in caps. the transformation is not reversible. I think caps should be reserved for special cases of:

alonbl commented 3 years ago

Thanks, I will fill the file.

I believe the difference between us is that you think we should be phonetically compatible, while I believe this is wrong.

The main use case for unidecode and similar transliteration tools is not a non-native speaker reading the text and "vocally" transmitting it to a native speaker, but native speakers being able to understand the transformed text when the native character set is not supported.

For example, take a car MP3 player that does not support Hebrew. A native speaker has no choice but to rely on a transformation to be able to read the titles. If there is a redundancy in the translation, it is very difficult to grasp the original word. Believe me, I've tried... Our entire family played a competition over who gets it first... and there were some where we could not figure out the original term, although once we had it, it was obvious... but still we failed.

This is the reason why 'צ' MUST be 'TS' and not 'ts': it takes a hell of a lot of time to try all combinations to figure out what the transformation was. This is also the reason why 'א' cannot be the same as 'ע', why 'ק' must differ from 'כ', and so on.

I hope you agree with me about the pattern, this will settle most of the differences.

Regards, Alon

alonbl commented 3 years ago

I updated the document[1]; filter by the first column to see all open issues. Added a new line for 'ט'.

[1] https://docs.google.com/spreadsheets/d/1fvQtyDxiVbz4Yp2FY1fSvZ9qVugo2KKC_yX8LofAUGU/edit#gid=0

eyaler commented 3 years ago

if you are willing, i would be happy to discuss over the phone - i think it would be more efficient. of course this is a transliteration (as opposed to a phonetic transcription), since you cannot know the pronunciation from a single letter. where we presumably differ is that i do NOT believe that the transformation should be reversible; being irreversible is by definition what unidecode is doing by mapping everything to the ascii 127 range. therefore it is fine afaiac that two hebrew chars map to the same ascii char. you already have other examples: both kamats and patah map to "a"; both samekh and sin map to "s" (would you suggest S for sin?); both vav and vet now map to v (would you suggest V for vet? or better, the academy's "exact" taatik using w for vav); both final letter and middle letter map to the same symbol (do you suggest M for mem sofit?); and more. the official "simple" hebrew academy taatik also does this for kaf/kof->k, aleph/ain->' (geresh), tet/tav->t and tsadi/tav+samekh->ts. therefore i don't believe we should try here to be "holier than the pope", and i would be wary of trying to invent a new standard that is a mishmash between the official "simple" and "exact" taatik [1]. (btw, if we did want a 1-1 mapping, i actually recently developed a 1-1 hebrew leet version (based on [2]). this puts emphasis on graphics instead of phonetics. but i don't think this is what we want here, namely due to the fact that the normalization changes the rendering direction to LTR.)

therefore my suggestion is:
1. follow the academy simple taatik as closely as possible.
2. possibly use caps only to differentiate between different phonetic readings (H, SH and KH).
3. make a special exception for alef and ain.

otherwise, we should go over the hebrew block and hebrew presentation forms block again and fix other ambiguities, and if following the "exact" taatik as you suggest for ain (backtick) and kof (q), we should also prefer its other choices, such as w for vav, then use caps to address the many remaining ambiguities.

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf [2] https://he.wikipedia.org/wiki/%D7%A1%D7%9C%D7%A0%D7%92_%D7%91%D7%90%D7%99%D7%A0%D7%98%D7%A8%D7%A0%D7%98
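
The point about several Hebrew letters sharing one ASCII output can be checked mechanically. Below is a minimal Python sketch, assuming a small hypothetical table that contains only the merged pairs named in the comment above (kaf/kof, alef/ain, tet/tav) plus two unambiguous letters. It is not unidecode's actual table, and the names `SIMPLE`, `preimages` and `candidate_count` are made up for illustration.

```python
from collections import defaultdict
from math import prod

# Hypothetical excerpt, NOT unidecode's real table: only the merged pairs
# mentioned in the comment above, plus two unambiguous letters for contrast.
SIMPLE = {
    "כ": "k", "ק": "k",   # kaf, kof
    "א": "'", "ע": "'",   # alef, ain
    "ט": "t", "ת": "t",   # tet, tav
    "ל": "l", "מ": "m",   # lamed, mem
}

def preimages(table):
    """Group source letters by the ASCII output they share."""
    groups = defaultdict(list)
    for letter, out in table.items():
        groups[out].append(letter)
    return dict(groups)

def candidate_count(ascii_text, table):
    """Count the Hebrew spellings that produce ascii_text.

    Assumes every output in the table is a single ASCII character,
    which holds for this excerpt.
    """
    groups = preimages(table)
    return prod(len(groups[ch]) for ch in ascii_text)

# Outputs that more than one source letter maps to.
collisions = {out: ls for out, ls in preimages(SIMPLE).items() if len(ls) > 1}
print(collisions)
print(candidate_count("ktl", SIMPLE))  # 2 * 2 * 1 = 4 candidate spellings
```

The count grows multiplicatively with every shared output in a word, which is the cost of a non-reversible mapping for a reader trying to recover the original.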

alonbl commented 3 years ago

You are focused on the special cases, punctuation (niqqud) and such, that do not actually exist in the modern Hebrew language. Even if modern Hebrew were written with punctuation, the logic of transforming fully punctuated text to phonetic form would be much more complex than a 1:1 character transformation. This is not the use case of tools such as unidecode.

For the record, in the current implementation samech is 's' and shin is 'SH', so that the two can be distinguished.

I clearly stated the use case: ability to read Hebrew text while in Latin charset.

If you use the links web browser in a Linux text terminal and browse google.com, you will notice it also performs a conversion, for example: X+J+P+W+Sh B+Google, which is "חיפוש ב-Google"; notice the 1:1 conversion. Another example from the same page: J+W+T+R+ M+Z+L+ M+ShK+L+, which is "יותר מזל משכל". Notice also the +, which enables people to differentiate between Latin and Hebrew, and notice the Sh for 'ש' as a reference.
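
The 1:1 conversion described above can be sketched in a few lines of Python. The table below is an assumption inferred only from the two examples quoted in this comment; it is not the actual table used by the links browser, and the names `LINKS_STYLE` and `convert` are hypothetical. In the quoted output 'ש' appears as Sh without a trailing +, so the sketch mirrors that.

```python
# Hypothetical table inferred ONLY from the two examples quoted above;
# it is not the actual conversion table used by the links browser.
LINKS_STYLE = {
    "ח": "X+", "י": "J+", "פ": "P+", "ו": "W+", "ש": "Sh",
    "ב": "B+", "ת": "T+", "ר": "R+", "מ": "M+", "ז": "Z+",
    "ל": "L+", "כ": "K+",
}

def convert(text, table=LINKS_STYLE):
    """Replace each character independently; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

print(convert("חיפוש"))          # X+J+P+W+Sh
print(convert("יותר מזל משכל"))  # J+W+T+R+ M+Z+L+ M+ShK+L+
```

Because every letter is replaced on its own, with no look at its neighbours, a reader who knows the table can map each output token back to a single letter.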

I believe you are in the wrong project if you are trying to make punctuation work with a tool that is a character-to-character processor. Even if you had offered such logic for fully punctuated text, you would also have had to provide the option of simplified Hebrew conversion as it exists now.

Thank you for helping to improve the Hebrew special cases, the cleanup of invalid chars, and Yiddish. I believe this patch can be merged as-is and this discussion may continue elsewhere.

eyaler commented 3 years ago

i am not trying to make punctuation work. i just want unidecode hebrew transformations to be self-consistent and as close as possible to official standards.

alonbl commented 3 years ago

It is impossible to match the standards without an AI.

Let's take [1], and let's agree that we are using the precise model and that we handle text without punctuation.

How can we know whether י or ו is pronounced, in order to match the standard? How can we know if א is 'a or 'e? Do we want to omit ע at the head of a word? Too many exceptions. As you can see, ס is 's' and ש is Sh to avoid a conflict; notice the capital S. And צ is Ts; notice the capital T, which is not phonetic as you imagine. Notice ק, which in the precise notation is q; in this case we can change it to the new k while also changing כ to the KH notation, as suggested.

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf

alonbl commented 3 years ago

In other words, these standards are incomplete and are meant for human translation, not for machines.

alonbl commented 3 years ago

I added the כ->KH, ק->k transformation.

eyaler commented 3 years ago

כ should not be kh (that is only for כ rafa). if this is not indicated, כ is always transliterated as k (although for final ך it would probably be ok to also use KH, and the same goes for final ף, which should be f, btw).

י and ו are treated as consonants, so they are Y and V (or W in the exact scheme). implied vowels are not transliterated in what we are doing here. you still have the case of sin with sin dot, which is currently transliterated to small s.

anyway, i feel this back and forth is not very efficient. i would be happy to have a live discussion, and i am open to aligning with your general approach, but i still would like it to be as consistent as possible with itself and with the standards. i can make my own PRs later, but i think it would be best to get to an agreed baseline.

alonbl commented 3 years ago

I truly do not understand how you can distinguish between a ו used as a vowel and one used as a consonant in order to meet the standard. The standard[1] clearly states that vowels should be omitted.

I can accept translating SIN (vs SHIN) to S rather than SH, but not to s, as it conflicts with ס. But again, you are discussing the SPECIAL characters, which are not used in modern Hebrew.

I updated the document[2] to match the output. The remaining issues, apart from those that are not marked as Closed, are related to capitalization of conflicts, which I do not consider a major issue compared to the standard, and I've shown you that other transformations do the same.

I believe most of the items are translated correctly now. Please review this patch and let's agree whether it is progress compared to the existing implementation. If it is, we can merge it and then discuss the remaining items later.

[1] https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf [2] https://docs.google.com/spreadsheets/d/1fvQtyDxiVbz4Yp2FY1fSvZ9qVugo2KKC_yX8LofAUGU/edit#gid=0

eyaler commented 3 years ago

all letters are treated as consonants unless there is a special indication of a vowel. note that if you use S for sin, we would have a new issue of differentiating SH from SHIN vs S+H from SIN+HET.

i think it is progress, apart from the recent changes of כ and ט, which i consider to be a step back.

כ should be k. this is the more representative transliteration, as it is used at the beginning of words (compare to BET). again, we could discuss the case of final forms, but i am not sure we want to have different representations for middle and final forms. and you cannot use K, because then you cannot differentiate KH from KAF RAFA vs K+H from KAF/KOF + HET.

a new issue with the recent T for ט is that now we cannot know if TSH is TET+SHIN or TSADI+HET... (by the way there is also FB38 טּ)

so i suggest for now: revert all כ back to k and ט back to t. change YY to YYY (unidecode does have 3 letter representations). and let's submit it as a new hebrew baseline. Further PR's can then be limited to specific issues. thanks!
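
The TSH ambiguity raised above can be reproduced by enumerating every way an output string splits into table entries. This is a minimal sketch that assumes a hypothetical five-letter excerpt of the table under discussion, keyed by letter name to match the TET+SHIN notation used in this comment. The names `TABLE` and `parses` are made up for illustration, and the table is not unidecode's shipped one.

```python
# Hypothetical excerpt of the table under discussion, keyed by letter name.
# NOT unidecode's shipped table; het as "H" is taken from the discussion.
TABLE = {
    "TET": "T",
    "SHIN": "SH",
    "TSADI": "TS",
    "HET": "H",
    "SAMEKH": "s",
}

def parses(ascii_text, table=TABLE):
    """Return every sequence of letters whose outputs concatenate to ascii_text."""
    if not ascii_text:
        return [[]]
    results = []
    for letter, out in table.items():
        if ascii_text.startswith(out):
            # Consume this letter's output and parse the remainder recursively.
            for rest in parses(ascii_text[len(out):], table):
                results.append([letter] + rest)
    return results

for p in parses("TSH"):
    print("+".join(p))

# With tet reverted to lowercase "t", only one reading of "TSH" remains.
reverted = {**TABLE, "TET": "t"}
print(parses("TSH", reverted))
```

With tet as "T" the string "TSH" has two readings, TET+SHIN and TSADI+HET. Under the reverted table tet plus shin would come out as "tSH", so "TSH" is left with the single reading TSADI+HET.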

avian2 commented 3 years ago

Hi. I would be happy to merge this if you think it is an improvement compared to existing replacements.

alonbl commented 2 years ago

Hi @avian2,

I believe this is progress that is worth merging. The discussion of special characters, and of the ability to apply logic that is not a 1:1 character translation but is based on a dictionary or some other rules, can happen later.

Thanks!

avian2 commented 2 years ago

Merged and released in Unidecode 1.3.0. Thanks!