avian2 / unidecode

ASCII transliterations of Unicode text - GitHub mirror
https://pypi.python.org/pypi/Unidecode
GNU General Public License v2.0

Suggested changes, fixes and updates to Hebrew transliteration #67

Open eyaler opened 3 years ago

eyaler commented 3 years ago

I would like to ask for @alonbl's feedback/green light before preparing my PR. I am interested in addressing several issues I see in the current Hebrew transliteration:

  1. 05ef (triple yod): can now be transliterated as YYY
  2. It seems inconsistent to me to have rafe as - and dagesh as '. If we are going by the graphics, then dagesh should be . (dot). But I think a more useful choice would be to ignore both of them (as is currently done for the shin dots).
  3. Better alignment with Hebrew Language Academy rules (https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf):
     a. 05d7 ח is never transliterated as KH. A more standard-compliant version would be H (to differ from h) or h.
     b. It is inconsistent to transliterate א as A and ע as a backtick. ע could be A or 'A or A'. But mind you, all these choices, including the one for א, are non-standard. Also, the backtick for ע is from the "exact standard", but we are otherwise following the "simple standard" here, which uses '. I am really not sure what the right thing to do here is. We could also follow other languages and use the letter name in these cases: ALEPH and AYIN.
     c. Using @ for schwa is consistent with the IPA symbol, but it is not useful and not part of the Hebrew standard, which ignores schwa in transliteration (or in some cases uses e).
     d. ק should be k as in the simple standard (q is used in the exact standard).
  4. I am not sure what 05f5, 05f6, 05f7 are, as they are not part of Unicode afaict.
  5. Fixes in Hebrew presentation forms (https://www.unicode.org/charts/PDF/UFB00.pdf):
     a. fb4f should be EL not l
     b. fb4e should be f not p
     c. fb4d should be KH not k
     d. fb4c should be v not b
     e. fb4b should be o not vo; similarly fb1d should be i not yi
     f. fix e.g. sh, ts to be SH, TS as done in regular letters
     g. fb47 should be k not ts (this is a mistake)
     h. fb41 should be s not n (this is a mistake)
     i. fb3e should be m not l (this is a mistake)
     j. fb30, currently missing, should be i
     k. fb27 should be r not m (this is a mistake)
     l. add fb21, fb20 similar to the choices decided on for regular א, ע
  6. Graphically, sof-pasuk looks like : but for NLP tasks it would be more useful to use "." or even ". " as this is the meaning of the punctuation.
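
To make a few of the points above concrete, here is a minimal sketch of how some of the proposed values could be tried out on sample text. The name `PROPOSED_OVERRIDES` and the helper `apply_overrides` are hypothetical and only for illustration; this is not Unidecode's table format or API, and the values are the proposals from items 2, 5 and 6, still subject to review.

```python
# Hypothetical override table: a subset of the values proposed in this issue.
# Keys are code points, values are the suggested ASCII replacements.
PROPOSED_OVERRIDES = {
    0x05BC: "",    # dagesh: ignored (item 2)
    0x05BF: "",    # rafe: ignored (item 2)
    0x05C3: ".",   # sof-pasuk: "." instead of ":" (item 6)
    0xFB4F: "EL",  # item 5a
    0xFB4E: "f",   # item 5b
    0xFB4D: "KH",  # item 5c
    0xFB4C: "v",   # item 5d
    0xFB4B: "o",   # item 5e
    0xFB1D: "i",   # item 5e
    0xFB47: "k",   # item 5g
    0xFB41: "s",   # item 5h
    0xFB3E: "m",   # item 5i
    0xFB27: "r",   # item 5k
}


def apply_overrides(text: str) -> str:
    """Replace only the code points listed above; leave everything else as-is."""
    return text.translate(PROPOSED_OVERRIDES)


if __name__ == "__main__":
    sample = "\ufb4f\u05c3"
    print(repr(apply_overrides(sample)))
```

Characters that are not in the table pass through unchanged, so this only shows the effect of the proposed values, not a full transliteration.
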
eyaler commented 3 years ago

We would be happy to do the PR if @avian2 is interested.

alonbl commented 3 years ago

Hi, It would be great if you prepared a patch with the proposed changes, or at least a clear char-to-char table so it will be easier to review. Please make sure that the "special" uncommon symbols are clearly marked as such, so people will not be confused. I am using this transliteration for many texts on non-Hebrew displays, and so far it is working quite well. Regards, Alon
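
One possible way to produce such a char-to-char table is sketched below, using only Python's standard `unicodedata` module. The function name `review_rows`, the column layout, and the sample values are assumptions for illustration; the "presentation form" flag is just one guess at how the uncommon symbols could be marked.

```python
import unicodedata


def review_rows(proposed):
    """Build tab-separated rows: code point, Unicode name, proposed value, note."""
    rows = []
    for cp in sorted(proposed):
        # Fall back to a placeholder when the code point has no name.
        name = unicodedata.name(chr(cp), "<no name>")
        # Hebrew presentation forms occupy U+FB1D..U+FB4F; flag them as uncommon.
        note = "presentation form" if 0xFB1D <= cp <= 0xFB4F else ""
        rows.append(f"U+{cp:04X}\t{name}\t{proposed[cp]!r}\t{note}")
    return rows


if __name__ == "__main__":
    # Sample values taken from the proposals in this issue, not final decisions.
    sample = {0x05C3: ".", 0xFB4F: "EL", 0xFB47: "k"}
    print("\n".join(review_rows(sample)))
```
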

avian2 commented 3 years ago

Hi

@alonbl, if you can review @eyaler's pull request I would be happy to accept it (since some of the proposed changes touch your changes in https://github.com/avian2/unidecode/commit/81f938d9419f4b651a089a0d809bd1a0566b1329). I don't know Hebrew and can't comment on the suggested changes.

> i am not sure what are 05f5, 05f6, 05f7 as they are not part of unicode afaict

If the codepoints are undefined in Unicode, please set them to None in the transliteration tables.
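
A quick way to check this is sketched below, assuming Python's bundled `unicodedata` module is an acceptable reference. The helper `is_unassigned` and the `entries` dict are hypothetical names for illustration and do not reflect the actual layout of the transliteration tables; the result also depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def is_unassigned(cp: int) -> bool:
    """True when the code point has general category Cn (not assigned)."""
    return unicodedata.category(chr(cp)) == "Cn"


if __name__ == "__main__":
    candidates = [0x05F4, 0x05F5, 0x05F6, 0x05F7]
    # Hypothetical stand-in for table entries: None marks "no transliteration".
    entries = {cp: None for cp in candidates if is_unassigned(cp)}
    for cp in candidates:
        status = "unassigned -> None" if cp in entries else "assigned"
        print(f"U+{cp:04X}: {status}")
```
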

> graphically sof-pasuk looks like : but for nlp tasks would be more useful to use "." or even ". " as this is the meaning of the punctuation.

I trust your judgment in choosing the best compromise here.

Thanks!

alonbl commented 3 years ago

On Mon, 2 Aug 2021 at 20:32 Tomaž Šolc wrote:

> Hi
>
> @alonbl, if you can review @eyaler's pull request I would be happy to accept it (since some of the proposed changes touch your changes in https://github.com/avian2/unidecode/commit/81f938d9419f4b651a089a0d809bd1a0566b1329). I don't know Hebrew and can't comment on the suggested changes.

I will be glad to, once there is a pull request :) Maybe I am missing something in the GitHub interface?

eyaler commented 3 years ago

@alonbl I haven't opened the PR yet. I hope to get to it soon and will tag you. Some points are a matter of view/use case/agenda, and there really is no clear right choice. If you are interested, Alon, we can discuss. Thanks guys!

eyaler commented 3 years ago

@alonbl please find the table you asked for: https://docs.google.com/spreadsheets/d/1fvQtyDxiVbz4Yp2FY1fSvZ9qVugo2KKC_yX8LofAUGU

alonbl commented 3 years ago

Thanks!

I created a patch with everything I could understand. Since you did not provide edit permission, we will sync on the code. Let's narrow it down; see #68.