Kozea / Pyphen

Hy-phen-ation made easy
https://courtbouillon.org/pyphen

Hungarian hyphenation is faulty in case of vowel-consonant-vowel-* words #61

Closed aswna closed 10 months ago

aswna commented 10 months ago

Hello, using the latest Pyphen (0.14.0), there seems to be an issue with the hyphenation of Hungarian words that start with a vowel-consonant-vowel sequence. E.g.: "alak" should be hyphenated as "a-lak" (currently not hyphenated by Pyphen at all), and "alaktalan" as "a-lak-ta-lan" (incorrectly hyphenated as "alak-ta-lan" by Pyphen).

I saw you suggested here to check with https://www.ushuaia.pl/hyphen/?ln=en (selecting language: Hungarian). The hyphenation of these types of words is also faulty there. I also checked these words in LibreOffice (7.3.7.2); it has the same issue.

Notes:

What should be used for cross-checking instead is https://helyesiras.mta.hu/helyesiras/default/hyph# . Note: MTA (mta.hu) is the Hungarian Academy of Sciences. All the hyphenations I checked there were correct, including the words above.

Thanks for looking into this!

liZe commented 10 months ago

Hi!

Thanks for this report.

> I saw you suggested here to check with https://www.ushuaia.pl/hyphen/?ln=en (selecting language: Hungarian). The hyphenation of these types of words is also faulty there. I also checked these words in LibreOffice (7.3.7.2); it has the same issue.

Then it means that the problem (or maybe it’s a known limitation) comes from the dictionary. The best way to solve this is to talk with the authors of the dictionary; you’ll find more information about them in this file.

> What should be used for cross-checking instead of the above is https://helyesiras.mta.hu/helyesiras/default/hyph# .

I suggest using ushuaia.pl because it uses the same dictionary as Pyphen but not the same code. So, if users see the same problem with both Pyphen and ushuaia.pl, the problem is in the dictionary (which we don’t maintain and can’t fix), and not in the code (which we maintain and can fix).

aswna commented 10 months ago

For the record: this behavior is due to the following note about hyphenation in the Hungarian spelling rule book:

"A word-initial or word-final syllable consisting of a single vowel, although its independence cannot be disputed from a linguistic point of view, is for aesthetic reasons not usually left on its own at the end of a line, or carried over to the next line" -- https://helyesiras.mta.hu/helyesiras/default/akh12#F8 (chapter 226.).

Meaning that although a break such as "a-lak" is correct, it is not considered "nice" (in typeset text) to leave a single vowel at the end of a line, or to carry one over to the start of the next line.
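To illustrate the effect of this rule, here is a minimal Python sketch. The function name `typographic_breaks` and the list-of-syllables input are assumptions made for illustration only; this is not how Pyphen or the dictionary implements it. It merges a lone-vowel first or last syllable into its neighbour, which reproduces the "alak-ta-lan" result reported above.

```python
# Hungarian vowels, same set as used elsewhere in this thread.
VOWELS = set("aáeéiíoóöőuúüű")


def typographic_breaks(syllables):
    """Merge a single-vowel first or last syllable into its neighbour.

    `syllables` is the linguistically complete split of one word,
    e.g. ["a", "lak", "ta", "lan"].
    """
    parts = list(syllables)
    # A lone vowel would otherwise be left alone at the end of a line.
    if len(parts) > 1 and parts[0].lower() in VOWELS:
        parts[0:2] = [parts[0] + parts[1]]
    # A lone vowel would otherwise be carried over to the next line.
    if len(parts) > 1 and parts[-1].lower() in VOWELS:
        parts[-2:] = [parts[-2] + parts[-1]]
    return parts


print("-".join(typographic_breaks(["a", "lak", "ta", "lan"])))  # alak-ta-lan
```

With `["a", "lak"]` as input the result is the single part `["alak"]`, which matches "alak" not being hyphenated at all.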

I contacted the authors, who confirmed that this "hyph" dictionary on its own is not completely suitable for finding all the hyphenation points.

László Németh suggested the following workaround, which works for simple cases:

$ /home/laci/libreoffice/workdir/UnpackedTarball/hyphen/example /home/laci/libreoffice/dictionaries/hu_HU/hyph_hu_HU.dic /dev/stdin | sed 's/^\([aáeéiíoóöőuúüű]\)\(\([^aáeéiíoóöőuúüű]\|cs\|gy\|ny\|sz\|ty\|zs\)\?[aáeéiíoóöőuúüű]\)/\1=\2/;s/\([aáeéiíoóöőuúüű]\)\([aáeéiíoóöőuúüű]\)$/\1=\2/'
agyabugyál
Fehérlófia
a=gya=bu=gyál
fe=hér=ló=fi=a
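For readers who want to apply the same post-processing to Pyphen output, the two sed substitutions can be sketched in Python roughly as follows. The function name `add_edge_breaks` is hypothetical, the input "agya=bu=gyál" is an assumed dictionary output, and the sketch expects lowercase input, as in the example above. It only adds a break after a word-initial single vowel and before a word-final single vowel; it does nothing else.

```python
import re

VOWELS = "aáeéiíoóöőuúüű"
DIGRAPHS = "cs|gy|ny|sz|ty|zs"  # same list as in the sed command

# Word-initial vowel, then at most one consonant (or digraph), then a vowel.
_LEADING = re.compile(rf"^([{VOWELS}])((?:{DIGRAPHS}|[^{VOWELS}])?[{VOWELS}])")
# Two adjacent vowels at the very end of the word.
_TRAILING = re.compile(rf"([{VOWELS}])([{VOWELS}])$")


def add_edge_breaks(hyphenated, sep="="):
    """Add the edge breaks that the dictionary leaves out.

    `hyphenated` is one lowercase word with `sep` already inserted at
    the positions found by the hyphenation dictionary.
    """
    def join(match):
        return match.group(1) + sep + match.group(2)

    word = _LEADING.sub(join, hyphenated, count=1)
    return _TRAILING.sub(join, word, count=1)


print(add_edge_breaks("agya=bu=gyál"))          # a=gya=bu=gyál
print(add_edge_breaks("fe=hér=ló=fia"))         # fe=hér=ló=fi=a
print(add_edge_breaks("alak-ta-lan", sep="-"))  # a-lak-ta-lan
```

The "alak-ta-lan" input is the Pyphen output reported at the top of this issue, so the last call shows the workaround producing the expected "a-lak-ta-lan".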

He also noted that the above does not handle compound words, but that it can be extended by using Hunspell's morphological analysis (see the "st:" and "pa:" fields):

$ hunspell -m
elagyabugyál
elagyabugyál ip:PREF sp:el st:agyabugyál po:vrb ts:PRES_INDIC_INDEF_SG_3

szappanopera
szappanopera pa:szappan st:szappan po:noun ts:NOM pa:opera st:opera po:noun ts:NOM

Note: the correct hyphenations for the above are "el-a-gya-bu-gyál" and "szap-pan-o-pe-ra" (and not "e-la-gya-bu-gyál" and "szap-pa-no-pe-ra"), because a break must fall at the prefix or compound boundary.
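A rough Python sketch of that extension, under several assumptions: the helper names (`split_parts`, `hyphenate_compound`, `add_edge_breaks`) are hypothetical, the analysis line is taken verbatim from `hunspell -m` as shown above, only the "sp:" (surface prefix) and "pa:" (compound part) fields are used to find boundaries, and the per-part hyphenator is a stand-in dictionary with assumed outputs rather than a real call into Pyphen.

```python
import re

VOWELS = "aáeéiíoóöőuúüű"
_LEADING = re.compile(rf"^([{VOWELS}])((?:cs|gy|ny|sz|ty|zs|[^{VOWELS}])?[{VOWELS}])")
_TRAILING = re.compile(rf"([{VOWELS}])([{VOWELS}])$")


def add_edge_breaks(part, sep):
    """Same edge-vowel workaround as the sed command, applied to one part."""
    def join(match):
        return match.group(1) + sep + match.group(2)
    return _TRAILING.sub(join, _LEADING.sub(join, part, count=1), count=1)


def split_parts(word, analysis):
    """Cut `word` at the boundaries named by sp: and pa: fields."""
    parts, rest = [], word
    for field in analysis.split():
        tag, _, value = field.partition(":")
        if tag in ("sp", "pa") and value and rest.startswith(value):
            parts.append(value)
            rest = rest[len(value):]
    if rest:  # whatever follows the last recognised boundary
        parts.append(rest)
    return parts


def hyphenate_compound(word, analysis, hyphenate_part, sep="-"):
    """Hyphenate each part separately, then join the parts with `sep`."""
    pieces = (add_edge_breaks(hyphenate_part(p), sep)
              for p in split_parts(word, analysis))
    return sep.join(pieces)


# Stand-in for the dictionary-based hyphenator (assumed outputs).
fake_dictionary = {"el": "el", "agyabugyál": "agya-bu-gyál",
                   "szappan": "szap-pan", "opera": "ope-ra"}.get

print(hyphenate_compound(
    "elagyabugyál",
    "elagyabugyál ip:PREF sp:el st:agyabugyál po:vrb ts:PRES_INDIC_INDEF_SG_3",
    fake_dictionary))  # el-a-gya-bu-gyál
print(hyphenate_compound(
    "szappanopera",
    "szappanopera pa:szappan st:szappan po:noun ts:NOM pa:opera st:opera po:noun ts:NOM",
    fake_dictionary))  # szap-pan-o-pe-ra
```

In real use, `hyphenate_part` could be something like `pyphen.Pyphen(lang="hu_HU").inserted`. The sketch does not cover inflected compounds whose surface form differs from the listed parts, or words with several competing analyses.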

aswna commented 10 months ago

Thanks!