Kozea / Pyphen

Hy-phen-ation made easy
https://courtbouillon.org/pyphen

Hyphenation error on German word "Fortschritt" #24

Open DottoreG opened 5 years ago

DottoreG commented 5 years ago

import pyphen
dic = pyphen.Pyphen(lang='de_DE')
dic.inserted('Fortschritt')

This results in 'Fort-s-chritt'. The correct answer would be 'Fort-schritt'.

Although LibreOffice uses the same dictionary, the result seems to be correct there.

Skill3t commented 5 years ago

Same thing with "medizinische": it is not "medizini-sche", it is "me-di-zi-nisch".

wimmuskee commented 4 years ago

A less maintenance heavy solution would be to use (myspell/hunspell) system installed hyphenations (if available). You can use a filename when calling Pyphen:

import pyphen
dic = pyphen.Pyphen(filename='/usr/share/hyphen/hyph_de_DE.dic')
dic.inserted('Fortschritt')

@liZe would you be interested in a PR for a fallback on system installed hyphenations? That way distro packagers could also opt to not install dictionary files, and rely on more up-to-date system hyphenations fully.
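A minimal sketch of what such a fallback could look like from the caller's side. It is not an existing pyphen feature: the helper names `find_system_dictionary` and `dictionary_kwargs` are made up for illustration, and the directory is simply the one from the example above (other distributions may use a different location). It only relies on `pyphen.Pyphen` accepting either `filename=` or `lang=`.

```python
from pathlib import Path
from typing import Optional

# Assumption: the directory used in the example above; distros may differ.
SYSTEM_HYPHEN_DIR = Path("/usr/share/hyphen")


def find_system_dictionary(lang: str, directory: Path = SYSTEM_HYPHEN_DIR) -> Optional[Path]:
    """Return the path to hyph_<lang>.dic inside `directory`, or None if absent."""
    candidate = directory / f"hyph_{lang}.dic"
    return candidate if candidate.is_file() else None


def dictionary_kwargs(lang: str, directory: Path = SYSTEM_HYPHEN_DIR) -> dict:
    """Build keyword arguments for pyphen.Pyphen.

    Uses filename= when a system dictionary exists, otherwise lang= so that
    the dictionary bundled with pyphen is used.
    """
    path = find_system_dictionary(lang, directory)
    if path is not None:
        return {"filename": str(path)}
    return {"lang": lang}
```

With pyphen installed, this would be used as `pyphen.Pyphen(**dictionary_kwargs('de_DE'))`.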

FelixSchwarz commented 4 years ago

A less maintenance heavy solution would be to use (myspell/hunspell) system installed hyphenations (if available).

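We have a patch in Fedora which does something similar. The Fedora package does not ship any dictionaries from pyphen but that has its own drawbacks:

- system dicts may not support locale inheritance ("en_US" -> "en"). At least Fedora's setup does not.
- system dicts may be outdated
- If the patch either enables pyphen's dictionaries or the system dicts user-observable behavior may change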

Therefore I'm planning to use pyphen's dictionaries in a future update (assuming I get the privileges to update pyphen - finally).

Personally if you would support system-provided dicts I'd like to see a way how callers could choose the dictionary source to prevent test failures due to outdated system dicts.

DottoreG commented 4 years ago

A less maintenance heavy solution would be to use (myspell/hunspell) system installed hyphenations (if available). You can use a filename when calling Pyphen:

import pyphen
dic = pyphen.Pyphen(filename='/usr/share/hyphen/hyph_de_DE.dic')
dic.inserted('Fortschritt')

I can't see how it would make it any better. On my system I get the same (wrong) result. I'm using Gentoo with hunspell 1.7.0 and pyphen 0.9.4.

wimmuskee commented 4 years ago

system dicts may not support locale inheritance ("en_US" -> "en"). At least Fedora's setup does not.

Encountered the same issue while making a patch for Gentoo. I was thinking of rewriting the fallback mechanism so that a request for lang would default to lang_LANG. This would work for "de". For "en", I would pick the largest available territory dictionary.

system dicts may be outdated

Other distros seem to have the same issues. However, now the burden to keep all dictionaries up to date falls on the Pyphen maintainers. Also, some dictionaries are not updated at upstream level.

If the patch either enables pyphen's dictionaries or the system dicts user-observable behavior may change

I imagine changing the behaviour of pyphen would result in a version update, and perhaps in incompatibilities with other applications. For one part, this would be similar to introducing pyphen exceptions (where applications using pyphen would expect default Python exceptions). For another part, basing unit tests on content that is provided from other sources can get tricky. Also, would you not rather mock pyphen behaviour when unit testing from another application?

Personally if you would support system-provided dicts I'd like to see a way how callers could choose the dictionary source to prevent test failures due to outdated system dicts.

Continuing on the previous point, if you have to test from another application, I would always use the filename= argument to specify a static dictionary file which can be controlled from the testing application.
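To illustrate the mocking option mentioned above, here is a small sketch using only the standard library. `hyphenate_words` is a hypothetical piece of application code, and the fake object merely stands in for a `pyphen.Pyphen` instance, so the test does not depend on any dictionary file, bundled or system-provided.

```python
from unittest import mock


def hyphenate_words(dic, words):
    """Hypothetical application code: hyphenate each word with the given hyphenator."""
    return [dic.inserted(word) for word in words]


def test_hyphenate_words_uses_hyphenator():
    # Stand-in for a pyphen.Pyphen instance; no dictionary file is read.
    known = {"Fortschritt": "Fort-schritt"}
    fake_dic = mock.Mock()
    fake_dic.inserted.side_effect = lambda word: known.get(word, word)

    result = hyphenate_words(fake_dic, ["Fortschritt", "und"])

    assert result == ["Fort-schritt", "und"]
    fake_dic.inserted.assert_any_call("Fortschritt")
```

When the real hyphenation output matters to the test, passing `filename=` with a dictionary file kept in the test suite, as described above, is the alternative.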

mark-kubacki commented 4 years ago

A word of caution, as I see this is often done wrong: language tag substitution and expansion don't work like that, by simply adding or removing the LL in ll_LL. In most cases you will get away with it, but it's superficial mimicry nonetheless, should someone want to go down that path.

For example, both Swedish and Finnish are spoken in Suomi/Finland. When removing, you'd run into changing the language completely, and when expanding, you'd face a non-trivial choice between (here: at least) two.

https://tools.ietf.org/html/bcp47
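One way to avoid deriving tags by string manipulation is an explicit, hand-maintained table. The sketch below is only an illustration of that idea, not the BCP 47 lookup algorithm and not pyphen's behaviour; `resolve_dictionary` and `DEFAULT_TERRITORY` are hypothetical names. Ambiguous cases are left out of the table, so they yield no match instead of a guess.

```python
from typing import Iterable, Mapping, Optional

# Hypothetical, hand-maintained defaults: which full tag to use for a bare
# language code. Ambiguous cases are deliberately not listed.
DEFAULT_TERRITORY: Mapping[str, str] = {"de": "de_DE"}


def resolve_dictionary(
    requested: str,
    available: Iterable[str],
    defaults: Mapping[str, str] = DEFAULT_TERRITORY,
) -> Optional[str]:
    """Pick a dictionary tag without deriving tags by string manipulation."""
    available_tags = set(available)
    # 1. An exact match always wins.
    if requested in available_tags:
        return requested
    # 2. Otherwise consult the explicit table only.
    mapped = defaults.get(requested)
    if mapped is not None and mapped in available_tags:
        return mapped
    # 3. No guessing: the caller has to decide.
    return None
```

For instance, a request for "sv_FI" returns `None` here even when "sv_SE" and "fi_FI" are both available, leaving the choice to the caller.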

rubenmoor commented 3 years ago
fort-s-chreib-ba-r
fort-s-chreib-ba-re
fort-s-chreib-ba-rem
fort-s-chreib-ba-ren
fort-s-chreib-ba-re-r
fort-s-chreib-ba-res
fort-s-chrei-be
fort-s-chrei-ben
fort-s-chrei-ben-d
fort-s-chrei-ben-de
fort-s-chrei-ben-dem
fort-s-chrei-ben-den
fort-s-chrei-ben-der
fort-s-chrei-ben-des
Fort-s-chrei-bens
fort-s-chreibst
fort-s-chreib-t
Fort-s-chrei-bung
Fort-s-chrei-bun-gen
Fort-s-chrei-bungs-da-tei
fort-s-chrei-te
fort-s-chrei-ten
fort-s-chrei-ten-d
fort-s-chrei-ten-de
fort-s-chrei-ten-dem
fort-s-chrei-ten-den
fort-s-chrei-ten-der
fort-s-chrei-ten-des
Fort-s-chrei-tens
fort-s-chrei-tes-t
fort-s-chrei-tet
fort-s-chrie-b
fort-s-chrie-ben
fort-s-chriebst
fort-s-chrieb-t
fort-s-chrit-t
fort-s-chrit-te
fort-s-chrit-ten
Fort-s-chrit-tes
fort-s-chrit-tes-t
fort-s-chrit-tet
fort-s-chritt-lich
fort-s-chritt-li-che
fort-s-chritt-li-chem
fort-s-chritt-li-chen
fort-s-chritt-li-cher
fort-s-chritt-li-che-re
fort-s-chritt-li-che-rem
fort-s-chritt-li-che-ren
fort-s-chritt-li-che-rer
fort-s-chritt-li-che-res
fort-s-chritt-li-ches
Fort-s-chritt-lich-keit
fort-s-chritt-lichst
fort-s-chritt-lichs-te
fort-s-chritt-lichs-tem
fort-s-chritt-lichs-ten
fort-s-chritt-lichs-ter
fort-s-chritt-lichs-tes
Fort-s-chritts
Fort-s-chritts-an-zei-ge
Fort-s-chritts-an-zei-gen
Fort-s-chritts-bal-ken
Fort-s-chritts-bal-kens
fort-s-chritts-be-geis-ter-t
fort-s-chritts-be-geis-ter-te
fort-s-chritts-be-geis-ter-tem
fort-s-chritts-be-geis-ter-ten
fort-s-chritts-be-geis-ter-ter
fort-s-chritts-be-geis-ter-tes
Fort-s-chritts-be-geis-te-rung
Fort-s-chritts-be-griff
Fort-s-chritts-be-grif-fe
Fort-s-chritts-be-grif-fen
Fort-s-chritts-be-griffs
Fort-s-chritts-be-richt
Fort-s-chritts-be-rich-te
Fort-s-chritts-be-rich-ten
Fort-s-chritts-be-richts
Fort-s-chritts-be-we-gung
Fort-s-chritts-be-we-gun-gen
Fort-s-chritt-s-club
Fort-s-chritt-s-clubs
Fort-s-chritts-den-ken
Fort-s-chritts-den-kens
Fort-s-chritts-dok-trin
Fort-s-chritts-ef-fek-t
Fort-s-chritt-s-ei-fer
Fort-s-chritt-s-ent-wick-lung
Fort-s-chritt-s-ent-wick-lun-gen
Fort-s-chritts-er-zäh-lung
Fort-s-chritts-er-zäh-lun-gen
Fort-s-chritts-fak-tor
Fort-s-chritts-fak-to-ren
Fort-s-chritts-fak-tor-s
fort-s-chritts-feind-lich
fort-s-chritts-feind-li-che
fort-s-chritts-feind-li-chem
fort-s-chritts-feind-li-chen
fort-s-chritts-feind-li-cher
fort-s-chritts-feind-li-ches
Fort-s-chritts-feind-lich-keit
Fort-s-chritts-feind-lich-kei-ten
Fort-s-chritts-för-de-rung
Fort-s-chritts-freun-d
Fort-s-chritts-freun-des
fort-s-chritts-freund-lich
fort-s-chritts-freund-li-che
fort-s-chritts-freund-li-chem
fort-s-chritts-freund-li-chen
fort-s-chritts-freund-li-cher
fort-s-chritts-freund-li-che-re
fort-s-chritts-freund-li-che-rem
fort-s-chritts-freund-li-che-ren
fort-s-chritts-freund-li-che-rer
fort-s-chritts-freund-li-che-res
fort-s-chritts-freund-li-ches
Fort-s-chritts-funk-ti-o-n
Fort-s-chritts-funk-ti-o-nen
Fort-s-chritts-ga-ran-tie
Fort-s-chritts-ga-ran-ti-en
Fort-s-chritts-ge-dan-ke
Fort-s-chritts-ge-dan-ken
Fort-s-chritts-ge-dan-ken-s
Fort-s-chritts-ge-schich-te
Fort-s-chritts-ge-schich-ten
Fort-s-chritts-glau-be
Fort-s-chritts-glau-ben
Fort-s-chritts-glau-bens
fort-s-chritts-gläu-big
fort-s-chritts-gläu-bi-ge
fort-s-chritts-gläu-bi-gem
fort-s-chritts-gläu-bi-gen
fort-s-chritts-gläu-bi-ger
fort-s-chritts-gläu-bi-ge-s
Fort-s-chritts-gläu-big-keit
Fort-s-chritts-gra-d
Fort-s-chritts-gra-de
Fort-s-chritts-gra-den
Fort-s-chritts-hy-po-the-se
Fort-s-chritts-hy-po-the-sen
Fort-s-chritts-ide-e
Fort-s-chritts-ide-en
Fort-s-chritt-s-ideo-lo-gie
Fort-s-chritt-sil-lu-sion
Fort-s-chritts-kar-te
Fort-s-chritts-kar-ten
Fort-s-chritts-klei-d
Fort-s-chritts-klub
Fort-s-chritts-klubs
Fort-s-chritts-kon-trol-le
Fort-s-chritts-kon-trol-len
Fort-s-chritts-kon-zep-t
Fort-s-chritts-kon-zep-te
Fort-s-chritts-kri-ti-k
Fort-s-chritts-kri-ti-ken
Fort-s-chritts-kri-ti-ker
Fort-s-chritts-kri-ti-ke-rin
Fort-s-chritts-kri-ti-ke-rin-nen
Fort-s-chritts-kri-ti-kern
Fort-s-chritts-kri-ti-ker-s
Fort-s-chritts-kur-ve
Fort-s-chritts-kur-ven
Fort-s-chritts-leis-te
Fort-s-chritts-mes-sung
Fort-s-chritts-mes-sun-gen
Fort-s-chritts-mo-dell
Fort-s-chritts-mo-del-le
Fort-s-chritts-mo-dells
Fort-s-chritts-my-then
Fort-s-chritts-my-thos
Fort-s-chritt-s-op-ti-mis-mus
fort-s-chritt-s-o-ri-en-tier-t
fort-s-chritt-s-o-ri-en-tier-te
fort-s-chritt-s-o-ri-en-tier-tem
fort-s-chritt-s-o-ri-en-tier-ten
fort-s-chritt-s-o-ri-en-tier-ter
fort-s-chritt-s-o-ri-en-tier-tes
Fort-s-chritts-par-tei
Fort-s-chritts-par-tei-en
Fort-s-chrittspes-si-mis-mus
Fort-s-chritts-pro-jek-t
Fort-s-chritts-pro-jek-te
Fort-s-chritts-pro-jek-ten
Fort-s-chritts-pro-jek-tes
Fort-s-chritts-pro-jekt-s
Fort-s-chritts-pro-zess
Fort-s-chritts-pro-zes-se
Fort-s-chritts-pro-zes-sen
Fort-s-chritts-pro-zes-ses
Fort-s-chritts-punk-t
Fort-s-chritts-punk-te
Fort-s-chritts-punk-ten
Fort-s-chritts-punk-tes
Fort-s-chritts-quo-te
Fort-s-chritts-quo-ten
Fort-s-chritts-re-ak-ti-o-n
Fort-s-chritts-re-ak-ti-o-nen
Fort-s-chritts-schwei-ne-hun-de
Fort-s-chritts-schwei-ne-hun-den
Fort-s-chritts-schwei-ne-hun-des
Fort-s-chritts-sucht
fort-s-chritt-s-t
Fort-s-chritts-ten-denz
Fort-s-chritts-ten-den-zen
Fort-s-chritts-the-o-rie
Fort-s-chritts-the-o-ri-en
Fort-s-chritts-trau-ma
Fort-s-chritts-trau-mas
Fort-s-chritts-über-wa-chung
Fort-s-chritts-uni-o-n
Fort-s-chritts-uni-o-nen
Fort-s-chritt-s-u-to-pie
Fort-s-chritts-ver-fol-gung
Fort-s-chritts-ver-wei-ge-rer
Fort-s-chritts-ver-wei-ge-rern
Fort-s-chritts-ver-wei-ge-rer-s
Fort-s-chritts-ver-wei-ge-rung
Fort-s-chritts-ver-wei-ge-run-gen
Fort-s-chritts-vor-stel-lung
Fort-s-chritts-vor-stel-lun-gen
Fort-s-chritts-vor-ur-teil
Fort-s-chritts-vor-ur-tei-le
Fort-s-chritts-vor-ur-tei-len
Fort-s-chritts-vor-ur-teils
Fort-s-chritts-werk
Fort-s-chritts-wer-ke
Fort-s-chritts-wer-ken
Fort-s-chritts-werks
Fort-s-chritts-wer-tung
Fort-s-chritts-wer-tun-gen
Fort-s-chritt-szahl
Fort-s-chritt-szah-len
Fort-s-chritt-szah-len-kon-zep-t
Fort-s-chritt-szah-len-kon-zep-te
Fort-s-chritt-szah-len-kon-zep-ten
Fort-s-chritt-szah-len-kon-zept-s
Fort-s-chritt-szeit-ver-fah-ren
Fort-s-chritt-szeit-ver-fah-rens
Fort-s-chritt-szif-fer
Fort-s-chritt-szif-fern
Fort-s-chrittszu-stan-d
Fort-s-chrittszu-stan-des
Fort-s-chritt-szweif-ler
Fort-s-chritt-szweif-le-rin
Fort-s-chritt-szweif-le-rin-nen
Fort-s-chritt-szweif-lern
Fort-s-chritt-szweif-ler-s

(faulty output from dic.inserted)

I would like to understand how the wrong hyphenation comes about. This doesn't seem to be about the .dic file, really. The single s as a syllable doesn't make too much sense to me.
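For sifting through output like the list above, a small helper can flag every fragment that consists of a single character. This is only a diagnostic sketch working on the strings returned by `dic.inserted`; the function name is made up, and it says nothing about why the fragments appear.

```python
from typing import List, Tuple


def single_letter_fragments(hyphenated: str, hyphen: str = "-") -> List[Tuple[int, str]]:
    """Return (index, fragment) for every one-character fragment.

    `hyphenated` is expected to be a word with hyphens already inserted,
    for example the return value of dic.inserted(word).
    """
    parts = hyphenated.split(hyphen)
    return [(index, part) for index, part in enumerate(parts) if len(part) == 1]
```

Applied to the list, this also picks up the trailing single letters, as in 'fort-s-chrit-t', in addition to the isolated "s".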