PhilterPaper / Text-KnuthPlass

Text::KnuthPlass paragraph shaping package for Perl
https://www.catskilltech.com/FreeSW/product/Text%2DKnuthPlass/title/Text%3A%3AKnuthPlass/freeSW_full

Word splitting in non-English languages #2

Open PhilterPaper opened 3 years ago

PhilterPaper commented 3 years ago

I am aware that some languages, such as Dutch and German, have specific rules that change or repeat letters when a word is split across lines. These rules will need to be built either into Text::KnuthPlass itself (which in turn would need to be told which human language is in use) or into a code layer involved with paragraph shaping. They might even belong in an extension to Text::Hyphen or other hyphenation code. Currently, you need to invoke the appropriate Text::Hyphen::XX (where XX is the language code) to find the right places to split a word, but I don't think it goes beyond that.
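To make the kind of rule concrete, here is a minimal sketch in Python (not Perl, and not part of Text::KnuthPlass or Text::Hyphen). The function name `split_with_rule` and its parameters are hypothetical, invented for illustration. Each rule says: at this position the unbroken word contains `nobreak`, but if the word is split there, the first line ends with `pre` plus a hyphen and the next line starts with `post`. The examples are the pre-1996 German "ck" and triple-consonant cases and the Dutch diaeresis case.

```python
def split_with_rule(word, start, nobreak, pre, post, hyphen="-"):
    """Split `word` at a break that rewrites the text around it.

    `nobreak` is the text found at `start` in the unbroken word.
    If the word is split there, `nobreak` is replaced by `pre`
    (end of first line, before the hyphen) and `post` (start of
    the next line). All names here are illustrative assumptions.
    """
    end = start + len(nobreak)
    if word[start:end] != nobreak:
        raise ValueError("rule does not match the word at this position")
    left = word[:start] + pre + hyphen
    right = post + word[end:]
    return left, right


# Pre-1996 German: "ck" becomes "k-k" when split.
print(split_with_rule("Zucker", 2, "ck", "k", "k"))        # ('Zuk-', 'ker')
# Pre-1996 German: a dropped third consonant reappears.
print(split_with_rule("Schiffahrt", 4, "ff", "ff", "f"))   # ('Schiff-', 'fahrt')
# Dutch: the diaeresis is dropped when the word is split before it.
print(split_with_rule("re\u00ebel", 2, "\u00eb", "", "e"))  # ('re-', 'eel')
```

A plain split is the special case where `nobreak`, `pre`, and `post` are all empty, so ordinary hyphenation points and the "special" ones could share one representation.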

PhilterPaper commented 3 years ago

See PDF::Builder's /UniWrap.pm for code which claims to follow the Unicode rules for breaking lines (and words?) according to the script (alphabet) in use. It might be useful for Text::KnuthPlass in dividing up lines in places other than within a word, and/or for non-Latin text.

PhilterPaper commented 2 years ago

UniWrap.pm does not appear to be used anywhere in PDF::Builder, and may be quite obsolete (when compared against the classes table in https://unicode.org/reports/tr14/). This Unicode page does mention quite a few cases of how to handle line splitting, and could be a good starting point (such as for updating UniWrap).

PhilterPaper commented 2 years ago

See PhilterPaper/Perl-PDF-Builder#183 for further thoughts on hyphenation for non-English languages (both Latin alphabet and not).

PhilterPaper commented 1 year ago

See page 11 of Alex Holkner's thesis (https://citeseerx.ist.psu.edu/pdf/ee95750a9dd047b52901efda59819864bb9ede4a) for some interesting thoughts on how to represent splittable words, including those with German/Dutch orthography. In any case, you can't simply break the word into syllables: you also need to indicate any "funny business" that happens where the word is split or put back together, because that affects the lengths of the fragments.
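One possible representation, sketched in Python under my own assumptions (it is not taken from the thesis and is not an existing Text::KnuthPlass structure): store a word as a list of plain fragments and break points, where each break point carries three strings, much like TeX's `\discretionary`. The class `Break` and the functions `render` and `split_at` are hypothetical names. Lengths are counted in characters here purely for illustration; a real layout would use font metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Break:
    pre: str = "-"     # text ending the first line if the break is taken
    post: str = ""     # text starting the next line if the break is taken
    nobreak: str = ""  # text used if the break is not taken


def render(items):
    """Return the unbroken text: breaks contribute their `nobreak` text."""
    return "".join(i.nobreak if isinstance(i, Break) else i for i in items)


def split_at(items, n):
    """Split at the n-th break point (0-based); other breaks stay untaken."""
    seen = -1
    for idx, item in enumerate(items):
        if isinstance(item, Break):
            seen += 1
            if seen == n:
                left = render(items[:idx]) + item.pre
                right = item.post + render(items[idx + 1:])
                return left, right
    raise IndexError("no such break point")


# Pre-1996 German spellings, used only as illustrations.
zuckerdose = ["Zu", Break("k-", "k", "ck"), "er", Break(), "do", Break(), "se"]
schiffahrt = ["Schi", Break("ff-", "f", "ff"), "ahrt"]

print(render(zuckerdose))       # Zuckerdose
print(split_at(zuckerdose, 0))  # ('Zuk-', 'kerdose')
print(split_at(zuckerdose, 1))  # ('Zucker-', 'dose')

# The fragments do not simply add up to the word plus one hyphen:
left, right = split_at(schiffahrt, 0)
print(len(render(schiffahrt)), len(left), len(right))  # 10 7 5
```

With this shape, the width of each candidate line can be computed from `pre`, `post`, and `nobreak` separately, which is the information a line breaker needs when the split changes the spelling.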