HOST-Oman / scribus

Project for adding complex text layout to Scribus DTP program

Hyphenation inside ligatures #144

Closed khaledhosny closed 8 years ago

khaledhosny commented 8 years ago

Check if we are able to break words at hyphens inside ligatures and fix it if it is not working. I suspect it does not work at the moment because we don’t do any special handling here.

Fahad-Alsaidi commented 8 years ago

Do you have a test file?

khaledhosny commented 8 years ago

Here is one. The first frame has plain “f” characters, which are not ligated on trunk and are thus hyphenated; on the CTL branch we ligate them and apply the hyphen after the ligature (which is wrong).

The second frame contains a Unicode ff ligature (the only way to get a ligature in trunk, though it is a legacy Unicode character) and neither trunk nor CTL branch is able to hyphenate it.

hyphenation-ligature.sla.gz

luzpaz commented 8 years ago

Ticket has a typo, should be "Hyphenation"

andreas-vox commented 8 years ago

My recommendation for now: don't hyphenate inside ligatures.

If the shaper returns a ligature, remove the "Hyphenation possible" flag (or better: use the flag from the logically last ligated character). If the text contains a SmartHyphen character, disable ligatures at that point.

That should give correct results (at least for Latin. No ideas about hyphenation in Arabic or Indic)

khaledhosny commented 8 years ago

I was considering this as a last resort. Arabic does not generally use hyphenation. Indic does, but we haven't tested that and I don't have much experience with it.

andreas-vox commented 8 years ago

I think not hyphenating when you have to break up a nice ligature is always a good solution. E.g. in German "stoff-lich": it shouldn't make a big difference if this break opportunity is skipped, and if it does, the user can still manually insert a soft hyphen (IIRC in Scribus we set the "Hyphenation_Possible" flag only for automatic hyphenation; if the user requests a hyphenation point, Scribus inserts a soft hyphen character). And AFAIK we don't handle hyphenation like "Drucker" -> "Druk-ker" anyway. That's something for another day.

andreas-vox commented 8 years ago

BTW, how does Harfbuzz handle soft hyphens? Are they even translated to glyphs?

khaledhosny commented 8 years ago

Yes, I have #145 for things like "Drucker" -> "Druk-ker", but yes that is for another day (or project). Supporting that will likely be the same as breaking inside a ligature as you will need to measure the text before and after the hyphenation and not assume it is the same.

khaledhosny commented 8 years ago

Soft hyphen (like other Unicode control characters) will be returned as a space glyph with its width set to zero. Now the problem is that if it is inside a ligature, HarfBuzz will form the ligature (as it should) and output the glyph for the soft hyphen after the ligature, and it will have the same cluster as the ligature. So f<soft-hyphen>i will result in the glyphs <f_i><space>, and both have cluster 0 (that of the f, the first character in the ligature).
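The cluster problem can be made concrete with a toy model. This is a hand-written stand-in for shaper output, not a call into HarfBuzz: the `Glyph` record, the advance numbers and the helper names are hypothetical, and the sketch assumes left-to-right text with increasing cluster values.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"


@dataclass
class Glyph:
    name: str     # illustrative glyph name
    cluster: int  # index of the first source character in the cluster
    advance: int  # made-up advance width in font units


def cluster_spans(glyphs, text_len):
    """Turn cluster start indices into (start, end) character ranges."""
    starts = sorted({g.cluster for g in glyphs})
    return list(zip(starts, starts[1:] + [text_len]))


def buried_soft_hyphens(text, glyphs):
    """Indices of soft hyphens that sit strictly inside a multi-character cluster."""
    spans = cluster_spans(glyphs, len(text))
    return [
        i
        for i, ch in enumerate(text)
        if ch == SOFT_HYPHEN and any(start < i < end - 1 for start, end in spans)
    ]


text = "f" + SOFT_HYPHEN + "i"
# Ligated case as described in the comment: everything ends up in cluster 0.
ligated = [Glyph("f_i", 0, 580), Glyph("space", 0, 0)]
# For comparison: no ligature, so each character keeps its own cluster.
unligated = [Glyph("f", 0, 300), Glyph("space", 1, 0), Glyph("i", 2, 280)]

print(buried_soft_hyphens(text, ligated))    # [1]
print(buried_soft_hyphens(text, unligated))  # []
```

In the ligated case the soft hyphen at index 1 has no cluster boundary of its own, so a layout routine that only breaks between clusters cannot break there.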

We can work around this by putting ZWNJ next to the manual soft hyphen which will prevent the ligature, but this will prevent it even if no line breaking will happen here which I don’t really like.

andreas-vox commented 8 years ago

I want to rewrite the whole layout routine at some time anyway in order to support a more HTML/CSS style layout. Then we can reconsider hyphenation before shaping. This is similar to using alternates for justification: you have to try different shaper settings in order to find the best solution.

khaledhosny commented 8 years ago

Yes, I think a rewrite is inevitable.

andreas-vox commented 8 years ago

Ok, I looked at the code, and the following should work:

Before sending the chars to HarfBuzz, add a ZWNJ (not Behdad's brother! :-) ) to each soft hyphen.

After getting the result from HarfBuzz, check for ligatures. Set the flags from the last non-control character in the cluster as the layoutflags for this GlyphCluster (or at least copy the HyphenationPossible flag).

That should provide hyphenation outside ligatures as before and no hyphenation inside ligatures.
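The two steps above could look roughly like the following. This is only a sketch, not the code from the pull request: the function names are hypothetical, placing the ZWNJ after the soft hyphen is an assumption, and a per-character flag is taken to mean "automatic hyphenation may break after this character".

```python
SOFT_HYPHEN = "\u00ad"
ZWNJ = "\u200c"
ZWJ = "\u200d"
CONTROL_CHARS = {SOFT_HYPHEN, ZWNJ, ZWJ}


def add_zwnj_to_soft_hyphens(text):
    """Step 1: keep the shaper from ligating across a manual soft hyphen.
    Putting the ZWNJ after the soft hyphen is an assumption of this sketch."""
    return text.replace(SOFT_HYPHEN, SOFT_HYPHEN + ZWNJ)


def cluster_hyphenation_flags(text, char_flags, spans):
    """Step 2: one flag per cluster, copied from its last non-control character."""
    flags = []
    for start, end in spans:
        flag = False
        for i in range(end - 1, start - 1, -1):
            if text[i] not in CONTROL_CHARS:
                flag = char_flags[i]
                break
        flags.append(flag)
    return flags


word = "stofflich"
# Automatic hyphenation allows a break after the second "f" (stoff-lich).
char_flags = [i == 4 for i in range(len(word))]

# Case 1: the font ligates "ffl" (characters 3..5) into one cluster.
spans_ffl = [(0, 1), (1, 2), (2, 3), (3, 6), (6, 7), (7, 8), (8, 9)]
# Case 2: the font only ligates "ff" (characters 3..4).
spans_ff = [(0, 1), (1, 2), (2, 3), (3, 5), (5, 6), (6, 7), (7, 8), (8, 9)]

flags_ffl = cluster_hyphenation_flags(word, char_flags, spans_ffl)
flags_ff = cluster_hyphenation_flags(word, char_flags, spans_ff)
```

With the ffl ligature the break opportunity falls inside the cluster and is dropped (the cluster takes the flag of the "l"); with only an ff ligature the flag of the second "f" survives, so the word can still break after the ligature.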

andreas-vox commented 8 years ago

Ok, I implemented that and it seems to work. How do I push my code?

khaledhosny commented 8 years ago

A pull request against this repository should be fine.

andreas-vox commented 8 years ago

PR is there. It contains two additional small changes to avoid error/warning on OSX

khaledhosny commented 8 years ago

I’m happy with this workaround, until a proper solution for hyphenating ligatures is implemented.

andreas-vox commented 8 years ago

Hi,

in many cases of hyphenation points you are not supposed to use ligatures: compound words in German and English. See https://en.wikipedia.org/wiki/Zero-width_non-joiner for examples.

If you don’t have more pressing issues for me, I’d like to work on shaping text per paragraph and caching the results in StoryText. I also have ideas for storing XML structure in storytext and how to feed it to shaper and layouter.

/Andreas


sommerluk commented 8 years ago

Hi @andreas-vox

I’m glad to hear that you want to do some work on hyphenation and ligatures.

It would be great if Scribus could make typographically correct ligatures automatically, a little bit like hyphenation: Depending on the language of the text, Scribus decides where ligatures have to be suppressed (German example: Auflage without fl ligature) because it’s a word boundary, and where ligatures can be used (German example: Bierflasche with fl ligature). So the user would not have to do this manually. German is probably the language that would benefit most from this feature.

Of course, this is an additional feature, but if the code for hyphenation and ligatures is reworked now, maybe that could be a good moment for such a feature.

There are already quite a few building blocks available for this work. There is a project that is responsible for the new German hyphenation pattern files for TeX:

http://projekte.dante.de/Trennmuster/

They maintain a really big and high quality word list. It contains all the hyphenation points for each word. But it also contains additional information, such as whether a hyphenation point is between a prefix and the word stem, or between two word stems, or between a suffix and word stem … If you want to know where to suppress ligatures, you would ignore the hyphenation points within stems, but you would consider the hyphenation points between stems or between prefix and stem … This information should be enough to create good patterns not only for hyphenation, but also for ligatures. (These ligature patterns have the same format as the hyphenation patterns.)
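To illustrate the idea, here is a minimal sketch that filters typed break points down to the places where a ligature really needs to be suppressed. The `(position, kind)` tuples and the kind labels are a made-up notation for this example, not the word list's actual format, and the ligature set is assumed. Only the two boundary kinds named above are treated as ligature-breaking.

```python
# Assumed set of ligatures a font might offer; real fonts differ.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")
# Boundary kinds (hypothetical labels) at which ligatures should be suppressed.
MORPHEME_KINDS = {"prefix-stem", "stem-stem"}


def straddles_ligature(word, pos):
    """True if some ligature candidate would span across the boundary at pos."""
    return any(
        word.startswith(lig, start)
        for lig in LIGATURES
        for start in range(max(0, pos - len(lig) + 1), pos)
    )


def zwnj_positions(word, breaks):
    """Morpheme boundaries where a ligature would otherwise form.
    `breaks` is a list of (position, kind); a break sits before word[position]."""
    return [
        pos
        for pos, kind in sorted(breaks)
        if kind in MORPHEME_KINDS and straddles_ligature(word, pos)
    ]


# Auf|lage: prefix boundary in the middle of "fl".
print(zwnj_positions("Auflage", [(3, "prefix-stem")]))  # [3]
# Bier|fla|sche: the stem boundary is before "fl", so the ligature may stay.
print(zwnj_positions("Bierflasche", [(4, "stem-stem"), (7, "within-stem")]))  # []
```

Restricting the result to positions that a ligature actually straddles keeps the number of inserted characters small.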

Some months ago, I wrote an InDesign script (JavaScript) to provide automatic ligature setting, and I created such ligature patterns for use within this script. Though the UI of the script is not polished, it is working:

http://sommerluk.github.io/Ligatursatz/

Do you think something similar could be interesting for Scribus? Would it be technically possible?

I have no knowledge about Scribus’ C++ code, but maybe there are other things that I can do?

khaledhosny commented 8 years ago

We support turning off OpenType features for any sequence of characters using character styles, so I think a script can be easily written to do this, and it would be easier to update and maintain than hacking the C++ code directly. If someone can work on such a script and prove that a workable solution is found, we can later incorporate it directly into the code.

sommerluk commented 8 years ago

Could this create side-effects? How exactly would it work? Does Scribus CTL segment the text and shape each part with a different character style separately, so that there will never be a ligature between text segments with different character styles, even if both character styles have ligatures enabled? I suppose I have to mark either the character before the break or the character after the break with a character style that suppresses ligatures, right?

I suppose both possibilities can create problems because they prevent not only ligatures at the breaking point, but also one character before or after the breaking point. German example: Breaking the ffi ligature.

“auffinden” should be broken “auf-finden” (without ffi ligature, but with fi ligature)

“Baustoffindustrie” should be broken “Baustoff-industrie” (without ffi ligature, but with ff ligature)

What I do in my InDesign script is simply adding (or removing) zero width non-joiner characters (ZWNJ, U+200C) within the text. Is this also an option for Scribus CTL? The advantage is that it is technically proper. The disadvantage is that it inserts many ZWNJ characters in the text, and this makes text editing less comfortable.
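A minimal sketch of such an add-or-remove step is below. It is not taken from the InDesign script; `strip_zwnj` and `set_zwnj` are hypothetical helpers, and positions are assumed to be indices into the ZWNJ-free text that lie strictly inside the word.

```python
ZWNJ = "\u200c"


def strip_zwnj(text):
    """Remove every ZWNJ, giving back the plain text."""
    return text.replace(ZWNJ, "")


def set_zwnj(text, positions):
    """Return text with a ZWNJ before each character index in `positions`.
    Existing ZWNJs are dropped first, so running this twice changes nothing."""
    plain = strip_zwnj(text)
    cuts = sorted(set(positions))
    pieces = [plain[a:b] for a, b in zip([0] + cuts, cuts + [len(plain)])]
    return ZWNJ.join(pieces)


auffinden = set_zwnj("auffinden", [3])                  # auf|finden
baustoff = set_zwnj("Baustoffindustrie", [8])           # Baustoff|industrie

# Re-running on already processed text gives the same result.
assert set_zwnj(auffinden, [3]) == auffinden
# An empty position list removes all ZWNJs again.
assert set_zwnj(baustoff, []) == "Baustoffindustrie"
```

One limitation of stripping first: any ZWNJ the user typed deliberately is lost as well, so a real script would probably need to restrict itself to the languages or runs where that is acceptable.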

Maybe I’ll try to play a little bit with the Scribus Scripter when I’m back home (in one week).

I have a few more questions:

– Do you think introducing a ZWNJ character can be a working solution?

– In 1.5.1, I can add ZWNJ directly in the canvas (with my keyboard that has this character). However, the story editor seems to not correctly support ZWNJ. Has this changed in Scribus CTL?

– Does Scribus CTL support characters outside of Unicode BMP?

– Are there pre-compiled builds of Scribus CTL which I could use for testing?

khaledhosny commented 8 years ago

Some character styles break the shaping e.g. (old) small caps and all caps, superscripts, subscripts. Other styles don’t break the shaping e.g. color, font features, underlining.

You will need to mark one of the characters around the break to suppress the ligature. So in “auffinden” you can mark the first f, and in “Baustoffindustrie” you have to mark the i.

ZWNJ should work, but I think it might break the hyphenation (though I haven’t tested the hyphenation part). The appeal of using a character style is that it does not change the underlying text.

Some stuff changed, but can you elaborate on how the story editor does not support ZWNJ?

Characters outside the BMP are supported.

There are Mac OS X builds here: https://sourceforge.net/projects/scribus/files/scribus-svn/CTL/ (though they can be a bit outdated, @luzpaz knows more about them). There are also OpenSUSE builds here http://download.opensuse.org/repositories/home:/ftake:/scribus:/CTL/ by @ftake.

andreas-vox commented 8 years ago

Hi,

currently we replace Unicode soft hyphens with ZWNJ before shaping. We don’t use Unicode soft hyphens for automatic hyphenation, so that is not affected. If the shaper forms a ligature around a break point, that break point is forgotten. So in effect we don’t break inside ligatures, and when a user requests a break point via soft hyphen, it’s not ligated. Introducing more ZWNJ shouldn’t cause problems.

We should add ZWNJ to the Insert -> Spaces & Breaks menu.

The story editor is pure Qt, so if it ignores your keyboard ZWNJ we can’t do much about it. The Insert menu should still work.

/Andreas


khaledhosny commented 8 years ago

The story editor (and also the canvas) used to ignore inserted characters that it thought the font did not support. ZWNJ might fall into this category since most fonts don’t have a glyph for it, but it is a control character and does not need a glyph in the font. I removed these checks a while ago, and any inserted character will now be accepted whether or not it has a glyph.

sommerluk commented 8 years ago

Hi.

Thanks for your explanations and advice. Back home next week I’ll start some work on this. I think I’ll try it with the ZWNJ solution. Reasons:

– It’s probably easier to implement.

– With the character style solution, it would be necessary to implement an algorithm with a lot of magic to determine the place where the character style is applied. And it would cause problems with words like “Kunststoffflasche”, which has to break as “Kunst-stoff-flasche” with both the ff and the fl ligature, but no ffl ligature.

– It is probably possible to switch from “ZWNJ insertions” to another, better solution later and keep most of the underlying script code like pattern algorithms and so on.
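The “Kunststoffflasche” case from the list above can be played through with a toy ligature matcher. This is a simplified model, not how any particular font or HarfBuzz behaves: it assumes a fixed ligature set, greedy longest-first matching from left to right, and that a character with ligatures switched off by a character style cannot be part of any ligature. All names are hypothetical.

```python
ZWNJ = "\u200c"
# Assumed ligature set, longest first; real fonts differ.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")


def ligatures_formed(text, unligated=frozenset()):
    """Greedy left-to-right matching. Indices in `unligated` stand for
    characters whose character style turns ligatures off."""
    found = []
    i = 0
    while i < len(text):
        for lig in LIGATURES:
            span = range(i, i + len(lig))
            if text.startswith(lig, i) and not set(unligated).intersection(span):
                found.append(lig)
                i += len(lig)
                break
        else:
            i += 1  # no ligature starts here
    return found


word = "Kunststoffflasche"  # wanted: Kunststoff|flasche with ff and fl

# Character style on the f before the break (index 9): ff is lost.
print(ligatures_formed(word, {9}))                          # ['fl']
# Character style on the f after the break (index 10): fl is lost.
print(ligatures_formed(word, {10}))                         # ['ff']
# ZWNJ at the break keeps both ligatures.
print(ligatures_formed("Kunststoff" + ZWNJ + "flasche"))    # ['ff', 'fl']
```

Whichever neighbour of the break gets the character style, one of the two wanted ligatures disappears, while a ZWNJ only separates the two sides. How an untouched “fffl” is ligated depends on the font's lookup order, which this greedy model does not try to capture.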

Indeed ZWNJ breaks hyphenation and also spell checking (at least in 1.5.1), but this could also be considered as a bug of the hyphenation and spell checking engine.

Give me some weeks to work on it. (I’ll have to learn Python first.)

BTW: Yes, adding ZWNJ to the Insert -> Spaces & Breaks menu would be great!

khaledhosny commented 8 years ago

I just checked and ZWNJ does not break hyphenation. It seems that libhyphen handles it just fine and since we now segment words with ICU break iterator (instead of custom regular expressions as in trunk) we also handle it just fine. So I think ZWNJ is the way to go.

khaledhosny commented 8 years ago

ZWJ and ZWNJ are available now in Insert → Character menu.

luzpaz commented 8 years ago

When this is merged into trunk, this ticket (https://bugs.scribus.net/view.php?id=14040) needs to be closed.