Line break in East Asian paragraphs

Zjl37 commented 2 years ago

Hi @ArthurSonzogni, you might not be aware that some scripts don't use spaces as word separators. Currently, a paragraph of Chinese text is never broken into lines; it is always displayed on a single line.

As far as I know, the line breaking rule for Chinese is: a line break can be inserted anywhere in the text, except before certain punctuation marks and after certain other punctuation marks. For example, we don't want to see a comma ， U+FF0C FULLWIDTH COMMA at the start of a line, nor an opening bracket （ U+FF08 FULLWIDTH LEFT PARENTHESIS at the end of a line.
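A minimal sketch of that rule, assuming the text is already decoded into code points and contains only Chinese characters and punctuation. The two sets are short illustrative samples, not the full CLREQ lists, and `BreakOpportunities` is a hypothetical helper name, not an FTXUI function:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Sample of characters that must not start a line (illustrative, incomplete).
static const std::unordered_set<char32_t> kNoLineStart = {
    U'\uFF0C',  // FULLWIDTH COMMA
    U'\u3002',  // IDEOGRAPHIC FULL STOP
    U'\u3001',  // IDEOGRAPHIC COMMA
    U'\uFF09',  // FULLWIDTH RIGHT PARENTHESIS
};

// Sample of characters that must not end a line (illustrative, incomplete).
static const std::unordered_set<char32_t> kNoLineEnd = {
    U'\uFF08',  // FULLWIDTH LEFT PARENTHESIS
    U'\u300A',  // LEFT DOUBLE ANGLE BRACKET
};

// Returns a vector where element i is true if a line break is allowed
// between text[i] and text[i + 1].
std::vector<bool> BreakOpportunities(const std::u32string& text) {
  std::vector<bool> can_break;
  for (std::size_t i = 0; i + 1 < text.size(); ++i) {
    const bool allowed = kNoLineEnd.count(text[i]) == 0 &&
                         kNoLineStart.count(text[i + 1]) == 0;
    can_break.push_back(allowed);
  }
  return can_break;
}
```

Mixed Chinese/Latin text would need more than this, since a break inside a Latin word is not acceptable.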

I think the rules are similar for Japanese and Korean, with slightly different sets of punctuation marks prohibited at line start/end.

Zjl37 commented 2 years ago

Some resources:

Approaches to line breaking, W3C document. This is a broad overview of line breaking in different writing systems. However, in this issue I would like to focus on CJK.
Section 3.1.4 Prohibition Rules for Line Start and Line End, CLREQ. This lists the punctuation marks prohibited at line start/end.
Line breaking rules in East Asian languages, Wikipedia

ArthurSonzogni commented 2 years ago

Thanks for opening this!

That's an interesting question. Currently, a paragraph just takes the text, splits it by spaces, and displays the individual words in a flexbox, with an item separator of gap_x = 1:

Element paragraphAlignLeft(std::string the_text) {
  static const auto config = FlexboxConfig().SetGap(1, 0);
  return flexbox(Split(std::move(the_text)), config);
}

Splitting on something that isn't a space is going to be much more challenging... We want the "space" to be displayed only between two words on the same line, not where a line breaks, which flexbox does very well. Here we would need to insert sometimes a space, sometimes an invisible break opportunity. I don't have an immediate solution, nor a full understanding of the context yet. Thanks for the links!
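To illustrate why splitting by spaces fails here, a small sketch. `SplitBySpace` is a hypothetical stand-in written for this example, not necessarily how the library's own `Split` is implemented:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits the text at every ASCII space and returns the pieces.
std::vector<std::string> SplitBySpace(const std::string& the_text) {
  std::vector<std::string> words;
  std::stringstream ss(the_text);
  std::string word;
  while (std::getline(ss, word, ' ')) {
    words.push_back(word);
  }
  return words;
}
```

An English sentence yields one item per word, so the flexbox can wrap between items. A Chinese sentence without spaces yields a single item, which leaves the flexbox nothing to wrap.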

Maybe I should implement paragraph as a standalone element, without reusing flexbox.

Zjl37 commented 2 years ago

UAX #14: Unicode Line Breaking Algorithm says:

In line breaking it is necessary to distinguish between three related tasks. The first is the determination of all legal line break opportunities, given a string of text. This is the scope of the Unicode Line Breaking Algorithm. The second task is the selection of the actual location for breaking a given line of text. This selection not only takes into account the width of the line compared to the width of the text, but may also apply an additional prioritization of line breaks based on aesthetic and other criteria……The third is the possible justification of lines, once actual locations for line breaking have been determined……

This gives me a clearer picture. We should not treat a break opportunity as an invisible space. Instead, we treat it as a position between two characters. Once the line breaks are determined, we strip all spaces from the end of every line before justification. Flexbox doesn't do this, so it's necessary to implement a standalone element.
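A rough sketch of the second task plus the space stripping, assuming the break opportunities were already computed by some other step. `WrapLines` and `CellWidth` are hypothetical names, the width rule is a crude placeholder for a real terminal-width lookup, and the greedy first-fit selection is just one possible strategy:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Crude placeholder (assumption): code points from U+1100 upward take two
// terminal columns, everything else takes one.
int CellWidth(char32_t c) { return c >= 0x1100 ? 2 : 1; }

// can_break[i] is true if a break is legal between text[i] and text[i + 1],
// so can_break.size() must equal text.size() - 1 for non-empty text.
std::vector<std::u32string> WrapLines(const std::u32string& text,
                                      const std::vector<bool>& can_break,
                                      int max_width) {
  std::vector<std::u32string> lines;
  std::size_t start = 0;
  while (start < text.size()) {
    int width = 0;
    std::size_t end = start;         // One past the last character taken.
    std::size_t last_break = start;  // Last legal break seen; start = none.
    while (end < text.size()) {
      const int w = CellWidth(text[end]);
      // Spaces may hang past the margin: they are stripped afterwards.
      // A line always takes at least one character, to guarantee progress.
      if (text[end] != U' ' && width + w > max_width && end > start) break;
      width += w;
      ++end;
      if (end < text.size() && can_break[end - 1]) last_break = end;
    }
    // Cut at the last legal break; if there is none, force a cut at `end`.
    std::size_t cut = end;
    if (end < text.size() && last_break > start) cut = last_break;
    std::u32string line = text.substr(start, cut - start);
    while (!line.empty() && line.back() == U' ') line.pop_back();
    lines.push_back(line);
    start = cut;
  }
  return lines;
}
```

Because breaks are positions rather than items, the same routine handles "ab cd ef" (breaks after the spaces) and a Chinese sentence (breaks between most characters) without special cases.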

ArthurSonzogni commented 2 years ago

This is not something trivial. https://unicode.org/reports/tr14/

I guess, we need to classify every character into a few categories:

Label	Meaning for the Class
(A)	It allows a break opportunity after in specified contexts.
(XA)	It prevents a break opportunity after in specified contexts.
(B)	It allows a break opportunity before in specified contexts.
(XB)	It prevents a break opportunity before in specified contexts.
(P)	It allows a break opportunity for a pair of same characters.
(XP)	It prevents a break opportunity for a pair of same characters.

Then, find a way to implement layout and rendering appropriately.

Maybe one can reuse existing functions like https://github.com/adah1972/libunibreak/blob/master/src/linebreak.c for the classification, instead of transcribing the Unicode algorithm.

I think I am going to wait until I become sufficiently crazy/brave before attempting to implement this. I will let you know if I do ;-)

WenjunHuang commented 2 years ago

Maybe use harfbuzz to shape text?

ArthurSonzogni / FTXUI

Line break in East Asian paragraphs #320