commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Description on how soft line breaks are treated in browsers doesn't consider the combination of Firefox & Chinese/Japanese text and should be improved #744

Open tats-u opened 1 year ago

tats-u commented 1 year ago

https://spec.commonmark.org/0.30/#softbreak

The description of a soft line break looked ambiguous or questionable to me.

A soft line break may be rendered in HTML either as a line ending or as a space.

Does this mean that Markdown-to-HTML converters are allowed to convert a soft line break in Markdown to either "\n" (or possibly "\r" or "\r\n") or " " in HTML? In specifications, "may" generally means "is allowed to but does not have to" (RFC 2119), which confused me. I have no idea when "a soft line break is rendered in HTML as a line ending". Is it when `whitespace: preserve` or some other value is set in CSS? Also, is there a case in which a soft line break is rendered in HTML as something other than a line ending or a space?

The result will be the same in browsers.

This is wrong. How "\n" in HTML is rendered differs among browsers when the text contains Chinese or Japanese.

https://drafts.csswg.org/css-text-4/#line-break-transform

Then any remaining segment break is either transformed into a space (U+0020) or removed depending on the context before and after the break. The rules for this operation are UA-defined in this level.

This means how a soft line break is rendered depends on browsers' implementations.

In languages that have no word separators, such as Chinese, “unbreaking” a line requires joining the two lines with no intervening space.

這個段落是那麼長,
在一行寫不行。最好
用三行寫。

這個段落是那麼長,在一行寫不行。最好用三行寫。

Only Firefox follows this recommendation as of now. (However, spaces are inserted, as in the other browsers, when the text is copied and pasted somewhere else!)

https://codepen.io/tats-u/pen/YzdKKyN

<p lang="zh-Hant-tw">這個段落是那麼長,
在一行寫不行。最好
用三行寫。</p>

image ↑Firefox (intended)

image ↑Edge (WebKit / Blink / IE; not intended; space after "," is selected)

https://codepen.io/tats-u/pen/poQQVyR (Which CJK characters cause a newline between them to be removed? → Korean is treated like alphanumeric characters, unlike Japanese.)

<p>잘자
(
잘자
)
잘자
잘자
あああ
(
あああ
)
ああああ
1
ああ
ああ
。
1
。</p>

image ↑Firefox

image ↑Edge (and other WebKit & Blink based browsers / IE)

https://spec.commonmark.org/dingus/?text=%23%23%20%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%81%A8%E4%B8%AD%E5%9B%BD%E8%AA%9E%E3%81%AE%E4%BE%8B%0A%0A%E3%81%93%E3%82%8C%E3%81%AF%E6%97%A5%E6%9C%AC%0A%E8%AA%9E%E3%81%AE%E6%96%87%E7%AB%A0%E3%81%A7%0A%E3%81%99%E3%80%82%E8%BF%99%E6%98%AF%E4%B8%80%0A%E4%B8%AA%E4%B8%AD%E6%96%87%E5%8F%A5%E5%AD%90%E3%80%82%0A%0A

image ↑Firefox (looks natural)

image ↑Edge (WebKit / Blink / IE; doesn't look natural)

From these results, we can conclude that in Firefox only a "\n" between Chinese or Japanese characters (han/kana) or punctuation marks is removed instead of being replaced with " ".

Also,

The result will be the same in browsers.

This sentence should be replaced with something like:

The result will follow the CSS Text Module specification and might depend on the browser (but should be the same in languages that use a space to separate words).

wooorm commented 1 year ago

The phrasing is a bit weird in my opinion, “rendered in HTML”, more like: “when compiled to HTML, a soft line break may be shown as a line ending or as a space”.

To recap this issue:

- CSS was changed to allow browsers to be smarter in some cases
- Some browsers now have different defaults, so this text in the CM spec is no longer correct

Right?

Compared to your suggestion, I don’t think it’s good to mention deep specs. How about:

- (A soft line break may be rendered in HTML either as a [line ending](https://spec.commonmark.org/0.30/#line-ending) or as a space. The result will be the same in browsers. In the examples here, a [line ending](https://spec.commonmark.org/0.30/#line-ending) will be used.)
+ (A soft line break may be shown by browsers as a [line ending](https://spec.commonmark.org/0.30/#line-ending), a space, or nothing at all. In the examples here, a [line ending](https://spec.commonmark.org/0.30/#line-ending) will be used.)

I’d also personally prefer to be a bit stronger in our markdown spec, and say that we actually specify \n -> \n (trimmed)?

tats-u commented 1 year ago

“when compiled to HTML, a soft line break may be shown as a line ending or as a space”

It is much clearer than the expression in the spec.

CSS was changed to allow browsers to be smarter in some cases

Correct. The first change was introduced in the Working Draft 15 of the Text Module Level 3 in 2011.

Otherwise, if the script context on one side of the line feed is Hangul, then the line feed is converted to a space (U+0020). Otherwise, if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A), then the line feed is removed.
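To make the two quoted clauses concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not what any browser actually implements: the function names are hypothetical, Hangul is detected through the Unicode character name because the standard library exposes no Script property, the clauses that precede the quoted "Otherwise" are omitted, and the final fallback to a space is assumed.

```python
import unicodedata


def is_hangul(ch: str) -> bool:
    # Approximation (assumption): treat a character as Hangul when its
    # Unicode name starts with "HANGUL".
    return unicodedata.name(ch, "").startswith("HANGUL")


def is_east_asian_wide(ch: str) -> bool:
    # East Asian Width F (fullwidth), W (wide), or H (halfwidth).
    # A (ambiguous) is deliberately excluded, as in the quoted rule.
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")


def transform_line_feed(before: str, after: str) -> str:
    """Return what replaces a line feed between two characters."""
    if is_hangul(before) or is_hangul(after):
        return " "  # Hangul on either side: convert to a space
    if is_east_asian_wide(before) and is_east_asian_wide(after):
        return ""  # wide characters on both sides: remove the line feed
    return " "  # assumed fallback for every other context


def unbreak(text: str) -> str:
    """Join the lines of a paragraph using transform_line_feed."""
    lines = text.split("\n")
    out = lines[0]
    for line in lines[1:]:
        if out and line:
            out += transform_line_feed(out[-1], line[0])
        out += line
    return out
```

For example, `unbreak("これは日本\n語の文章で\nす。")` joins the three lines without spaces, while `unbreak("잘자\n잘자")` keeps a space between the two Korean words, which matches the Firefox behavior described earlier in this thread.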

The behavior became browser-defined in 2021 because of https://github.com/w3c/csswg-drafts/issues/5086. A strict rule existed in the version just before that. (WebKit-based browsers and IE didn't follow it at all, though.)

As you know, no browser except Firefox has followed it to this day, even though more than 10 years have passed. Firefox changed its behavior in 2008.

Some browsers now have different defaults, so this text in the CM spec is no longer correct

We might have to say "The CM spec has ignored the behavior of some browsers." instead. It depends on when the first CM spec before v0.5 (in 2014) was published. Firefox's current behavior has existed since 2008. I don't believe Firefox's change is earlier, because Markdown seems to have been born in 2004. At the very least we can't say "now", because Firefox's change is already 15 years old.

Compared to your suggestion, I don’t think it’s good to mention deep specs.

FYI, at first I thought HTML itself defined the rule and tried to find it in the HTML spec, but I couldn't. I finally found it in the CSS spec instead. I do not want readers of the CM spec to repeat the same mistake. I want those who are looking for the underlying specification to go to the CSS spec first instead of the HTML spec.

(A soft line break may be shown by browsers as a line ending, a space, or nothing at all. In the examples here, a line ending will be used.)

It will be clearer if we split the description in the former sentence into 2 phases:

A soft line break must be converted to (rendered as) a line ending or a space in HTML. In the examples here, a line ending will be used. A line ending in HTML is rendered as a space or simply removed by browsers.

wooorm commented 1 year ago

at first I thought HTML itself had decided the rule

For HTML, it’s all “inter-element whitespace”. CM cares about HTML, not really about CSS.

I do not want readers of the CM spec to repeat the same mistake.

Can you put this “mistake” into concrete words? What are you worried about that other people might do?

I want those who want to find the most basic specification to access to the CSS spec first instead of the HTML spec.

What?

It will be clearer if we split the description in the former sentence into 2 phases:

I don’t want to talk about CSS, just the markdown -> html part? I feel like it’s better to not touch on CSS if we don’t need it, and keep it simple?

tats-u commented 1 year ago

Can you put this “mistake” into concrete words? What are you worried about that other people might do?

I thought HTML also had a rule for how to render "inter-element whitespace" on screen and tried to find one in the HTML spec first. I worry that other people looking for such a rule will, like me, turn the HTML spec upside down first instead of the CSS spec.

What?

Could you tell me what follows after that "What"?

I don’t want to talk about CSS, just the markdown -> html part?

We wouldn't have to mention CSS if the CM spec banned converting the soft line break to anything other than a newline. If it allows converting it to " " or "", we need to encourage developers of formatters, renderers, and converters that convert the soft line break to a space or remove it to align those conversion rules with the rendering rules in browsers.

I want the CM spec to mean either of the following two (1. or 2.):

  1. Softbreak must be converted to a newline.
  2.
      1. Softbreak must be converted to a newline or a space, or just be removed.
      2. Converting it to a newline is the most reliable and recommended way.
      3. If it is not converted to a newline, it is recommended to conform to the way browsers display line breaks.

Once we describe the details of "the way browsers display line breaks" in 2-iii, we won't be able to avoid mentioning the CSS spec.
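As a rough illustration of the three conversions discussed in this thread (newline, space, or removal), here is a small Python sketch of a converter option. `SoftbreakMode` and `render_paragraph` are hypothetical names invented for this example and do not correspond to the API of any existing CommonMark implementation.

```python
import html
from enum import Enum


class SoftbreakMode(Enum):
    NEWLINE = "\n"  # keep the line ending and let the browser decide
    SPACE = " "     # always emit a space
    REMOVE = ""     # join the lines with nothing in between


def render_paragraph(lines, mode=SoftbreakMode.NEWLINE):
    """Render the lines of one paragraph, joining soft breaks per mode."""
    escaped = [html.escape(line, quote=False) for line in lines]
    return "<p>" + mode.value.join(escaped) + "</p>"
```

With the default `NEWLINE` mode the converter defers to the browser's own handling of the line ending, so it cannot disagree with it. A converter that chooses `SPACE` or `REMOVE` takes that decision itself, which is why it would need guidance on matching how browsers display line breaks.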