commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org
Other
4.89k stars 317 forks source link

Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

Open tats-u opened 2 months ago

tats-u commented 2 months ago

https://github.com/commonmark/commonmark.js/issues/261 https://github.com/commonmark/commonmark.js/pull/289

https://spec.commonmark.org/0.31.2/#paragraphs

Leading spaces or tabs are skipped: Example 222

  aaa
 bbb
<p>aaa
bbb</p>

Lines after the first may be indented any amount, since indented code blocks cannot interrupt paragraphs. Example 223

aaa
             bbb
                                       ccc
<p>aaa
bbb
ccc</p>

However, the first line may be preceded by up to three spaces of indentation. Four spaces of indentation is too many: Example 224

   aaa
bbb
<p>aaa
bbb</p>

Example 225

    aaa
bbb
<pre><code>aaa
</code></pre>
<p>bbb</p>

Final spaces or tabs are stripped before inline parsing, so a paragraph that ends with two or more spaces will not end with a hard line break: Example 226

aaa     
bbb     
<p>aaa<br />
bbb</p>

The specification doesn't mention non-ASCII Unicode whitespaces, no-break space, form feed, or vertical tab, but some implementations treats them like ASCII space and tab, and remove trailing or leading ones in paragraphs.

cmark seems to remove at least U+00A0 NBSP, (but not U+3000 or U+2003) preserves U+00A0 too

echo "`u{a0}a`u{a0}" | ./cmark | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 3c 70 3e c2 a0 61 c2 a0 ┊ 3c 2f 70 3e 0a          │<p>××a××┊</p>_   │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

However cmark removes trailing Form Feeds and Vertical Tabs.

PS> echo "guard`n`n`vVT`v`v`n`fFF`f`f`nend"  | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ef bb bf 67 75 61 72 64 ┊ 0a 0a 0b 56 54 0b 0b 0a │×××guard┊__•VT••_│
│00000010│ 0c 46 46 0c 0c 0a 65 6e ┊ 64 0d 0a                │_FF___en┊d__     │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
PS> echo "guard`n`n`vVT`v`v`n`fFF`f`f`nend"  | ./cmark | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 3c 70 3e 67 75 61 72 64 ┊ 3c 2f 70 3e 0a 3c 70 3e │<p>guard┊</p>_<p>│
│00000010│ 0b 56 54 0a 0c 46 46 0a ┊ 65 6e 64 3c 2f 70 3e 0a │•VT__FF_┊end</p>_│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

commonmark.js removes all characters trimmed by trim() of JS.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar

micromark doesn't touch other than ASCII space (U+0020) or tab (U+0009).

Other implementations:

https://babelmark.github.io/?text=%0BVertical+Tab+(U%2B000B)%0B%0A%0A%0CLine+Separator+(U%2B000C)%0C%0A%0A+Space+(U%2B0020)+%0A%0A%09Tab+(U%2B0009)%09%0A%0A%C2%A0NBSP+(U%2B00A0)%C2%A0%0A%0A%E2%80%83EM+Space+(U%2B2003)%E2%80%83%0A%0A%E2%80%A8Line+Separator+(U%2B2028)%E2%80%A8%0A%0A%E2%80%A9Paragraph+Separator+(U%2B2029)%E2%80%A9%0A%0A%E3%80%80Chinese%2FJapanese+Space+(U%2B3000)%E3%80%80%0A

wooorm commented 2 months ago

IMO commonmarkjs / cmark are wrong here. Spec is super clear about this.

tats-u commented 2 months ago

IMO commonmarkjs / cmark are wrong here.

I agree with you.

super clear

Does your opinion come from the definition of space and tab? To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions. I think it would be clearer and stop making library authors worry about extra concerns (i.e. Unicode whitespace handling) if every "space" and "tab" are linked.

Leading [space]s or [tab]s are skipped:

[space]: #space
[tab]: #tab

I wish new additional test cases that Unicode white spaces should be preserved would be added. The current examples and test cases use only ASCII space. Also, I think the spec should advise library authors to confirm the strict specification (what kind of characters is removed) of .trim() method in the language library and not to use them (JS & .NET remove Unicode ones and Java removes VT & FF) in this context. I think it is a common pitfall.

wooorm commented 2 months ago

I wish new additional test cases that Unicode white spaces should be preserved would be added. The current examples and test cases use only ASCII space.

It was not clear to me what this issue was about. Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?

To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions.

I don’t see it. The spec is very clear about characters: https://spec.commonmark.org/0.31.2/#characters-and-lines.

Specs are always a bit technical. They must be interpreted strictly.

tats-u commented 2 months ago

Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?

I want an example like the following to be available as an officially-provided example or test case:

https://github.com/commonmark/commonmark.js/pull/289/files#diff-76b8b4c10de6a516d009bda44d45cd15db413db5eb2bbbd7f3376d3ad8ee6b8c

The spec is very clear about characters

The syntax itself is so. It might be due to the word choices ("space" and "tab", both are very common words) that library authors have incorrectly implemented the non-Unicode whitespace handling. (or due to the lack of understanding of .trim() methods)

jgm commented 1 month ago

Agreed, there should be a test case with, say, NBSP at the beginning and end of a paragraph, showing that it is not stripped.