Open tats-u opened 2 months ago
IMO commonmarkjs / cmark are wrong here. Spec is super clear about this.
> IMO commonmarkjs / cmark are wrong here.
I agree with you.
> super clear
Does your opinion come from the definition of space and tab? To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions. I think it would be clearer, and would stop library authors from worrying about extra concerns (e.g. Unicode whitespace handling), if every "space" and "tab" in the spec were linked to its definition:
Leading [space]s or [tab]s are skipped:
[space]: #space
[tab]: #tab
I wish additional test cases showing that Unicode whitespace must be preserved were added. The current examples and test cases use only ASCII space.
Also, I think the spec should advise library authors to check exactly which characters the .trim() method in their language's standard library removes, and not to use it in this context (JS and .NET remove Unicode whitespace, and Java removes VT and FF). I think it is a common pitfall.
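To make the pitfall concrete, here is a minimal sketch in JavaScript that compares the built-in `trim()` with a helper that strips only U+0020 and U+0009. The helper name `trimSpacesAndTabs` is hypothetical and is not taken from any of the libraries discussed here.

```javascript
// Hypothetical helper (not from any library): strips only the two characters
// the spec calls "space" (U+0020) and "tab" (U+0009) from both ends.
function trimSpacesAndTabs(line) {
  return line.replace(/^[ \t]+|[ \t]+$/g, "");
}

// A line wrapped in ASCII spaces, with NBSP (U+00A0) just inside them.
const line = " \u00A0NBSP\u00A0 ";

// Only the outer ASCII spaces are removed; the NBSPs survive.
const strict = trimSpacesAndTabs(line);

// String.prototype.trim() also removes the NBSPs.
const builtin = line.trim();

console.log(JSON.stringify(strict));  // "\u00A0NBSP\u00A0" (printed with literal NBSPs)
console.log(JSON.stringify(builtin)); // "NBSP"
```

The regular expression is anchored at both ends, so spaces, tabs, and any other whitespace in the middle of the line are left alone.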
> I wish additional test cases showing that Unicode whitespace must be preserved were added. The current examples and test cases use only ASCII space.
It was not clear to me what this issue was about. Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?
> To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions.
I don’t see it. The spec is very clear about characters: https://spec.commonmark.org/0.31.2/#characters-and-lines.
Specs are always a bit technical. They must be interpreted strictly.
> Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?
I want an example like the following to be available as an officially-provided example or test case:
> The spec is very clear about characters

The syntax itself is clear. It might be because of the word choices ("space" and "tab" are both very common words) that library authors have implemented the ASCII-only whitespace handling incorrectly (or because they misunderstand what .trim() methods remove).
Agreed, there should be a test case with, say, NBSP at the beginning and end of a paragraph, showing that it is not stripped.
https://github.com/commonmark/commonmark.js/issues/261 https://github.com/commonmark/commonmark.js/pull/289
https://spec.commonmark.org/0.31.2/#paragraphs
The specification doesn't mention non-ASCII Unicode whitespace, no-break space, form feed, or vertical tab, but some implementations treat them like ASCII space and tab, and remove leading or trailing ones in paragraphs.
cmark preserves U+00A0 NBSP as well as U+3000 and U+2003 (this corrects an earlier claim here that it removed U+00A0). However, cmark removes trailing form feeds and vertical tabs.
commonmark.js removes all characters trimmed by JS's trim(): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar
micromark doesn't touch anything other than ASCII space (U+0020) and tab (U+0009).
Other implementations:
https://babelmark.github.io/?text=%0BVertical+Tab+(U%2B000B)%0B%0A%0A%0CLine+Separator+(U%2B000C)%0C%0A%0A+Space+(U%2B0020)+%0A%0A%09Tab+(U%2B0009)%09%0A%0A%C2%A0NBSP+(U%2B00A0)%C2%A0%0A%0A%E2%80%83EM+Space+(U%2B2003)%E2%80%83%0A%0A%E2%80%A8Line+Separator+(U%2B2028)%E2%80%A8%0A%0A%E2%80%A9Paragraph+Separator+(U%2B2029)%E2%80%A9%0A%0A%E3%80%80Chinese%2FJapanese+Space+(U%2B3000)%E3%80%80%0A
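For reference, a small sketch that checks which of the characters in the Babelmark input above are stripped by JavaScript's `String.prototype.trim()`, next to the two characters the spec actually names. The function names and the `report` structure are illustrative assumptions; the code only exercises the built-in `trim()` and does not call any Markdown library.

```javascript
// Code points taken from the Babelmark test input above.
const candidates = [
  ["U+0009 tab", "\u0009"],
  ["U+000B vertical tab", "\u000B"],
  ["U+000C form feed", "\u000C"],
  ["U+0020 space", "\u0020"],
  ["U+00A0 no-break space", "\u00A0"],
  ["U+2003 em space", "\u2003"],
  ["U+2028 line separator", "\u2028"],
  ["U+2029 paragraph separator", "\u2029"],
  ["U+3000 ideographic space", "\u3000"],
];

// True when String.prototype.trim() strips the character from both ends.
function isRemovedByTrim(ch) {
  return (ch + "x" + ch).trim() === "x";
}

// True only for the two characters the spec calls "space" and "tab".
function isSpaceOrTab(ch) {
  return ch === "\u0020" || ch === "\u0009";
}

const report = candidates.map(([name, ch]) => ({
  name,
  removedByTrim: isRemovedByTrim(ch),
  spaceOrTab: isSpaceOrTab(ch),
}));

console.table(report);
```

Every row where `removedByTrim` is true but `spaceOrTab` is false marks a character that a `trim()`-based implementation would strip even though the spec's wording covers only space and tab.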