commonmark / cmark

CommonMark parsing and rendering library and program in C

Line breaking in a code span can break the block structure #323

Open egrimley opened 4 years ago

egrimley commented 4 years ago

In CommonMark rendering, line breaking in a code span can break the block structure. Here are just a few examples:

$ printf '`xx --- xx`' | cmark -t commonmark --width 3 | cmark -t html
<h2>`xx</h2>
<p>xx`</p>
$ printf '`xx ``` xx`' | cmark -t commonmark --width 3 | cmark -t html
<p>`xx</p>
<pre><code>xx`
</code></pre>
$ printf '`xx + x xx`' | cmark -t commonmark --width 3 | cmark -t html
<p>`xx</p>
<ul>
<li>x
xx`</li>
</ul>

In case anyone wants to discuss, here is my attempt to specify how I think CommonMark rendering should work:

Line breaking is only implemented inside a paragraph, not inside a heading or an HTML block.

Except inside raw HTML (which I've not thought about), any group of consecutive whitespace characters will turn into a single space or a newline followed by the current line prefix.

Break each line at the last breakable space that doesn't make the line too long. So a line will be wider than the specified width only if it contains no breakable spaces.

In code spans a space is breakable if and only if the following group of non-space characters does not match any of the following Perl regular expressions:

/^[#*=_-]+$/      thematic break, heading
/^(```|~~~)/      fenced code block
/^[<>]/           HTML block, block quote
/^[*+-]$/         list item
/^[0-9]+[).]$/    list item
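
As a rough illustration of the code-span rule above, here is a minimal C sketch. The function name `space_is_breakable` is hypothetical and not part of cmark. It takes the text that follows a candidate space and assumes the first character class is meant as the literal set `#`, `*`, `-`, `=`, `_`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of cmark.
 * `next` points at the text that follows a candidate space in a code span.
 * Returns false if the following group of non-space characters could start
 * a block construct when it ends up at the start of a line. */
bool space_is_breakable(const char *next) {
  assert(next != NULL);

  /* The group runs up to the next whitespace or the end of the string. */
  size_t len = strcspn(next, " \t\r\n");
  if (len == 0)
    return true; /* nothing follows, so nothing can be misparsed */

  /* thematic break, heading: the whole group is drawn from # * = _ - */
  if (strspn(next, "#*=_-") == len)
    return false;

  /* fenced code block: the group starts with three backticks or tildes */
  if (strncmp(next, "```", 3) == 0 || strncmp(next, "~~~", 3) == 0)
    return false;

  /* HTML block, block quote */
  if (next[0] == '<' || next[0] == '>')
    return false;

  /* bullet list item: a lone *, + or - */
  if (len == 1 && strchr("*+-", next[0]) != NULL)
    return false;

  /* ordered list item: digits followed by a closing ) or . */
  size_t digits = strspn(next, "0123456789");
  if (digits > 0 && digits + 1 == len &&
      (next[digits] == ')' || next[digits] == '.'))
    return false;

  return true;
}
```

With this check, the space before `---` in the first example would be rejected as a break point, while the space before the final `xx` would still be allowed.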

In textual content all spaces are breakable. A backslash is inserted for the following groups of characters:

/[*<[\\_`]/       escape, code, emphasis, links, images, raw HTML
/[&][#A-Za-z]/    entity or numeric character reference
/^[#*=_-]+$/      thematic break, heading
/^(```|~~~)/      fenced code block
/^[<>]/           HTML block, block quote
/^[*+-]$/         list item
/^[0-9]+[).]$/    list item

In these expressions, ^ means the start of a line (perhaps following a line prefix) and $ means space or the end of the line.

The backslash is inserted in front of each group, except in the last case, where it is inserted in front of the parenthesis or dot.
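
To make the placement rule concrete, here is a small C sketch covering only the line-start expressions (the ones anchored with ^), not the inline ones. The name `line_start_escape_offset` is hypothetical, and the first character class is again assumed to mean the literal set `#`, `*`, `-`, `=`, `_`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of cmark.
 * `line` is the text at the start of an output line, after any line prefix.
 * Returns the offset at which a backslash would be inserted, or -1 if the
 * first group of non-space characters needs no line-start escape. */
int line_start_escape_offset(const char *line) {
  assert(line != NULL);

  size_t len = strcspn(line, " \t\r\n");
  if (len == 0)
    return -1;

  /* thematic break, heading: escape in front of the group */
  if (strspn(line, "#*=_-") == len)
    return 0;

  /* fenced code block */
  if (strncmp(line, "```", 3) == 0 || strncmp(line, "~~~", 3) == 0)
    return 0;

  /* HTML block, block quote */
  if (line[0] == '<' || line[0] == '>')
    return 0;

  /* bullet list item */
  if (len == 1 && strchr("*+-", line[0]) != NULL)
    return 0;

  /* ordered list item: escape the ) or . rather than the digits */
  size_t digits = strspn(line, "0123456789");
  if (digits > 0 && digits + 1 == len &&
      (line[digits] == ')' || line[digits] == '.'))
    return (int)digits;

  return -1;
}
```

For a line starting with `12. foo` this returns 2, so the result would read `12\. foo`. Some groups are also covered by the inline expressions, so a real renderer would have to avoid escaping the same character twice.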

I've almost certainly forgotten something in there.

jgm commented 4 years ago

Line 380 of commonmark.c:

    OUT(cmark_node_get_literal(node), allow_wrap, LITERAL);

If we wanted to be fancier, we could replace this with a loop over characters in the string returned by cmark_node_get_literal(node), and AND allow_wrap with a boolean generated dynamically depending on the following content.

This adds some complexity but might be worthwhile. Maybe you'd like to suggest a PR.

Note that the actual wrapping code is in the generic render.c, which is also used e.g. for man output and hence can't contain any logic specific to commonmark.
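
A rough sketch of the per-character loop described above, kept independent of cmark's internals. Both function names are hypothetical, and the predicate is a deliberately conservative stand-in: it rejects any following group whose first character could begin a block construct. How the resulting flags would be passed to the renderer is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, conservative stand-in predicate (not part of cmark):
 * a group is treated as unsafe at the start of a line if its first
 * character could begin a heading, thematic break, fence, list item,
 * HTML block or block quote. */
bool group_is_safe_at_line_start(const char *group) {
  assert(group != NULL);
  group += strspn(group, " "); /* skip any further spaces */
  if (group[0] == '\0')
    return true; /* strchr would otherwise match the terminator */
  return strchr("#*-=_+<>`~0123456789", group[0]) == NULL;
}

/* Hypothetical helper: for each byte of `literal`, record whether a line
 * break may replace it. `may_break` must have room for strlen(literal)
 * entries. The caller's allow_wrap is ANDed with a per-space check of the
 * content that follows. */
void compute_wrap_flags(const char *literal, bool allow_wrap,
                        bool *may_break) {
  assert(literal != NULL && may_break != NULL);
  for (size_t i = 0; literal[i] != '\0'; i++) {
    may_break[i] = allow_wrap && literal[i] == ' ' &&
                   group_is_safe_at_line_start(literal + i + 1);
  }
}
```

For the literal `xx --- xx`, only the second space is marked as a possible break point. Since the decision is made before anything reaches the wrapping code, the CommonMark-specific logic would stay out of render.c.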