commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org
Other
4.88k stars 317 forks source link

Inline HTML comments at start of line #760

Open dubiousjim opened 9 months ago

dubiousjim commented 9 months ago

According to the current spec, the HTML comment in this source is an inline element:

preceding line
text <!-- comment --> more text
following line

But the line containing the HTML comment in this source is an HTML block element (meeting condition 2 in section 4.6):

preceding line
<!-- comment --> more text
following line

(The same holds if the comment is preceded by 1-3 spaces.) Intuitively, though, one would expect the comment in the second example to still be inline. (I was surprised by how some of my markup was being rendered, and traced the difficulty back to here. It forces me to pay attention in my source text to where a comment occurs on a line, or where line breaks happen in a longer paragraph containing comments.)

Perhaps consider changing the Commonmark spec so that HTML block condition 2 requires instead:

Start condition: line begins with the string <!--. End condition: line contains the string -->, and the first occurrence of --> in the string is at the end of the line.

Perhaps the End condition could allow whitespace to occur after the -->.

jgm commented 9 months ago

I agree that this is surprising behavior, and undesirable (for example, if you rewrap a paragraph you might easily make an inline HTML comment into block HTML).

Your proposal would be one way to go. More radically, one could disallow raw HTML blocks from interrupting paragraphs. (But there may be a reason not to do this?)

dubiousjim commented 9 months ago

Not sure if there are reasons to avoid the more radical solution. I can imagine scenarios where I'd welcome the more radical solution (e.g., I might want an HTML comment to occupy an entire line of a running paragraph, but not to introduce a paragraph break). I can't easily imagine scenarios where I'd regret the more radical solution --- if I wanted a HTML block element I could always just ensure it's surrounded by block boundaries e.g. blank lines.

What if there's a block boundary (e.g. blank line or edge of the document) on one side of a comment, and running text on the other side? Then you count as an inline element with the adjacent paragraph? That'd be the least surprising.

dubiousjim commented 9 months ago

If you do go with (something like) my more conservative proposal, you should be careful not to parse a line whose text is exactly <!--> as an HTML comment.

The commonmark parser currently in pandoc (correctly) treats that as the start of unclosed comment.

$ printf 'abc\n<!-->\ndef' | pandoc -f commonmark -t native
[ Para [ Str "abc" ]
, RawBlock (Format "html") "<!-->\ndef\n\n"
]
wooorm commented 9 months ago

Why? According to HTML that’s a comment.

if I wanted a HTML block element I could always just ensure it's surrounded by block boundaries e.g. blank lines.

In my experience, many markdown authors are kinda messy, and often don’t use blank lines.


I’m not so sure about this. The downside isn’t that bad. With downside, I mean that in a\n<!-- b --> *c*, the asterisks are shown instead of turning into emphasis. Because markdown syntax is readable, people will still understand it’s intended as a an emphasized word. The current alternative solution, wrapping differently, as in placing the \n somewhere else, is fine in my opinion. As that’s also the solution for many other things, such as a > or * or # or whatnot at the start of a line.


There are also others kinds of “HTML” “blocks” than comments in CommonMark. Do you only want to change comments?

dubiousjim commented 9 months ago

Why? According to HTML that’s a comment.

Huh, I didn't realize that. Have just confirmed that Chrome will render <!--> as a closed comment. Not what I expected. That doesn't mean this is what the HTML spec says, but I'll take your word for it.

There are also others kinds of “HTML” “blocks” than comments in CommonMark. Do you only want to change comments?

These are the only HTML blocks where the subtleties of the parsing matter to me. For other raw html, I'd probably always make it a block set off by blank lines. I'd be happy with a good solution to the parsing of comments, that is then extrapolated to the other elements in the way that seems to interested parties to make most sense / be most natural. Maybe that means changing all the elements to have consistent parsing rules. Maybe it means treating comments specially. I don't have an opinion.

dubiousjim commented 9 months ago

I’m not so sure about this. The downside isn’t that bad. With downside, I mean that in a\n<!-- b --> *c*, the asterisks are shown instead of turning into emphasis. Because markdown syntax is readable, people will still understand it’s intended as a an emphasized word.

The kind of issue I had is I'd have source text like this:

* abc
  <!-- def --> ghi
  jkl

And then the ghi\njkl would no longer be parsed as belonging to the list element, but as a new paragraph after the list.

jgm commented 9 months ago

In my experience, many markdown authors are kinda messy, and often don’t use blank lines.

True, but there are block-level elements that we currently don't allow to interrupt paragraphs: notably, ordered lists (other than those beginning with 1.). And it would be relatively natural to insist that HTML blocks begin with a blank line (or container edge), since they generally have to end with a blank line.

The only down side I see is that in things like

foobar
<div>
baz
</div>

the tags would be parsed as inline HTML. This would be a breaking backwards-incompatible change, which of course is a big strike against it. But also easily fixed by inserting a blank line.

The thing I don't like about the more moderate proposal is that it special-cases comments. Why should comments behave differently from other raw HTML?

dubiousjim commented 9 months ago

The thing I don't like about the more moderate proposal is that it special-cases comments. Why should comments behave differently from other raw HTML?

For <pre>, <script> and so on (HTML blocks type 1), it doesn't make sense for them to be inline elements, right?

For comments (blocks type 2), proposal is that:

End condition: line contains the string `-->`, and the first occurrence of this in the string
is at the end of the line (perhaps ignoring trailing whitespace)

We could modify the end condition for blocks type 3, 4, and 5 in the same way.

Blocks type 6 and 7 don't need their end condition modified.

wooorm commented 9 months ago

I think @jgm was thinking more about the concept of interruption, which already exists in CM, in several places but also HTML kind 7. Which does what you want. That’s different from what you are proposing now: that an HTML “production” needs to match an entire line. This new suggestion is more finicky than interruption. It would for example not work nicely with say:

<!--

  xxx

-->.

Currently, markdown will interpret this entirely as raw HTML and emit it as is. The end user will see a dot. Which could be a mistake by the author but not terrible.

With your proposal, markdown parsers will not see any of this as HTML. It will create three paragraphs. And emit all those characters for users to see.

There’s also adjacent comments and more cases I can think of:

<!-- --><!-- -->

And it would be relatively natural to insist that HTML blocks begin with a blank line (or container edge), since they generally have to end with a blank line.

Hmm, not sure. Only the tags do, kind 6 and 7. But that has more to do with them starting at something that looks as XML and otherwise never exit. A blank line is the only thing we can look for. And we want to interleave markdown and HTML. And because tags (in markdown) never contain blank lines.

@dubiousjim

You previously mentioned “And then the ghi\njkl would no longer be parsed as belonging to the list element, but as a new paragraph after the list.”

I don’t think that’s correct. Everything’s working fine: https://spec.commonmark.org/dingus/?text=*%20abc%0A%20%20%3C!--%20def%20--%3E%20ghi%0A%20%20jkl. Can you expand on actual input/output?


I think I’d currently recommend not doing this. HTML in markdown is surprising. There is no way around that. So authors need to learn some rules. The 2 proposals are different, but also have edge cases that need to be learned.

dubiousjim commented 9 months ago

Sorry, I made a mistake. When the second and third lines are indented as in the example I gave:

* abc
  <!-- def --> ghi
  jkl

it does indeed parse correctly. My actual difficulties were because I had omitted the indent from those lines, which works fine without a comment at the start of the line:

$ printf '* abc\nx<!-- def --> ghi\njkl\n' | pandoc -f commonmark -t native
[ BulletList
    [ [ Plain
          [ Str "abc"
          , SoftBreak
          , Str "x"
          , RawInline (Format "html") "<!-- def -->"
          , Space
          , Str "ghi"
          , SoftBreak
          , Str "jkl"
          ]
      ]
    ]
]

But if the comment is at the start of the line (unindented) then it's parsed as an HTML block:

printf '* abc\n<!-- def --> ghi\njkl\n' | pandoc -f commonmark -t native
[ BulletList [ [ Plain [ Str "abc" ] ] ]
, RawBlock (Format "html") "<!-- def --> ghi\n"
, Para [ Str "jkl" ]
]

I guess the simplest fix for this issue would have been to indent the continuation lines. But I didn't realize that until now.

jgm commented 9 months ago

Let's forget lists then...I think that is not an issue.

There's still the following consideration. Suppose you have a document

Foo bar baz <!-- this is
a comment --> bim bop.

And then you reflow the lines to

Foo bar baz
<!-- this is a
comment --> bim
bop

All of a sudden, you have a new paragraph "bop"! (Instead of [Paragraph], you have [Paragraph, RawBlock, Paragraph] -- and "bim" is absorbed in the RawBlock.) I find this confusing and undesirable.

This could be addressed by saying that none of the types of HTML blocks can interrupt paragraphs (thanks for pointing out that we already say this for type 7 HTML blocks, I'd forgotten that). That seems to me a pretty reasonable rule. I suspect there are problems with it that I'm not thinking of.

wooorm commented 9 months ago

Let's forget lists then...I think that is not an issue.

Well, it’s kinda important. Because fixing this issue with interruption doesn’t solve the root problem the user had, with lazy lines. And, not writing indentation, to me is quite similar to folks not using blank lines.

All of a sudden, you have a new paragraph "bop"!

My main thing is that it isn’t that bad. Everything that’s supposed to be shown is still displayed, and everything that’s supposed to be hidden is hidden. Because the “block” comment is also an “inline” comment. Sure there’s some whitespace visually, but it all shows.

And, it’s also a “sudden” thing that so many things in markdown already have too:

Foo # bar
baz.

Foo
# bar
baz.

Adding on, <!-- is a very clear syntax marker. Highly unlikely to occur in “natural” language.

I suspect there are problems with it that I'm not thinking of.

This becomes a problem when a) people expect the current state or don’t think blank lines are that important, and b) people have blank lines in things, breaking out of the paragraph.

1:

aaa
<!--

# bbb

-->

Current:

<p>aaa</p>
<!--

# bbb

-->

Proposed:

<p>aaa\n&lt;!--</p>
<h1>bbb</h1>
<p>--&gt;</p>

2:

aaa
<pre>

[bbb](ccc)

</pre>

Current:

<p>aaa</p>
<pre>

[bbb](ccc)

</pre>

Proposed:

<p>aaa<pre></p>
<p><a href="ccc">bbb</a></p>
</pre>

Interpreted by browser as:

<p>aaa</p><pre><p></p>
<p><a href="ccc">bbb</a></p>
</pre>

3:

aaa
<div>

# bbb

</div>

Current:

<p>aaa</p>
<div>
<h1>bbb</h1>
</div>

Proposed:

<p>aaa<div></p>
<h1>bbb</h1>
</div>

Interpreted by browser as:

<p>aaa</p><div><p></p>
<h1>bbb</h1>
</div>
jgm commented 9 months ago

My main thing is that it isn’t that bad. Everything that’s supposed to be shown is still displayed, and everything that’s supposed to be hidden is hidden. Because the “block” comment is also an “inline” comment. Sure there’s some whitespace visually, but it all shows.

The semantics completely changes, and depending on your output format, "bim" may disappear entirely (since raw HTML isn't passed on to non-HTML formats).

And, it’s also a “sudden” thing that so many things in markdown already have too:

That is true.

I agree that you might find people getting confused with things like 1 and 2, and backwards incompatible changes already have a big strike against them. On the other hand, all you have to learn is that you need a blank line here to avoid these results. That's similar to what people already need to learn about the difference between

<div>

**bar**

</div>

and

<div>
**bar**
</div>
wooorm commented 9 months ago

The semantics completely changes, and depending on your output format, "bim" may disappear entirely (since raw HTML isn't passed on to non-HTML formats).

A paragraph accidentally split up, I wouldn’t call completely changes semantics. Whether HTML is stripped (naïvely) is as far as I know not covered by CommonMark. GitHub and similar projects interpret the HTML and then sanitize it. Tools that target different output formats can also interpret HTML and then map say <u> to underlined if they want or ignore comments.

If that’s important, stuff like this seems more important to solve:

<h1>
  asd
</h1>

^-- Also dropped entirely.

all you have to learn is that you need a blank line here to avoid these results.

To me this is the same as the original root question. “All you have to learn is to indent your list items to put things in list items” and “all you have to learn is that paragraphs are interrupted by blocks”.

And how far to take blank lines? Blank lines before fenced code / headings / thematic breaks / block quotes / other lists / definitions? Common extensions such as GFM footnote definitions/GFM tables/math/directives? The cases where CM currently doesn’t need blank lines are quite limited:

aaa bbb
    ccc.

^-- Weird indent style that some people like, so not seen as indented code.

aaa bbb
2) ccc.

^-- Probably not a numbered list as it doesn’t start with 1, probably just a number in a paragraph because someone is making points.

CM specifically doesn’t need blank lines before like all other things. Because people often don’t write them. I think HTML kind 1 through 6 are similar. To me indeed a) the breaking change likely affecting many existing markdown documents, b) the line between what does (not) need blank lines gets blurrier, c) it’s not that bad, d) there’s a good alternative.