commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Mention that link reference definitions are constructed from paragraphs #605

Open wooorm opened 5 years ago

wooorm commented 5 years ago

Problem

According to the spec text:

[
␠␠␠␠# a
␠␠␠␠b
␠␠␠␠]:
␠␠␠␠example.com '
␠␠␠␠line1
␠␠␠␠...
␠␠␠␠'

...is fine: it’s a proper link reference definition. This led me to believe that true streaming, as noted in § Appendix ¶ Phase 1, wouldn’t work: if the last apostrophe weren’t there, we’d need to backtrack. (We’d have to go back to the start, because the opening apostrophe is on the same line as the destination. If it were on its own line, the definition would still be valid, but we’d need to backtrack to parse the title lines again.)

To my surprise, the following is not a link reference definition (note one fewer space before # a):

[
␠␠␠# a
␠␠␠␠b
␠␠␠␠]:
␠␠␠␠example.com '
␠␠␠␠line1
␠␠␠␠...
␠␠␠␠'

...# a is now a heading! Only then did I see that the Appendix contains:

Reference link definitions are detected when a paragraph is closed; the accumulated text lines are parsed to see if they begin with one or more reference link definitions. Any remainder becomes a normal paragraph.
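
To make the quoted strategy concrete, here is a minimal sketch of a paragraph-first block scanner, under heavy assumptions: it only knows blank lines, ATX headings, indented code, and paragraphs, and the `DEFINITION` regex is a simplification rather than the spec grammar. The names `parse_blocks`, `ATX`, and `DEFINITION` are made up for illustration and do not come from the reference implementations.

```python
import re

# Up to 3 spaces of indent, 1-6 '#', then whitespace or end of line (simplified).
ATX = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")

# Simplified link reference definition: [label]: destination 'optional title'
DEFINITION = re.compile(
    r"""\s*\[(?P<label>[^\[\]]+)\]:              # label, may span lines
        \s*(?P<dest>\S+)                          # destination
        (?:\s+(?P<q>['"])(?P<title>.*?)(?P=q))?   # optional quoted title
        [ \t]*(?:\n|$)                            # nothing else on that line
    """,
    re.VERBOSE | re.DOTALL,
)

def parse_blocks(text):
    blocks, definitions = [], {}
    para, code = [], []  # lines of the currently open paragraph / code block

    def close_paragraph():
        # Definitions are only looked for once the paragraph is closed.
        rest = "\n".join(para)
        para.clear()
        while True:
            m = DEFINITION.match(rest)
            if not m or not m.group("label").strip():
                break
            key = " ".join(m.group("label").split()).casefold()
            definitions.setdefault(key, (m.group("dest"), m.group("title")))
            rest = rest[m.end():]
        if rest.strip():  # any remainder becomes a normal paragraph
            blocks.append(("paragraph", rest))

    def close_code():
        if code:
            blocks.append(("code", "\n".join(code)))
            code.clear()

    for line in text.split("\n"):
        atx = ATX.match(line)
        if not line.strip():  # blank line closes everything (toy rule)
            close_paragraph()
            close_code()
        elif atx:  # an ATX heading can interrupt a paragraph
            close_paragraph()
            close_code()
            blocks.append(("heading", atx.group(2) or ""))
        elif not para and line.startswith("    "):
            code.append(line[4:])  # indented code cannot interrupt a paragraph
        else:
            close_code()
            para.append(line.strip())
    close_paragraph()
    close_code()
    return blocks, definitions

tail = "\n    b\n    ]:\n    example.com '\n    line1\n    ...\n    '"
four_spaces = parse_blocks("[\n    # a" + tail)
three_spaces = parse_blocks("[\n   # a" + tail)
print(four_spaces)
print(three_spaces)
```

With four spaces, every line is a paragraph continuation, so the whole input is consumed as one definition. With three spaces, `# a` qualifies as an ATX heading, closes the paragraph after `[`, and the remaining indented lines end up as code in this sketch.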

Solution

I think it’s good to mention in the main text that link reference definitions are created from paragraphs, and to include a test for it. I’m not entirely sure how to describe this, though. This would also help prevent the blank lines that are currently possible in labels (GH-586).

Extra

As paragraph lines are made into actual paragraphs and definitions, setext heading lines come into play, so relating to GH-395, I think the following may also be interesting to expand upon:

[foo]: /url
'alpha
=
bravo'

[foo]

Dingus:

<h1>'alpha</h1>
<p>bravo'</p>
<p><a href="/url">foo</a></p>
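
One way to read that output, sketched under assumptions: when the `=` underline arrives, the accumulated paragraph lines are first stripped of any leading definitions, and only the remainder becomes the heading. The function name `close_on_underline` and the simplified `DEFINITION` regex are hypothetical and only illustrate the idea.

```python
import re

# Simplified link reference definition: [label]: destination 'optional title'
DEFINITION = re.compile(
    r"""\s*\[(?P<label>[^\[\]]+)\]:              # label
        \s*(?P<dest>\S+)                          # destination
        (?:\s+(?P<q>['"])(?P<title>.*?)(?P=q))?   # optional quoted title
        [ \t]*(?:\n|$)                            # nothing else on that line
    """,
    re.VERBOSE | re.DOTALL,
)

def close_on_underline(paragraph_lines, underline):
    """Close accumulated paragraph lines when a setext underline is seen."""
    text = "\n".join(line.strip() for line in paragraph_lines)
    definitions = {}
    while True:
        m = DEFINITION.match(text)
        if not m or not m.group("label").strip():
            break
        key = " ".join(m.group("label").split()).casefold()
        definitions.setdefault(key, m.group("dest"))
        text = text[m.end():]
    if text.strip():
        # Something is left over: it becomes the heading content.
        level = 1 if underline.strip().startswith("=") else 2
        return definitions, ("h%d" % level, text)
    # Nothing left: the underline is treated as ordinary paragraph text.
    return definitions, ("paragraph", underline.strip())

defs, block = close_on_underline(["[foo]: /url", "'alpha"], "=")
print(defs, block)
```

Here the unterminated title `'alpha` cannot belong to the definition, so `[foo]: /url` is taken without a title and `'alpha` is left over as the heading text, which matches the `<h1>` in the Dingus output above.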

jgm commented 5 years ago

The reference parser does construct these from paragraphs (similarly setext headers). That's an implementation detail, though. If we didn't care about efficiency, we could simply have a separate block parser for these and backtrack.

wooorm commented 5 years ago

Implementation details should indeed be in the appendix, agreed. But my issue is more that nothing in the spec explains why, to take a perhaps clearer example:

[
# alpha
]: https://example.com

[# alpha][]

Yields a heading.
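
For contrast, here is a rough sketch of the alternative mentioned earlier in the thread: a separate definition parser that looks ahead from the start of a block and falls back if it fails. Everything here is an assumption for illustration (the names `try_definition` and `parse_definition_first`, the simplified regex, and the rule that a definition is only attempted at a block start), not the behaviour of any existing implementation.

```python
import re

# Simplified link reference definition, allowing up to 3 spaces of indent.
DEFINITION = re.compile(
    r"""[ ]{0,3}\[(?P<label>[^\[\]]+)\]:         # label, may span lines
        \s*(?P<dest>\S+)                          # destination
        (?:\s+(?P<q>['"])(?P<title>.*?)(?P=q))?   # optional quoted title
        [ \t]*(?:\n|$)                            # nothing else on that line
    """,
    re.VERBOSE | re.DOTALL,
)

def try_definition(lines, i):
    """Look ahead from line i; return (key, dest, lines_consumed) or None."""
    candidate = []
    for line in lines[i:]:
        if not line.strip():  # a blank line ends the look-ahead window
            break
        candidate.append(line)
    if not candidate:
        return None
    text = "\n".join(candidate)
    m = DEFINITION.match(text)
    if not m or not m.group("label").strip():
        return None  # fall back: the caller handles these lines normally
    consumed = len(candidate) if m.end() == len(text) else text[:m.end()].count("\n")
    key = " ".join(m.group("label").split()).casefold()
    return key, m.group("dest"), consumed

def parse_definition_first(text):
    lines = text.split("\n")
    definitions, rest = {}, []
    i, block_start = 0, True
    while i < len(lines):
        hit = try_definition(lines, i) if block_start else None
        if hit:
            key, dest, consumed = hit
            definitions.setdefault(key, dest)
            i += consumed  # still at a block start after a definition
            continue
        rest.append(lines[i])  # left for the other block rules
        block_start = not lines[i].strip()
        i += 1
    return definitions, rest

source = "[\n# alpha\n]: https://example.com\n\n[# alpha][]"
definitions, rest = parse_definition_first(source)
print(definitions, rest)
```

Under this definition-first reading, `# alpha` is never offered to the heading rule, because the look-ahead succeeds before any other block rule runs. That is the opposite outcome from the paragraph-based approach, which is why the choice of strategy is visible in the output.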

jgm commented 5 years ago

Yes, I agree that more needs to be said about reference link definitions. I'm just not sure talking about "paragraphs" is the best way to do it.

wooorm commented 5 years ago

I can’t see an easy solution.

One way would be to use “interrupting content” instead of “interrupting paragraphs”:

An indented code block cannot interrupt a content line [was: a paragraph]. (This allows hanging indents and the like.)

ATX headings need not be separated from surrounding content by blank lines, and they can interrupt content lines [was: paragraphs]:

...and then both definition “lines” and paragraphs fall into that category? 🤔

jgm commented 5 years ago

Another alternative would be just to say "interrupt a paragraph or a link reference definition."

wooorm commented 5 years ago

Yeah, maybe that’s good! I’m not so sure about the word “paragraph”, though: setext headings are made from that construct, but as they are headings, they aren’t really paragraphs.

jgm commented 5 years ago

> setext headings are made from that construct

That's just how they're handled in the reference implementation (for parsing efficiency). As far as the spec goes, they have nothing to do with paragraphs.

wooorm commented 4 years ago

Another point of confusion for me: I don’t understand the interplay between paragraphs, setext headings, and definitions.

E.g.:

[a]: b
    content?

a
=
    content?

Yields:

<p>content?</p>
<h1>a</h1>
<pre><code>content?
</code></pre>

What gives that there can be code after a setext heading, but not a definition? I was expecting both content?s to be paragraphs.

vassudanagunta commented 4 years ago

It seems to me the discussion above assumes that the CommonMark.js / Dingus behavior is the spec, and thus that the spec needs to be updated to conform to that behavior. I would suggest that this is the wrong way to look at it (with the one exception of backward compatibility, which should be maintained since that is a CommonMark spec goal).

For example, I'm working on an implementation of the CommonMark spec. It passes all the tests, yet does NOT treat the # alpha in @wooorm's example as a heading. It interprets it as the label of a link ref def.

> As far as the spec goes, they have nothing to do with paragraphs.

> The reference parser does construct these from paragraphs (similarly setext headers). That's an implementation detail, though. If we didn't care about efficiency, we could simply have a separate block parser for these and backtrack.

This is what my implementation does.

Given Markdown's principles (reader oriented), to me the way one decides is by asking: What does the following look like to most readers?

[
# alpha
]: https://example.com

[# alpha][]

Though at the end of the day, it's an unimportant corner case. If the author of the above Markdown cared about the reader, they would not write something so unnecessary! The line breaks serve no purpose.

But also, by the same token, any inefficiency resulting from rules that would require backtracking (is look-ahead considered backtracking?) would only affect such corner cases.