commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

inline parsing algorithm, spec conflicts with implementation #686

Open rsc opened 3 years ago

rsc commented 3 years ago

The parsing strategy described in https://spec.commonmark.org/0.30/#appendix-a-parsing-strategy does not appear to reflect the actual implementations.

Specifically, the section "An algorithm for parsing nested emphasis and links" would suggest that other kinds of inlines are handled separately, either before or after. But it appears to be impossible for that to be the case, at least for code spans.

  1. If code spans were handled before that algorithm, then [foo](`url`) would not parse as a link, yet it does in the dingus.
  2. If code spans were handled after that algorithm, then `[foo](url)` would not parse as a code span, yet it does in the dingus.

It must therefore be the case that code spans are handled during that algorithm, in some way that is not described. That is, the algorithm conflicts with the implementation.
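The two orderings ruled out above can be checked with a small sketch. This is not how any CommonMark implementation works: the regular expressions `CODE` and `LINK` and the helper `two_pass` are hypothetical simplifications (single-backtick code spans, no nesting, no escapes) used only to show that each fixed ordering gets one of the two inputs wrong.

```python
import re

# Deliberately simplified patterns (assumptions, not the spec's grammar).
CODE = re.compile(r"`[^`]+`")
LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")

def two_pass(text, first, second):
    """Run `first` over the whole text, then `second` only on the leftovers."""
    (name1, pat1), (name2, pat2) = first, second
    found = []
    if pat1.search(text):
        found.append(name1)
    # Whatever the first pass matched is no longer visible to the second pass.
    leftovers = pat1.split(text)
    if any(pat2.search(piece) for piece in leftovers):
        found.append(name2)
    return found

code_first = lambda t: two_pass(t, ("code", CODE), ("link", LINK))
links_first = lambda t: two_pass(t, ("link", LINK), ("code", CODE))

print(code_first("[foo](`url`)"))    # ['code'], but the dingus shows a link
print(links_first("`[foo](url)`"))   # ['link'], but the dingus shows a code span
```

Under these assumptions, neither ordering reproduces both results reported from the dingus, which is the point of the argument.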


The spec also seems to conflict with the implementation. It says:

Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks.

And yet in the dingus [foo](`url`) is a link, not a code span with text around it.


There is a long discussion about backticks in link parens in #439, but it never reached a conclusion. In any case, the outcome of #439 is independent of my main point, which is that the implementation and the algorithm+spec are out of sync.

wooorm commented 3 years ago

Specifically, the section "An algorithm for parsing nested emphasis and links" would suggest that other kinds of inlines are handled separately, either before or after.

I think it’s at the same time (as you later also conclude).

Things are parsed left to right, and the first one wins, which is why your first one is a link whereas the second is code. Most things are parsed as one thing, whereas links/images/emphasis/strong are parsed separately. The link end is only parsed if there is an opening for it. That’s why foo](`url`) does turn into code. So, for case one, [ is a potential opening, and that’s why ](`url`) is parsed as a closing for that. For case 2, the ` starts code and runs till the end.
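A toy scanner makes the "left to right, first one wins" behavior concrete. It is only a sketch under strong assumptions: `scan` and `merge` are hypothetical names, code spans use a single backtick, link destinations run to the next `)`, and emphasis, escapes, HTML, and nested links are ignored. The link node keeps only the destination.

```python
def merge(nodes):
    """Turn unmatched "[" openers back into text and join adjacent text nodes."""
    merged = []
    for kind, value in nodes:
        if kind == "open":
            kind = "text"
        if kind == "text" and merged and merged[-1][0] == "text":
            merged[-1] = ("text", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged

def scan(text):
    out = []       # nodes in document order: (kind, value)
    openers = []   # positions in `out` of "[" nodes still waiting for a closer
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`":
            end = text.find("`", i + 1)
            if end != -1:                  # leaf inline: consumed immediately
                out.append(("code", text[i + 1:end]))
                i = end + 1
                continue
        elif ch == "[":
            openers.append(len(out))       # only a *potential* opening
            out.append(("open", "["))
            i += 1
            continue
        elif ch == "]" and openers and text.startswith("(", i + 1):
            close = text.find(")", i + 2)
            if close != -1:                # closer is read raw, backticks included
                start = openers.pop()
                del out[start:]            # drop the "[" node and the label nodes
                out.append(("link", text[i + 2:close]))
                i = close + 1
                continue
        out.append(("text", ch))
        i += 1
    return merge(out)

print(scan("[foo](`url`)"))   # [('link', '`url`')]
print(scan("`[foo](url)`"))   # [('code', '[foo](url)')]
print(scan("foo](`url`)"))    # [('text', 'foo]('), ('code', 'url'), ('text', ')')]
```

The third input shows the rule that a link end is only considered when an opener is pending: without a `[`, the `](` is plain text and the backticks get their turn.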

I think this fact of “at the same time” is implied by the phrase “When we’re parsing inlines and we hit either ...”. I interpret that as saying, well, when we hit other inlines, we do other stuff, but this section is about emphasis/strong/links/images.

I believe I’ve also suggested in the past to make the Appendices more factual (e.g., https://github.com/commonmark/commonmark-spec/issues/605#issue-492239437), but the response there probably also applies to your case.


I do think you’re right on “Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks.” being an unneeded sentence. I think “precedence” is a legacy term that markdown used to have, but not anymore?

jgm commented 3 years ago

Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks.

Yes, I agree that this seems to conflict with current behavior.

rsc commented 3 years ago

I think I understand the algorithm now. I would suggest that section 6 start with a bit more elaboration of the left-to-right sentence, something like:


Inlines, like blocks, can be categorized into leaves (code spans, autolinks, raw HTML, line breaks) and containers (emphasis, links, and images, which can themselves contain other inlines). Inlines are parsed sequentially from the beginning of the character stream to the end (left to right, in left-to-right languages) into leaf inlines, container opening markers, and container closing markers. Leaf inlines such as code spans and HTML tags are converted immediately. For example:

\<example 327>
new example: <a x="\`">` (an html tag followed by `)
new example: `<a x="\`">` (a code span followed by ">`)

Container inlines are more subtle, because an opening token such as [ or * only has an effect when paired with an appropriate closing token, with other inlines in between. Leaf inlines such as code spans and HTML tags may consume what might otherwise appear to be a closing marker. For example:

\<example 523> \<example 524> \<example 525>

On the other hand, a closing marker like ](...), once begun, is unaffected by what might appear to be other tokens inside it. For example:

new example: [foo](/?x=`)` (a link followed by `)

Opening and closing link and image brackets are matched during the initial parsing of the character stream, while emphasis markers are matched and applied in a second pass. The effect is that links take priority over emphasis:

\<example 520> \<example 521>


I realize this is a little bit more operational than most of the spec. On the other hand I don't see a way to be precise enough to avoid confusion without being this operational. It seems fundamental that there's a single token parsing pass that operates left-to-right pulling out leaf inlines and matching link/image open/close markers, all in one go, and then a second pass to apply emphasis. That's not an optimization: you have to do all of that in a single pass to get the right answer. The behavior can't be described by relative precedence. It's not true that code spans are higher precedence than links or HTML tags, or vice versa. It's only true that code spans, HTML tags, link opens, link closes, etc are all tokenized left-to-right at the same time, with links matched during that left-to-right scan and emphasis applied in a second pass.
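The two-pass split described above (links matched first, emphasis applied afterwards) can be sketched as follows. This is an illustration only: `first_pass`, `second_pass`, and both regular expressions are hypothetical simplifications, and the sample inputs are chosen for illustration rather than quoted from the spec. The second pass looks for emphasis inside each text run or link label separately, so an asterisk on one side of a link boundary cannot pair with one on the other side.

```python
import re

# Simplified patterns (assumptions): no nesting, no escapes, "*" emphasis only.
LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
EMPH = re.compile(r"\*([^*]+)\*")

def first_pass(text):
    """Split text into ("text", s) runs and ("link", label, dest) nodes."""
    nodes, pos = [], 0
    for m in LINK.finditer(text):
        if m.start() > pos:
            nodes.append(("text", text[pos:m.start()]))
        nodes.append(("link", m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        nodes.append(("text", text[pos:]))
    return nodes

def second_pass(nodes):
    """Find emphasis within each text run or link label, never across nodes."""
    return [EMPH.findall(node[1]) for node in nodes]

for sample in ["*[foo*](/uri)", "[foo *bar](baz*)", "*foo* [bar](/uri)"]:
    nodes = first_pass(sample)
    print(nodes, second_pass(nodes))
```

In the first two samples the asterisks end up in different nodes (or inside the raw destination), so no emphasis is found; in the third, both asterisks sit in the same text run and emphasis applies.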

jgm commented 3 years ago

Thanks for this. I agree that adding some text like this would be right, assuming we want to keep the current parsing strategy and throw out or qualify the statement about precedence in the spec.

But is it worth exploring the other option: taking the statement about precedence seriously and modifying the parsing algorithm in the implementations?

What sorts of examples would this affect, other than

[hello](`foo`)

wooorm commented 3 years ago

@jgm I'm not sure I understand, are you suggesting to turn that into code? 🤔

jgm commented 3 years ago

Well yes, we could parse [hello](`foo`) as TEXT "[hello](" CODE "foo" TEXT ")"

wooorm commented 3 years ago

And what about asterisks, underscores? Tildes for gfm? What about autolinks, which can either be a link destination or autolinks on their own? (Even more complex with HTML?)

My initial reaction was: eek, that's probably breaking a lot of markdown, but maybe I'm not seeing the consequences correctly.

jgm commented 3 years ago

This would just be respecting the statement about precedence for inline code. Thus, it would not affect asterisks and underscores. Auto-links are more difficult; we certainly can't go back on letting [text](<http://example.com>) count as a link. Then there are cases like

[hi](<a href="x">)</a>

wooorm commented 3 years ago

Given that every parser (on babelmark3) sees that as a link, rather than honoring the “precedence”, I personally feel changing one sentence in a spec to reflect reality is the easiest, compared to trying to change every parser. Personally [hello](`foo`) looks more like a link than code. @rsc, as the OP, what do you think about these cases and their most sensible output?


One more complex example: [text](<svg:rect>); the <svg:rect> is both valid “HTML”, a valid autolink, and a valid destination for the link. (similar cases: <xml:lang/> and <svg:circle{...props}>)

rsc commented 3 years ago

The current left-to-right scan that takes things one token at a time seems to me the only viable possibility, given that different syntaxes can end up breaking up into different token sequences. It's not like an arithmetic expression grammar where you can tokenize in one pass, without looking at the operators, and then apply operator precedence in a different pass. That is, these are not really operators in the same way. So I don't think it's tenable to say that code spans really have higher precedence than link-closing tokens. Precedence is just the wrong concept.

Changing the rule would make this a code span, not a link:

[hello](foo "`quotes'")`

Yet the current spec is clear that this is an HTML tag, not a code span:

x <a foo="`quotes'">`

The fact that the latter is not a code span means that you can't split out the code spans in a separate pass. The code span extractor would have to be at least aware of which backquotes were inside HTML tags, and that means doing the full HTML tag parser. And then probably autolinks too in that pass.
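The dependency described above can be illustrated with a sketch that compares a naive code-span pre-pass with a left-to-right scan that also recognizes HTML tags. Everything here is an assumption made for illustration: `TAG` is a much looser pattern than the spec's raw HTML rules, `CODE` handles only single-backtick spans, and `naive_code_spans` and `left_to_right` are hypothetical names.

```python
import re

# Simplified patterns (assumptions, not the spec's definitions).
CODE = re.compile(r"`[^`]+`")
ATTR_VALUE = r"""(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+)"""
ATTR = r"(?:\s+[A-Za-z_:][\w.:-]*(?:\s*=\s*" + ATTR_VALUE + r")?)"
TAG = re.compile(r"<[A-Za-z][A-Za-z0-9-]*" + ATTR + r"*\s*/?>")

def naive_code_spans(text):
    """Separate pre-pass: pair up backticks without looking at anything else."""
    return CODE.findall(text)

def left_to_right(text):
    """Single scan: at each position, whichever construct starts there wins."""
    found, i = [], 0
    while i < len(text):
        for kind, pattern in (("html", TAG), ("code", CODE)):
            m = pattern.match(text, i)
            if m:
                found.append((kind, m.group()))
                i = m.end()
                break
        else:
            i += 1  # plain character, move on
    return found

sample = 'x <a foo="`quotes\'">`'
print(naive_code_spans(sample))  # the pre-pass pairs the two backticks
print(left_to_right(sample))     # the tag is consumed first; one stray backtick remains
```

The pre-pass reports a code span that cuts through the attribute value, while the single scan consumes the tag first and leaves the final backtick unmatched, which matches the behavior the comment attributes to the current spec.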

It seems much clearer and cleaner to do it all in one pass, with only strong/emphasis application left to a second pass, instead of trying to build a sequence of passes that would almost certainly step on each other's feet.

The sentence

Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks.

is almost true. It could be made true by saying

Code span backticks have higher precedence than any other inline constructs except HTML tags, autolinks, and link closing ] sequences.

but at that point there's nothing with lower precedence that might conflict, so the sentence is essentially vacuous. And the complex case at the end of @wooorm's comment just above this one is another place where notions of precedence would get very tricky.

The left-to-right in the first sentence of section 6 seems to be the key, both to the current implementations and to a simple mental model of what happens.

jgm commented 3 years ago

but at that point there's nothing with lower precedence that might conflict, so the sentence is essentially vacuous.

I'm not sure I understand.

This is not *`emphasized*`.

Here you see that code takes precedence over emphasis markers.

This is not [`a link]`.

[`a link]: http://example.com

Here you see that code takes precedence over link markers. The only exception is what happens in inline links, where the ](...) part ignores what would otherwise be raw HTML or code.

rsc commented 3 years ago

You are right about the emphasis being lower precedence than code spans - I was not entirely precise about that. It remains true that emphasis is lower precedence than the others. My point is that the others are essentially all equal precedence.

Here you see that code takes precedence over link markers.

That's not how I see this example.

It is not that code takes precedence over link markers. Instead, it is that code and link closing markers have the same precedence and the left-to-right rule is what disambiguates. Either code or link closing can "ignore" the other. The one that is leftmost wins.

This is [a link](`x)`.

This is [a code span: `](x`).

It is important to view link framing as two different markers - the [ and the ]... - and not just one, so that leftmost is evaluated correctly when a ]... is involved.
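A rough way to see why the closer has to count as its own marker is to compare start positions in the two examples. The helper `leftmost_winner` below is hypothetical and assumes the input contains a `[`, a backtick, and a `](` that would each complete into a valid construct; it does no real parsing.

```python
def leftmost_winner(text):
    """Compare where the first backtick and the first "](" closer begin."""
    opener = text.find("[")
    tick = text.find("`")
    closer = text.find("](")
    winner = "code" if tick < closer else "link"
    return opener, tick, closer, winner

print(leftmost_winner("This is [a link](`x)`."))
# (8, 17, 15, 'link'): the closer starts before the backtick
print(leftmost_winner("This is [a code span: `](x`)."))
# (8, 22, 23, 'code'): the backtick starts before the closer
```

The `[` sits at the same offset in both inputs, so treating the whole link as one marker that starts at `[` could not tell the two cases apart; comparing the backtick against the `](` closer does.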

jgm commented 3 years ago

Either code or link closing can "ignore" the other.

This is correct about the link closers, yes (either the ](..) in inline links or the ][..] in explicit reference links). But not about link openers, or the closing bracket in a link label in a reference link.

% cmark
[a link`][]`

[a link`]: foo
^D
<p>[a link<code>][]</code></p>

This is what I was thinking about with the remark about precedence.

Anyway, I want to agree with you that the remark about precedence is not quite right. However, there is something in that vicinity that is not vacuous that needs to be said.

rsc commented 3 years ago

I see what you mean. Perhaps the statement to make would be:

Code spans, HTML tags, autolinks, link labels, link destinations, and link titles all have equal precedence and are disambiguated by parsing the input left-to-right.