commonmark / cmark

CommonMark parsing and rendering library and program in C

Unescaped left angle brackets in link destination #193

Closed: ScottAbbey closed this issue 5 years ago

ScottAbbey commented 7 years ago

The CommonMark spec has this to say about link destinations:

A link destination consists of either

a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

Given the following input:

[a]

[a]: <te<st>

this implementation outputs:

<p><a href="te%3Cst">a</a></p>

However, the JavaScript reference implementation outputs:

<p><a href="%3Cte%3Cst%3E">a</a></p>

Note that the leading %3C and trailing %3E around te%3Cst are missing from this implementation's output. They should be present, because this destination should be treated as the non-<...> type of link destination.

This also seems to affect inline link destinations.
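
For reference, the report is easy to reproduce from the command line with the cmark binary built from this repository (the output shown is the buggy one quoted above; the reference implementation produces <p><a href="%3Cte%3Cst%3E">a</a></p> instead):

$ printf '[a]\n\n[a]: <te<st>\n' | cmark
<p><a href="te%3Cst">a</a></p>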

jgm commented 7 years ago

Thanks, bug confirmed.

jgm commented 7 years ago

The fix was wrong.

Spec says there are two types of link destination:

  1. a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

  2. a nonempty sequence of characters that does not include ASCII space or control characters, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses.

The point of this bug report was that <te<st> should be treated as a type 2 link destination, not as a type 1 destination. But the "fix" makes it a type 1 link destination. Expected behavior is to get a link with URL "%3Cte%3Cst%3E".

jgm commented 7 years ago

I reverted the completely mistaken fix. Now we have a failing test.

What we need is a fix for manual_scan_link_url. In writing it, I previously assumed that we'd be able to tell at the outset whether we were scanning a type 1 or a type 2 URL. But that isn't right if <te<st> should be a type 2 URL. Scanning gets tricky if we can't tell until the end whether we have type 1 or type 2, because type 1 doesn't require parentheses to be balanced: <a(b>, for example, is a valid type 1 destination but fails as type 2.

jgm commented 7 years ago

Now I think I understand why I originally disallowed unescaped nested parens! This made scanning much easier.

jgm commented 7 years ago

We could always have two scanners: first try type 1 (if < is the first character), and if that fails, try type 2. This will mean backtracking, but only on unusual inputs, and only once, so it shouldn't cause performance issues.

kivikakk commented 7 years ago

That seems quite acceptable! Here's a pretty gross diff that does just that:

diff --git a/src/inlines.c b/src/inlines.c
index e30c2af..0500177 100644
--- a/src/inlines.c
+++ b/src/inlines.c
@@ -859,10 +859,14 @@ static bufsize_t manual_scan_link_url(cmark_chunk *input, bufsize_t offset) {
         i += 2;
       else if (cmark_isspace(input->data[i]))
         return -1;
-      else
+      else if (input->data[i] == '<') {
+        i = offset;
+        goto type2;
+      } else
         ++i;
     }
   } else {
+type2:
     while (i < input->len) {
       if (input->data[i] == '\\' &&
         i + 1 < input->len &&

Is this the kind of thing you mean?

jgm commented 7 years ago

+++ Yuki Izumi [Jul 14 17 04:53]:

> Is this the kind of thing you mean?

Yes, but I wouldn't want to use goto in this way. I think it's cleaner to have two scanners: one for the <..> form, the other for everything else. The main scanner could apply the first scanner and, if it didn't match anything, then apply the second scanner.
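
For illustration, here is a minimal self-contained sketch of that two-scanner shape. The chunk type and the function names are hypothetical stand-ins, not cmark's actual API; each scanner returns the number of bytes consumed, or -1 on failure:

#include <stddef.h>

typedef struct {
  const unsigned char *data;
  size_t len;
} chunk;

static int is_ws(unsigned char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Type 1: <...> containing no whitespace and no unescaped < or >. */
static long scan_pointy_url(const chunk *in, size_t offset) {
  size_t i = offset;
  if (i >= in->len || in->data[i] != '<')
    return -1;
  for (++i; i < in->len; ++i) {
    unsigned char c = in->data[i];
    if (c == '\\' && i + 1 < in->len)
      ++i;                           /* skip the escaped character */
    else if (c == '>')
      return (long)(i + 1 - offset); /* success: consume through '>' */
    else if (c == '<' || is_ws(c))
      return -1;                     /* unescaped '<' or whitespace */
  }
  return -1;                         /* no closing '>' */
}

/* Type 2: no spaces or control characters; unescaped parentheses
   must be balanced. */
static long scan_bare_url(const chunk *in, size_t offset) {
  size_t i = offset;
  int parens = 0;                    /* depth of unescaped '(' */
  while (i < in->len) {
    unsigned char c = in->data[i];
    if (c == '\\' && i + 1 < in->len) {
      i += 2;
    } else if (c == '(') {
      ++parens; ++i;
    } else if (c == ')') {
      if (parens == 0) break;        /* this ')' ends the link, not the URL */
      --parens; ++i;
    } else if (is_ws(c) || c < 0x20) {
      break;
    } else {
      ++i;
    }
  }
  if (parens != 0 || i == offset)
    return -1;                       /* unbalanced parens or empty URL */
  return (long)(i - offset);
}

/* Main scanner: try type 1 first; on failure, retry as type 2. */
static long scan_link_url(const chunk *in, size_t offset) {
  long n = scan_pointy_url(in, offset);
  return n >= 0 ? n : scan_bare_url(in, offset);
}

On <te<st> the type 1 scan fails at the inner <, and the type 2 retry consumes all seven bytes, which the renderer then percent-encodes to %3Cte%3Cst%3E as expected.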

kivikakk commented 7 years ago

#219 is a PR for a better fix.

jgm commented 7 years ago

Reopening because we don't correctly handle this case:

[hi](<aaa)

which should be a link with URL "%3caaa" but is not parsed as a link at all. There should be a regression test for this, or better, a spec test.
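
Such a spec test would presumably follow the example format used in spec.txt, roughly like this sketch (the expected HTML is taken from the comment above and reflects the spec as it stood at the time; under the later change discussed below, this input stops being a link at all):

```````````````````````````````` example
[hi](<aaa)
.
<p><a href="%3caaa">hi</a></p>
````````````````````````````````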

mity commented 7 years ago

Note that PR #219 also does not fix the following case:

[a](<x>X)

The problem is that if parsing the whole link (or reference definition) with a type 1 link destination fails, parsing of the whole link (or reference definition) should be retried with a type 2 destination. Retrying only the destination itself is not enough: in [a](<x>X), the type 1 scan of the destination succeeds (it matches <x>), and the parse only fails afterwards, when X appears before the closing parenthesis, so a destination-level retry is never triggered.

~BTW, MD4C also fails on this.~ EDIT: MD4C has since fixed this, but my proposal below to simplify the spec still holds.

Although it is fixable, IMHO the question is whether it is worth it. Take this as my vote to ban the retrying in the spec entirely. To be more precise, I vote for something like the following (bold text is new):

A link destination consists of either

  • a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

  • a nonempty sequence of characters that does not include ASCII space or control characters, **does not begin with unescaped <,** and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. **(Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)**

This would allow the parser to decide how to parse by checking the first character of the maybe-link-destination string.
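
For illustration, using the hypothetical helpers from the sketch earlier in the thread, the proposal collapses the main scanner to a single first-character check with no backtracking:

/* Sketch of the proposal: a leading '<' commits the destination
   to type 1; if that scan fails, there is no type 2 retry. */
static long scan_link_url(const chunk *in, size_t offset) {
  if (offset < in->len && in->data[offset] == '<')
    return scan_pointy_url(in, offset);
  return scan_bare_url(in, offset);
}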

Rationale:

  1. Link URLs starting with < are rare, so their use as type 2 destinations is likely minimal.
  2. With such a URL, users may need to escape the < anyway, even now, if the URL also ends with >.
  3. It simplifies the implementation to a level that would make similar regressions unlikely.

Note that the ambiguity implied in point 2 is unfixable without a change to the specification: with its current wording, <xxxx> simply fulfills both types of link destination.

kivikakk commented 6 years ago

+1 to banning retrying. I can't think of many useful URIs that begin with <.

mity commented 5 years ago

This issue can likely be closed as a result of https://github.com/commonmark/CommonMark/commit/c1e0183685e22282cbf5651970200fe56a712002, which actually bans the retrying as proposed above.

jgm commented 5 years ago

Yes, we can close this now.