Closed ScottAbbey closed 5 years ago
Thanks, bug confirmed.
The fix was wrong.
Spec says there are two types of link destination:
a sequence of zero or more characters between an opening <
and a
closing >
that contains no spaces, line breaks, or unescaped
<
or >
characters, or
a nonempty sequence of characters that does not include ASCII space or control characters, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses.
The point of this bug report was that <te<st>
should be treated as a type 2 link destination, not as a type 1 destination. But the "fix" makes it a type 1 link destination. Expected behavior is to get a link with URL "%3Cte%3Cst%3E"
.
I reverted the completely mistaken fix. Now we have a failing test.
What we need is a fix for manual_scan_link_url
. In writing this, I previously assumed that we'd be able to tell at the outset whether we were scanning for a type 1 or type 2 URL. But this isn't right, if <te<st>
should be a type 2 URL. Scanning gets a bit tricky if we can't tell till the end whether we have type 1 or type 2, because type 1 doesn't require parentheses to be balanced.
Now I think I understand why I originally disallowed unescaped nested parens! This made scanning much easier.
We could always have two scanners: first try with a type 1 (if <
is the first character), and if that fails, try with type 2. This will mean backtracking, but only on unusual inputs, and only once, so it shouldn't cause performance issues.
That seems quite acceptable! Here's a pretty gross diff that does just that:
diff --git a/src/inlines.c b/src/inlines.c
index e30c2af..0500177 100644
--- a/src/inlines.c
+++ b/src/inlines.c
@@ -859,10 +859,14 @@ static bufsize_t manual_scan_link_url(cmark_chunk *input, bufsize_t offset) {
i += 2;
else if (cmark_isspace(input->data[i]))
return -1;
- else
+ else if (input->data[i] == '<') {
+ i = offset;
+ goto type2;
+ } else
++i;
}
} else {
+type2:
while (i < input->len) {
if (input->data[i] == '\\' &&
i + 1 < input-> len &&
Is this the kind of thing you mean?
+++ Yuki Izumi [Jul 14 17 04:53 ]:
Is this the kind of thing you mean?
Yes, but I wouldn't want to use goto in this way. I think it's cleaner to have two scanners, one for <..> form, the other for others. The main scanner could apply the first scanner, and if it didn't match anything, then apply the second scanner.
Reopening because we don't correctly handle this case:
[hi](<aaa)
which should be a link with URL "%3caaa", but is not parsed as a link at all. There should be a regression test, or better a spec test, for this.
Note the PR #219 also does not fix the following case:
[a](<x>X)
The problem is that if the parsing of the whole link (or ref. def.) with link destination type 1 fails, parsing of whole link (or ref. def.) with the type 2 should be retried. Doing the retry only on the destination itself is not enough.
~BTW, MD4C also fails on this.~ EDIT: MD4C now has this fixed but my proposal below to simplify it still holds.
Although it is fixable, IMHO the question is whether it is worth it. Take this as my vote to ban the retrying completely in the specs. To be more precise, I vote for something as follows (bold text is new):
A link destination consists of either
a sequence of zero or more characters between an opening
<
and a closing>
that contains no spaces, line breaks, or unescaped<
or>
characters, ora nonempty sequence of characters that does not include ASCII space or control characters, does not begin with unescaped
<
, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)
This would allow the parser to decide how to parse by checking the first character of the maybe-link-destination string.
Rationale:
<
are rare so usage of it for type 2 is likely minimal.<
anyway even now if the URL also ends with >
.Note the ambiguity implied in the point 2 is unfixable without change in specification because, with its current wording, <xxxx>
simply fulfills both types of link destinations.
+1 to banning retrying. I can't think of many useful URIs that begin with <
.
This issue should be likely closed, as a result of https://github.com/commonmark/CommonMark/commit/c1e0183685e22282cbf5651970200fe56a712002. That actually bans the retrying as asked for above.
Yes, we can close this now.
The CommonMark spec has this to say about link destinations:
Given the following input:
this implementation outputs:
However, the Javascript reference implementation outputs:
Note the
%3C
characters surroundingte%3Cst
are missing in the output from this implementation. They should be present, as this should be treated as the non-<...>
type of link definition.This also seems to affect inline link destinations.