Closed: puzrin closed this issue 9 years ago.
@puzrin - can you provide some examples of the cases you describe in which an exception is thrown? And also of the non-standard urls for international domain names?
I tried using the url module that comes with core node: url.format(url.parse(...)) instead of encodeURI(unescape(..)). That gave pretty good results. Why do you prefer the non-core module node-url? (Unfortunately, adding this module adds 40K to commonmark.js.)
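For concreteness, here is a minimal sketch of the two approaches being compared (the variable name dest is illustrative, not taken from either codebase):

```js
var url = require('url');  // core Node module; the npm "url" package is the same code

// What commonmark.js did at the time: unescape, then re-encode.
// encodeURI can throw a URIError on malformed input (e.g. lone surrogates).
function normalizeWithEncodeURI(dest) {
  return encodeURI(unescape(dest));
}

// The alternative tried above: round-trip through the url module.
function normalizeWithUrlModule(dest) {
  return url.format(url.parse(dest));
}
```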
Oh, I guess node-url is just the module version of url, so that's probably what I was in fact using with require('url').
https://github.com/markdown-it/markdown-it/blob/master/lib/common/utils.js#L173-L185 - that's the current implementation. It does not wrap encodeURI, and I'm not 100% sure it can't throw an exception on anything other than broken surrogates. Probably we should wrap it too.
By the way, although "just use a library" is attractive advice, there is an added complication here. With the C parser, we're already using a well-tested URL-escaping library devised by GitHub (vmg/houdini), and it gives results that don't always match node-url. (In fact, it gives the results recorded in the spec.txt examples.) In most cases the differences seem fairly arbitrary: for example, foo%20b%C3%A4 (houdini) and foo%20bä (node-url) are both okay, and I don't think it's a mistake to percent-encode things in the query string.
With your example 1, [foo](<%test>), commonmark.js gives me the correct result with no exception.
% js/bin/commonmark
[foo](<%test>)
<p><a href="%25test">foo</a></p>
Interesting, cmark gives me the wrong result and seems to indicate a bug in houdini:
% cmark
[foo](<%test>)
<p><a href="%test">foo</a></p>
Even more interestingly, when I modify commonmark.js to use node-url for normalization, it gives the wrong result (same as cmark). (Am I wrong to think this is the wrong result?)
I wasn't sure how to translate your test case 2 into a problem rendering links with one of my implementations.
I tried using the url module that comes with core node: url.format(url.parse(...)) instead of encodeURI(unescape(..)). That gave pretty good results. Why do you prefer the non-core module node-url?
System's url and npm's url are the same. Probably browserify will bundle the system one without extra effort; if not, adding the dependency will help.
But I haven't decided yet, for another reason: that lib does the encoding only for the domain name. It seems decode/encode still have to be called manually for the path and query (roughly as sketched below). It's a bit strange - I don't believe that nobody has solved the task of URL normalization in node yet. So I postponed further investigation until we finish the entities discussion at the forum.
There is also the high-speed https://github.com/petkaantonov/urlparser, but I'm not sure parse speed matters for our case.
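A rough sketch of that manual work, assuming the url package handles the host but leaves path and query encoding to the caller (the helper name is hypothetical, this is not markdown-it code):

```js
var url = require('url');

function normalizeLink(link) {
  var parsed = url.parse(link);     // handles host parsing (incl. punycode for IDN)
  // Path and query still need explicit percent-encoding;
  // note that decodeURI/encodeURI can still throw here on malformed escapes.
  if (parsed.pathname) { parsed.pathname = encodeURI(decodeURI(parsed.pathname)); }
  if (parsed.search)   { parsed.search   = encodeURI(decodeURI(parsed.search)); }
  return url.format(parsed);
}
```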
Even more interestingly, when I modify commonmark.js to use node-url for normalization, it gives the wrong result (same as cmark). (Am I wrong to think this is the wrong result?)
I think they have some fallback modes, because a good lib should not throw an exception on URL parse the way decodeURI/encodeURI do.
I don't remember the details. Try searching Google for "decodeURI throw exception". It seems it can do so on invalid codepoints and on broken UTF-8 sequences (in %xx form). As far as I can see, my test tries to form an invalid codepoint, but a test with a broken sequence doesn't exist yet.
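For example, both failure modes are easy to reproduce with the standard functions:

```js
// Each of these throws URIError ("URI malformed"):
try { decodeURI('foo%C3bar'); } catch (e) { console.log(e.name); } // broken UTF-8 %xx sequence
try { decodeURI('%E0%A4%A');  } catch (e) { console.log(e.name); } // truncated % escape
try { encodeURI('\uD800');    } catch (e) { console.log(e.name); } // invalid codepoint (lone surrogate)
```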
I wasn't sure how to translate your test case 2 into a problem rendering links with one of my implementations.
Astral characters (> 0xFFFF) in JavaScript's UTF-16 are encoded with 2 chars (surrogates). Those MUST match /[\uD800-\uDBFF][\uDC00-\uDFFF]/; any other combination from this range is broken (and should cause an encodeURI exception).
Since the tests are encoded in UTF-8, it should not be a problem to inject a broken surrogate sequence into a URL. If you pass any of these samples to encodeURI, it will throw. Forming invalid bytes in the text samples (integration tests) was not comfortable for me, so I only did unit testing. Not perfect, but it should be enough. I'll improve it sometime later.
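A minimal sketch of that check, replacing anything that is not a valid surrogate pair with U+FFFD before encoding (this mirrors the behaviour described for markdown-it, but it is not its actual source):

```js
function replaceLoneSurrogates(str) {
  // Keep valid high+low pairs, replace any unpaired surrogate with U+FFFD.
  return str.replace(
    /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g,
    function (m) { return m.length === 2 ? m : '\uFFFD'; }
  );
}

// encodeURI('foo\uD800bar')                      // would throw URIError
encodeURI(replaceLoneSurrogates('foo\uD800bar')); // => "foo%EF%BF%BDbar"
```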
(Unfortunately, adding this module adds 40K to commonmark.js.)
Just a note: never count source size; only minified + gzipped makes sense. It's 4.7K for url. Quite big for such a "simple" operation. I don't know of well-tested and forgiving alternatives for JS. Maybe something with regexps, but I'm not sure about the edge cases.
Here are updated examples to nuke the normalizer:
[foo](&#xD800;) - broken UTF-16 surrogate sequence (encodeURI) (can also be achieved by direct charcode injection into the test text file)
[foo](<%test>) - invalid percent-escape in the path (decodeURI)
[foo](%C3) - broken UTF-8 sequence (decodeURI)
http://spec.commonmark.org/dingus.html?text=[foo]%28%26%23xD800%3B%29
Here is the same in markdown-it
The last 2 don't break the dingus only because you use unescape instead of decodeURI. But unescape is completely unsuitable for this task. In markdown-it the last 2 examples will not cause problems, because we have a simple check for invalid codepoints in the entities decoder.
PS. These tests are for the application, not for the spec. Behaviour with invalid URLs is not specified, and different apps can produce different results. It's only required not to crash with an exception.
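That requirement ("don't crash") is easy to state as a smoke test; a sketch assuming a markdown-it instance (the sample list reuses the examples above; it is not an actual test from either suite):

```js
var assert = require('assert');
var md = require('markdown-it')();

// Outputs for invalid URLs may legitimately differ between apps;
// the only hard requirement is that rendering does not throw.
[
  '[foo](\uD800)',   // broken surrogate, injected directly
  '[foo](<%test>)',  // invalid percent-escape
  '[foo](%C3)'       // broken UTF-8 %xx sequence
].forEach(function (src) {
  assert.doesNotThrow(function () { md.render(src); });
});
```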
(Unfortunately, adding this module adds 40K to commonmark.js.)
Just a note: never count by source file size; only minified + gzipped makes sense. For the url package it's ~4.7K. Also not pleasant, but acceptable for a first approach. I think 2 things are important now:
It appears that cmark/houdini don't handle the broken surrogate pairs correctly:
% ./cmark
[link](http://example.com/�Y)
<p><a href="http://example.com/%ED%A1%90Y">link</a></p>
% node
> decodeURI("http://example.com/%ED%A1%90Y");
URIError: URI malformed
It seems to me that there are two options here. First option is to change both cmark and commonmark.js so that broken surrogates are replaced by U+FFFD, as you currently do in markdown-it.
Second option is simply to leave the present behavior.
With either option, we need to catch errors in decodeURI and encodeURI in the JS.
My view is that the second option is best. Even if we took the first option, it wouldn't guarantee that we always had valid URIs, since we'll still fall back to the un-normalized original when we catch other kinds of exceptions. It seems reasonable to me to abide by the principle "garbage in, garbage out."
In effect, the first option is trying to fix one kind of problem in a URI, but it doesn't help with a bunch of other problems. And the "fix" still leaves you with something the user didn't intend. So, for the sake of simplicity, I think it would be best to fall back to the original on any kind of error in decoding and encoding the URI (the simple behavior of current commonmark.js).
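A minimal sketch of that second option (not the exact commonmark.js source): normalize when possible, and fall back to the original string on any decoding/encoding error.

```js
function normalizeURI(uri) {
  try {
    return encodeURI(decodeURI(uri));
  } catch (err) {
    // Malformed escapes, broken surrogates, etc.: garbage in, garbage out.
    return uri;
  }
}
```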
As for whether this should be in the spec: The tests are going to have to have URLs in them, so either we need to specify the normalization algorithm, or not test any URIs that are not already in normal form. I think we do need tests for non-normalized URIs, like [foo](bar>baz), so that counts in favor of saying in the spec how the URIs will be normalized. Why would this be bad?
It appears that cmark/houdini don't handle the broken surrogate pairs correctly:
Those work with UTF-8, not UTF-16. There is no requirement to behave differently in the supplementary range. 0xD800-0xDFFF are normal codes in UTF-8; they are just encoded as is.
As for whether this should be in the spec: The tests are going to have to have URLs in them, so either we need to specify the normalization algorithm, or not test any URIs that are not already in normal form. I think we do need tests for non-normalized URIs, like [foo](bar>baz), so that counts in favor of saying in the spec how the URIs will be normalized. Why would this be bad?
Normalization is for valid URLs, not for broken ones. At least, I don't know of any official RFC or other document that specifies behaviour for broken encoding. I use different behaviour because it's more convenient for my needs. It's an application-level thing, not a spec-level one.
I don't ask to remove normalization tests completely; I ask not to use broken samples with invalid URLs there. There is no need to specify in the spec things which are not specified anywhere else (I mean RFCs, HTML and other specs).
It seems reasonable to me to abide by the principle "garbage in, garbage out."
That's because you don't care about security and leave this problem to an external sanitizer. When links are passed "as is", it becomes more difficult to validate schemes (prohibit javascript & vbscript), because the scheme name can be partially encoded (see the sketch below).
In markdown-it such cases would end up decoded or encoded twice. The first case can be filtered, the second will not work in the browser. The alternative is to write 2 normalizers and do the same job twice. I see no reason for that. Your approach works for your case, but adds unnecessary difficulties in my case with zero added value.
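A rough illustration of the point (hypothetical names, not markdown-it's actual code): the scheme check is only reliable once the link has been decoded into a single canonical form.

```js
var BAD_SCHEME_RE = /^(javascript|vbscript|data):/i;

function isLinkSafe(decodedUrl) {
  // On a partially-encoded link such as "javascript&#58;alert(1)" this test
  // would pass; it only works after entities / escapes have been resolved.
  return !BAD_SCHEME_RE.test(decodedUrl.trim());
}
```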
+++ Vitaly Puzrin [Jan 14 15 00:47 ]:
It seems reasonable to me to abide by the principle "garbage in, garbage out."
That's because you don't care about security and leave this problem to an external sanitizer. When links are passed "as is", it becomes more difficult to analyze prohibited schemes (like javascript & vbscript), because the scheme name can be partially encoded.
In markdown-it such cases would end up decoded or encoded twice. The first case can be filtered, the second will not work in the browser. The alternative is to write 2 normalizers and do the same job twice. I see no reason for that. Your approach works for your case, but adds unnecessary difficulties in my case with zero added value.
So, does your normalization code guarantee that the output is a valid URI? I thought that it couldn't guarantee this, since it wraps decodeURI in a try. If it doesn't guarantee this, then you need to rely on another tool for security anyway. But if it does guarantee this, that's another story.
(I am looking at the code in lib/common/utils.js. Is there also another place where normalization occurs?)
My argument was, essentially: "If you can't guarantee that the output will be a valid URI, then there's little point in fixing some of the errors that can arise."
It's technically impossible to guarantee valid output from invalid data (I mean, valid === what the user needs). And it's not needed. I guarantee that the important parts are properly normalized or cleared/escaped. Technically, the content of my URIs is "valid" (it satisfies the RFC).
(I am looking at the code in lib/common/utils.js. Is there also another place where normalization occurs?)
No, that's the only function; it's called from different places.
My argument was, essentially: "If you can't guarantee that the output will be a valid URI, then there's little point in fixing some of the errors that can arise."
John, IMHO you are mixing the specification level and the application level. That's the root of the problem.
If an app can't guarantee a valid URI, it's the app's own choice what to do. You pass data as is, because that's best for your app (since you don't need to cover security issues). For my app the choice is different. Both are correct, because this doesn't belong to the specification.
+++ Vitaly Puzrin [Jan 14 15 08:47 ]:
It's technically impossible to guarantee valid output from invalid data (I mean, valid === what the user needs). And it's not needed. I guarantee that the important parts are properly normalized or cleared/escaped. Technically, the content of my URIs is "valid" (it satisfies the RFC).
My question is: does your code guarantee that all URIs that pass through the normalization routine satisfy the RFC?
If not, then further security validation is still needed, so why bother with things like bad surrogate code pairs in this code?
If so, I'd like to understand how it does guarantee this, given the way it wraps decodeURI in a try.
My argument was, essentially: "If you can't guarantee that the output will be a valid URI, then there's little point in fixing some of the errors that can arise."
John, IMHO you are mixing the specification level and the application level. That's the root of the problem.
If an app can't guarantee a valid URI, it's the app's own choice what to do. You pass data as is, because that's best for your app (since you don't need to cover security issues). For my app the choice is different. Both are correct, because this doesn't belong to the specification.
Is your opinion that the spec shouldn't say anything about URI normalization? If so, then the spec can't include embedded tests with URIs that aren't already in normalized form.
It seems to me that some such tests are needed, e.g. for entity resolution in URIs. But then the spec needs to say at least something about URI normalization, in order to know what the proper result should be for
[link](/ö)
or even for
[link](/&ouml;)
E.g., should the link be to /%C3%B6, to /&ouml;, or to /ö?
If you're with me so far, then you agree that the spec should say something about URI normalization.
Then the question is just how much should it say? And that's what I'm trying to get clear about. It wouldn't be out of line, I think, for the spec to say what to do when invalid surrogate pairs or percent-encoded bad UTF-8 sequences are given. Maybe it shouldn't, but maybe it should.
Anyway, understanding what your implementation does is important for thinking this through.
I should add: another option would be for the spec to say nothing about URI encoding. With this option, we'd just store the verbatim contents of their link in the AST (after entity resolution), and leave it entirely to the renderers to decide how to escape the URIs.
If the tests were converted to XML, then this would pose no problem for embedded tests.
While the tests remain as HTML, we could deal with the issue by tweaking tests/normalize.py.
This seems an attractive option to me, especially in light of the following consideration. Suppose you're targeting LaTeX in a renderer. An image might then link to a local file. And in that case, you wouldn't want percent-encoding. So that's an argument that all URI encoding issues should be left up to the renderer, not handled in the parser (which is what the spec covers).
If you decide to follow the last option (not describing the encoding) then you will not be able to include HTML outputs in the spec (it would be nice for them to be included). But also note that most implementations will target HTML as their primary output, so developers (like me) will be grateful for these tests and for a description of how to do the encoding.
My question is: does your code guarantee that all URIs that pass through the normalization routine satisfy the RFC?
Yes (except minor things, to be done after switching to node-url).
Is your opinion that the spec shouldn't say anything about URI normalization? If so, then the spec can't include embedded tests with URIs that aren't already in normalized form.
- My opinion is that if the spec talks about normalization/encoding, it should cover only "officially known" cases ("valid" URLs).
- We need to clarify the term "normalization", to understand each other correctly. The issue with broken chars refers to "URI encoding". Normalization is replacing entities, converting the domain name to lower case and so on. Normalization is optional; encoding is mandatory (for HTML output).
E.g., should the link be to /%C3%B6, to /&ouml;, or to /ö?
(1) and (3) are defined in the URL encoding RFC; it's enough to refer to it. (1) is correct (for HTML). (3) is encodable, if you wish to produce 100% correct output; in practice, if you pass it "as is", browsers will understand it.
(2) (/&ouml;) is not defined anywhere, and belongs to the CommonMark spec or the app level, depending on context. Without context it's impossible to define the "proper" behaviour.
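A quick check of option (1) with nothing but standard JS encoding:

```js
encodeURI('/ö');  // => "/%C3%B6" -- option (1), the percent-encoded form
// Option (2), "/&ouml;", only appears after entity resolution, which is a
// CommonMark/application-level step, not something the URL RFCs define.
```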
If you're with me so far, then you agree that the spec should say something about URI normalization.
Yes, of course. My comments pointed out things that should be avoided there. I meant that the spec should speak clearly about valid URLs, and say nothing about invalid ones.
This seems an attractive option to me, especially in light of the following consideration. Suppose you're targeting LaTeX in a renderer. An image might then link to a local file. And in that case, you wouldn't want percent-encoding. So that's an argument that all URI encoding issues should be left up to the renderer, not handled in the parser (which is what the spec covers).
I think we are speaking about different things. The URI can be in the AST as is - not a problem (and our implementation doesn't have an AST at all). Though it's not very clear what the spec says about the AST. Also, since many people expect HTML as the "main use" of markdown, it's good to have the outputs of different parsers be binary-equal, when possible.
+++ Kārlis Gaņģis [Jan 14 15 10:14 ]:
If you decide to follow the last option (not describing the encoding) then you will not be able to include HTML outputs in the spec (which would be nice). But also note that most implementations will target HTML as their primary output, so developers (like me) will be grateful for these tests and for a description of how to do the encoding.
We could still include HTML outputs; we'd just need to adjust the HTML normalization routine we already use (test/normalize.py) so that it normalizes the URIs in links and images.
+++ Vitaly Puzrin [Jan 14 15 10:14 ]:
This seems an attractive option to me, especially in light of the following consideration. Suppose you're targeting LaTeX in a renderer. An image might then link to a local file. And in that case, you wouldn't want percent-encoding. So that's an argument that all URI encoding issues should be left up to the renderer, not handled in the parser (which is what the spec covers).
I think we are speaking about different things. The URI can be in the AST as is - not a problem (and our implementation doesn't have an AST at all). Though it's not very clear what the spec says about the AST.
The spec says:
Since this document describes how Markdown is to be parsed into an abstract syntax tree, it would have made sense to use an abstract representation of the syntax tree instead of HTML. But HTML is capable of representing the structural distinctions we need to make, and the choice of HTML for the tests makes it possible to run the tests against an implementation without writing an abstract syntax tree renderer.
That was intended to make it clear that the spec is concerned only with how the input document is transformed into the AST, and not with how the AST is rendered as HTML. If we switch to using an XML representation of the AST in the spec examples, this will be much clearer.
If the spec should leave the link destination precisely as it is in the source (after backslash-unescaping and entity resolution), then none of these issues about normalization or percent encoding/unencoding belong in the spec.
Also, since many people expect HTML as the "main use" of markdown, it's good to have the outputs of different parsers be binary-equal, when possible.
Yes, we should probably have a supplementary document with recommendations for HTML rendering. (Or perhaps, if suggested HTML renderings were included in the spec along with XML, that would serve this purpose.)
we'd just need to adjust the HTML normalization routine we already use (test/normalize.py) so that it normalizes the URIs in links and images.
Right now I suggest using the minimal possible set of steps:

- Say that "the URL should be unescaped and encoded in the standard way", with a link to the RFC.
- Include simple examples.
- Include partially-encoded examples (with %).
- Include examples with markdown escaping (with \).
- Don't say anything about entity replacement, and don't include such examples in the spec.
- Avoid specific examples with broken encoding sequences.
- Avoid examples with international domains.
- Avoid examples which can be additionally normalized (4 steps).

The rest can be clarified in the next versions, as soon as additional feedback appears from users and developers.
That was intended to make it clear that the spec is concerned only with how the input document is transformed into the AST, and not with how the AST is rendered as HTML.
...
That's a bit confusing. AST usually means data representation via tree-like structures. I expected that the spec defines the markup syntax plus some specific mandatory text transforms like unescaping. Formally, markdown-it does not have an AST and therefore does not satisfy the spec requirements :)
+++ Vitaly Puzrin [Jan 14 15 10:34 ]:
we'd just need to adjust the HTML normalization routine we already use (test/normalize.py) so that it normalizes the URIs in links and images.
Right now I suggest using the minimal possible set of steps:
- Say that "the URL should be unescaped and encoded in the standard way", with a link to the RFC.
- Include simple examples.
- Include partially-encoded examples (with %).
- Include examples with markdown escaping (with \).
- Don't say anything about entity replacement, and don't include such examples in the spec.
- Avoid specific examples with broken encoding sequences.
- Avoid examples with international domains.
- Avoid examples which can be additionally normalized (4 steps).
The rest can be clarified in the next versions, as soon as additional feedback appears from users and developers.
Re (2): the spec already specifies that
"Entities are recognized in any context besides code spans or code blocks, including raw HTML, URLs, link titles, and fenced code block info strings."
and gives a number of examples. This conforms to prior practices with Markdown processors, and shouldn't change.
The paragraph at the beginning of the section (added by vmg with insufficient review by me) is not quite accurate:
"With the goal of making this standard as HTML-agnostic as possible, all valid HTML entities in any context are recognized as such and converted into unicode characters before they are stored in the AST."
As the later paragraph makes clear, it's not "any context," but rather "any context besides code spans or code blocks." I'll change this.
+++ Vitaly Puzrin [Jan 14 15 10:49 ]:
That was intended to make it clear that the spec is concerned only with how the input document is transformed into the AST, and not with how the AST is rendered as HTML.
...
That's a bit confusing. AST usually means data representation via tree-like structures. I expected that the spec defines the markup syntax plus some specific mandatory text transforms like unescaping. Formally, markdown-it does not have an AST and therefore does not satisfy the spec requirements :)
Saying what the syntax means is, effectively, defining a function from the input source to an AST. An implementation need not actually construct a representation of the AST internally, but it must give output that would make sense for the AST that the spec defines for the input.
We don't want to make the spec HTML-specific, because we want a conforming implementation to be able to target multiple formats.
Re (2): the spec already specifies that ... and gives a number of examples. This conforms to prior practices with Markdown processors, and shouldn't change.
Okay. If that's the only significant difference and you agree with the rest, I see no problem doing it now. Such a spec will not prevent me from doing additional "beautifications" (those will not cause broken tests). Everything else can be discussed separately and clarified in the next spec versions (or in an HTML recommendations section).
This is a surprisingly difficult issue. Let me just summarize some of the considerations that have been given. One suggestion was to normalize, for testing purposes, with encodeURI(decodeURI(.)). Of course, this will crash on certain malformed URIs. Here we could either specify a fallback (again, only for purposes of testing) or just refrain from putting such examples in the spec, as @puzrin recommends.
Yes, that's all right. I think the questions here come from different domains.
We will have collisions until the spec content is isolated from other domains. It's not the first time this has happened: the previous significant case was when a spec example required cleaning up markup in image alt attributes. That's good for format converters, but useless for markdown -> HTML processors.
So I'd vote for a clear separation of domain areas and for keeping other things in separate test suites. Having everything in a single file (and without any status marks) creates practical difficulties.
We will have collisions until the spec content is isolated from other domains. It's not the first time this has happened: the previous significant case was when a spec example required cleaning up markup in image alt attributes. That's good for format converters, but useless for markdown -> HTML processors.
Thanks for reminding me of this issue, which is another problematic one. The AST will contain formatted content for the image description. The translation from this to an alt attribute without formatting is lossy.
So, how do we test a converter that only emits HTML against the spec? In general we can translate straightforwardly from HTML to the XML representation of the AST, but not in this case.
I agree with your general point that some kinds of tests don't belong in the spec. For example, we have a suite of tests of pathological cases (50,000 deep nesting etc.). And we have separate tests of the C API. But image descriptions are a central part of the syntax, and tests for them need to go in the spec tests.
I think it's not a big problem to tag specific tests with additional conditions (.ast, optional, ...) on each src/result pair.
I consider it normal if the spec doesn't cover some edge cases which will never happen in real life. If one renders markdown to HTML, who cares about markup in alt attrs? That's like caring about broken URI encoding - not needed.
+++ Vitaly Puzrin [Jan 14 15 16:47 ]:
I think it's not a big problem to tag specific tests with additional conditions (.ast, optional, ...) on each src/result pair.
I don't think this helps with the image description issue. If we tag it as optional, then an implementation could just put anything there and pass.
I consider it normal if the spec doesn't cover some edge cases which will never happen in real life. If one renders markdown to HTML, who cares about markup in alt attrs? That's like caring about broken URI encoding - not needed.
But formatting inside the "image description" (as the spec calls it) is not an edge case. People will naturally write things like
![Image of *homo sapiens*](/human.jpg)
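(For reference, the HTML output for that example would look roughly like `<p><img src="/human.jpg" alt="Image of homo sapiens" /></p>`: the emphasis survives in the AST but is flattened to plain text in the alt attribute.)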
Tagging is a very flexible approach. You can tag some tests as AST-only, some as HTML-only, others as implementation-specific or as proposals. The current tests are not categorized at all - I can't suggest a more detailed and optimized strategy until that's done.
If you agree in general that the spec tests should be re-sorted by their nature, I see no significant problems with the technical side. That was the main point of my posts. If you know of some logical collision preventing it - let's discuss (probably it's time to move this to the forum).
https://github.com/markdown-it/markdown-it/tree/normalize
Play with that branch (make test). I've tried to rewrite the normalizer using the url library. Got many broken tests from the spec:

- / is added after the domain name (http://example.com?abc -> http://example.com/?abc)
- \ is replaced with / in the query (see https://github.com/joyent/node/commit/f7ede33f09187cd9b1874982e813380cd292ef17)

A workaround in the tests is needed, or we have to write a completely custom url parser for markdown :(
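For reference, the first breakage corresponds to this round trip through the legacy Node url module (behaviour of the Node versions of that time; it may differ today):

```js
var url = require('url');

url.format(url.parse('http://example.com?abc'));
// => 'http://example.com/?abc'   (a "/" is inserted after the host)
```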
As far as I can see, this issue has been resolved on both sides. Close it if you don't have other reasons to keep it open.
https://github.com/jgm/CommonMark/blob/master/js/lib/inlines.js#L190
It can throw an exception. Also, it can produce non-standard URLs for international domain names.
I plan to migrate to https://github.com/defunctzombie/node-url instead of home-made magical kludges. That may also require fixing some test content (I haven't checked yet).
Normalization is not mandatory, but it's simpler to use a well-tested solution than to describe all the edge cases and write a correct encoder.
For reference: