commonmark / cmark

CommonMark parsing and rendering library and program in C

The commonmark renderer's escaping strategy is very aggressive #131

Open MathieuDuponchelle opened 8 years ago

MathieuDuponchelle commented 8 years ago

I am currently evaluating the usefulness of cmark as an indentation / formatting tool.

The width parameter already makes cmark quite suitable, though I would remove all soft breaks and normalize before rendering when HARDBREAKS wasn't passed; this would avoid cases like:

A very long line that the user manually broke because it was really getting
too long

rendering to

A very long line that the user manually broke because it was
really getting
too long

Anyway, my real issue here is that a lot of characters are preemptively escaped even though they could actually pass through unescaped. I understand that the initial intent was (rightly) "better safe than sorry", but I think we should now look at how to be more clever with respect to these characters, starting with those most likely to appear in a common sentence.

Case in point: !

Currently given that input:

Hello amigo!

we have this output when running cat hello.md | ./build/src/cmark -t commonmark

Hello amigo\!

This was introduced by efeb7093a3e6a96019b0805fc630a7aa4c31481b. The specific reason is not stated in the commit message, but when resetting cmark to the previous commit and running make roundtrip_test, it becomes clear:

Example 533 (lines 7317-7323) Images
\![foo]

[foo]: /url "title"

--- expected HTML
+++ actual HTML
@@ -1 +1 @@
-<p>!<a href="/url" title="title">foo</a></p>
+<p><img src="/url" alt="foo" title="title" /></p>

589 passed, 1 failed, 0 errored, 0 skipped

I haven't investigated why, but I suspect the roundtrip tests are broken in current master: even when deactivating all escaping in the renderer, the roundtrip_executable test still happily reports success. The issue can be reproduced with current master by "reverting" this commit (might conflict; I edited the test case out) and running roundtrip.sh manually.

Furthermore, some code was obsoleted in this commit (but still lives in the current HEAD):

needs_escaping = [...] c == '!' || [...] || (c == '!' && nextc == '[')

The redundant clause should be removed, but it helps us understand the issue at hand and the shortcomings of the current strategy:

The previous code tried to determine whether ! should be escaped by looking at the next character in the literal stream it was parsed from, bounded by its containing block limits. However here, the next character is NULL, and the issue was triggered by the next character we output, which was '['.

The fix that was introduced for this escaped all !, end of story. However, I think a more appropriate fix would be to escape it "retroactively": the only case I'm aware of where '!' can have a semantic meaning is when it is followed by '[', so in the case where the renderer ends up outputting it on its own (case CMARK_NODE_LINK), we should be able to verify that the previous character output was a '!', and only then insert an escape character in the strbuf. We should obviously make sure to escape it only if it was contained in a TEXT node, but that shouldn't be too hard.
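To make the "retroactive" idea concrete, here is a minimal sketch (hypothetical buffer handling, not cmark's actual strbuf API) of checking the previously emitted character before writing the '[' that opens a link, and only then escaping the trailing '!':

```c
#include <string.h>

/* Hypothetical sketch: before emitting the '[' that opens a link,
 * check whether the previous character written was a literal '!'.
 * If so, escape it in place so the output cannot be re-parsed as an
 * image. Not cmark's actual renderer code. */
static void emit_link_open(char *buf, size_t *len, size_t cap) {
    if (*len > 0 && buf[*len - 1] == '!' && *len + 2 < cap) {
        /* retroactively insert a backslash before the trailing '!' */
        buf[*len - 1] = '\\';
        buf[*len] = '!';
        (*len)++;
    }
    if (*len + 1 < cap) {
        buf[*len] = '[';
        (*len)++;
    }
    buf[*len] = '\0';
}
```

With this, `Hello amigo!` followed by plain text is left alone, and the escape appears only when a link opener immediately follows.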

I haven't studied the code further yet, because I thought this was the most glaring issue. However, I also strongly dislike the underscore escaping, because it requires function_names to be wrapped in an inline code span just to avoid the escaping.

We should determine the conditions under which most if not all of these characters need to be escaped, and try to come up with a better strategy, in order to make cmark more usable as a source-to-source tool. I hope you folks can help me study each case :)

jgm commented 8 years ago

Good catch! Round-trip tests were failing even without any change to escaping. It was not noticed because of a bug in the CMakeLists.txt file. I have fixed this issue.

jgm commented 8 years ago

I'm sympathetic to the general concern as well. Certainly we should try not to add escapes when they aren't necessary. Maybe your idea would work to fix the overly aggressive escaping of !, but * and _ are much harder; you'd essentially have to parse the proposed output again to see whether a given * or _ needs escaping. One simple thing we could do, though: a word-internal _ should never need escaping.
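The word-internal rule can be sketched as a small helper (illustrative, not cmark's code): under the CommonMark flanking rules, an underscore with alphanumeric characters on both sides cannot open or close emphasis, so a renderer could leave it unescaped.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the "word-internal _ never needs escaping" observation.
 * An underscore flanked on both sides by alphanumerics cannot act as
 * an emphasis delimiter, so escaping it is unnecessary. */
static bool underscore_needs_escape(const char *s, size_t i, size_t len) {
    bool alnum_before = i > 0 && isalnum((unsigned char)s[i - 1]);
    bool alnum_after  = i + 1 < len && isalnum((unsigned char)s[i + 1]);
    return !(alnum_before && alnum_after);
}
```

So `function_names` would round-trip unescaped, while a leading or trailing underscore would still be escaped.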

MathieuDuponchelle commented 8 years ago

Good to know you're interested!

I am currently in the (slow and boring) process of converting around 40 docx files of 50+ pages, which we used to communicate with our client, to commonmark (+ hotdoc). That will let us use standard tools to review proposed changes, among many other fairly obvious advantages. Once that is done I expect our need for this to become more immediate (and also for reference links to survive the round trip, but there's already an open issue for that, and it should be trivial).

In the meantime, we should keep trying to figure out the problem cases for each of these symbols, ideally with examples.

You say * and _ are much harder, and I can imagine why, given the sophisticated inline parsing process for emphasis. Still, coming up with specific examples that would currently fail if they were not automatically escaped might help us find clever ideas, or make it clearer why a full new inline parsing round would be required.

jgm commented 8 years ago

I don't understand this comment:

The width parameter makes cmark already quite suitable, though I would remove all soft breaks and normalize before rendering in case HARDBREAKS wasn't passed, this would avoid such cases...

I don't see the behavior you're describing.

 % ./cmark -t commonmark --width 30
A very long line that the user manually broke because it was really getting
too long
^D
A very long line that the user
manually broke because it was
really getting too long

Note that you're not getting a break after "getting" here.

MathieuDuponchelle commented 8 years ago

Gimme a sec, I'll try to reproduce. I had a case that was blatantly wrong, but I made that example up on the spot, and it would require a specific width (which I did not calculate) to make it obvious.

MathieuDuponchelle commented 8 years ago

Would be very nice if you did hang out in some IRC channel btw :P

MathieuDuponchelle commented 8 years ago

Hm I'm sorry, I should learn to read the output of diff:

422,423c432,434
<   - Implement zoom via pitch gestures in Google Earth without having access to
<     its source code
---
>   - Implement zoom via pitch gestures in Google Earth without having
>     access to its source
>     code

I thought the new file was at the bottom, consider I said nothing :)

MathieuDuponchelle commented 8 years ago

By the way, the bottom one is a quirk of pandoc; I often see it at the bottom of files pandoc translates to markdown.

jgm commented 8 years ago

If you can find a reproducible example, submit it as a bug to jgm/pandoc.

+++ Mathieu Duponchelle [Jun 01 16 15:14 ]:

By the way, the bottom one is a quirk of pandoc; I often see it at the bottom of files pandoc translates to markdown.

MathieuDuponchelle commented 8 years ago

Will do

jgm commented 8 years ago

I have

  1. rewritten the roundtrip test suite so that it doesn't raise spurious failures
  2. fixed a bug in the commonmark renderer

so now all tests pass again.

jgm commented 8 years ago

Back to the issue of aggressive escaping of !. Since there are some extraneous issues in this thread, I'll quote the most relevant part of your comment above.

The previous code tried to determine whether ! should be escaped by looking at the next character in the literal stream it was parsed from, bounded by its containing block limits. However here, the next character is NULL, and the issue was triggered by the next character we output, which was '['.

The fix that was introduced for this escaped all !, end of story. However, I think a more appropriate fix would be to escape it "retroactively": the only case I'm aware of where '!' can have a semantic meaning is when it is followed by '[', so in the case where the renderer ends up outputting it on its own (case CMARK_NODE_LINK), we should be able to verify that the previous character output was a '!', and only then insert an escape character in the strbuf. We should obviously make sure to escape it only if it was contained in a TEXT node, but that shouldn't be too hard.

I can think of two general solutions here.

  1. Allow escapes to be inserted retroactively, as you suggest. This seems a bit fragile. The render.c code doesn't know if we're outputting a link; it just gets instructions like "output a literal [foo]" or "output an escaped [foo]". And if we get "output a literal [foo]" we really don't know whether the preceding ! should be escaped. After all, the calling library might be trying to create an image with two separate calls, LITERAL "!" and LITERAL "[foo]". It would be fragile to rely on a convention that the calling library not do that.
  2. So, a better solution, I think, would be to rewrite the render.c code and interface. Currently, when we call the renderer->out function, it actually writes something to the output buffer. We could change that so that renderer->out builds up a stream (e.g. linked list) of instructions: [ESCAPABLE "!", LITERAL "[foo]", ...]. Then, the calling program calls renderer->finalize (or something), and only at this step is the output buffer filled. The renderer can traverse the stream of instructions and use its knowledge about what comes next to determine how to escape ESCAPABLE elements.
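A minimal sketch of what (2) could look like, assuming a simple singly linked instruction list with one-instruction lookahead in the finalize pass (names are illustrative, not a proposed cmark API):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of idea (2): instead of writing straight to the output
 * buffer, the renderer appends instructions; finalize() walks the
 * list and only escapes an ESCAPABLE "!" when the next literal
 * begins with '['. Illustrative types, not cmark's. */
typedef enum { INSTR_LITERAL, INSTR_ESCAPABLE } instr_type;

typedef struct instr {
    instr_type type;
    const char *text;
    struct instr *next;
} instr;

static void finalize(const instr *in, char *out, size_t cap) {
    size_t n = 0;
    for (; in; in = in->next) {
        if (in->type == INSTR_ESCAPABLE && strcmp(in->text, "!") == 0 &&
            in->next && in->next->text[0] == '[' && n + 1 < cap)
            out[n++] = '\\';  /* escape only when '[' follows */
        size_t l = strlen(in->text);
        if (n + l < cap) { memcpy(out + n, in->text, l); n += l; }
    }
    out[n] = '\0';
}
```

The point of the deferred pass is exactly the lookahead: at finalize time the renderer knows what comes next, which the current write-as-you-go design cannot know.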

How does (2) sound to you? I think this would yield a fully adequate solution to the problem of overly aggressive escaping for !. I think that [, *, and _ would still have to be escaped more often than necessary, because it's too hard for the renderer to determine whether something would be parsed as a link or emphasized span. But at least we could omit escaping for (e.g.) a * or _ that is preceded and followed by a space, and for a _ that is internal to a word.

MathieuDuponchelle commented 8 years ago

I will need to do some context switching and look back at the code to give a more informed opinion; my turn to have busy weeks :)

If we decide to tackle an overall refactoring of the commonmark rendering process, I think we should prove, as formally as possible, that the "unnecessary" escaping of emphasis and link symbols is indeed necessary, and why. My use case for this is to have the renderer be as passthrough as possible, then offer some switches to enforce style conventions. For example, bullet lists should use either '*' or '-' depending on the original text, except if the user specifies that all bullet lists should be styled the same way.

Ideally, I'd like to be able to use cmark as a git pre-commit hook :)

jgm commented 8 years ago

+++ Mathieu Duponchelle [Jun 24 16 11:10 ]:

If we decide to tackle an overall refactoring of the commonmark rendering process, I think we should prove, as formally as possible, that the "unnecessary" escaping of emphasis and link symbols is indeed necessary, and why.

I think it's pretty clear that unnecessary escaping will be necessary. Take any complex emphasis parsing case from the spec, and put it into a TEXT node. There will be no way to determine whether escaping is necessary, in general, without reparsing the proposed output and comparing the parsed AST with the original AST. Even if we did that (incredibly expensive) check, it would not tell us where escapes needed to be inserted. I think we just have to live with some unnecessary escaping (or give up the desideratum of reliable round-tripping).

My use case for this is to have the renderer be as passthrough as possible, then offer some switches to enable enforcing style conventions. For example, bullet lists should use either '*' or '-' depending on the original text, except if the user specifies that all bullet lists should be styled in the same way.

On this, see #125. However, there are lots of other things (like indentation of lists) besides the bullet type that would need to be preserved for "minimum mutilation." I'm pretty reluctant to include all that concrete stuff in the AST, as you can see from the talk page linked from #125.

MathieuDuponchelle commented 7 years ago

The more I think about this, the more I reach the conclusion that the only possible fix would be to adopt a somewhat different parsing strategy, with actual tokenization and a second pass comparable to what clang does with clang_annotateTokens. However, I only have a high-level, pretty naive idea of what I would expect to end up with :(

@jgm , what made you give up on the strategy you used in https://github.com/jgm/peg-markdown/blob/master/markdown_parser.leg ?

jgm commented 7 years ago

@MathieuDuponchelle - I'm not sure what part of peg-markdown you think addressed the particular problem in this issue. peg-markdown constructs an AST just like cmark, so in rendering it to markdown again we'd face the same issues about escaping.

peg-markdown is a lot slower than cmark and falls down on some of the pathological tests (exponential behavior). It also has some parsing flaws, though off the top of my head I can't recall what they are. (Obviously, also, it doesn't parse to the CommonMark spec.)

Perhaps instead of inserting string literals into the AST, we should insert pairs of offsets into the original source text (multiple pairs would be needed, in general, because the string contents of an AST node aren't always contiguous in the source). Then we could peek into the source to determine things like which emphasis character was used, whether a character was escaped, etc. Was that the sort of thing you had in mind?
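The offset-pair idea could be sketched roughly like this (illustrative types, not cmark's node structure): each node carries one or more half-open byte ranges into the source, and concrete syntax like the emphasis delimiter is recovered by peeking just outside the content range.

```c
#include <stddef.h>

/* Sketch: nodes keep (start, end) byte ranges into the original
 * source; several ranges may be needed because a node's contents
 * need not be contiguous. Hypothetical types, not cmark's. */
typedef struct {
    size_t start, end;  /* half-open byte range in the source */
} source_range;

typedef struct {
    source_range *ranges;
    size_t n_ranges;
} node_extent;

/* Peek into the source to recover concrete syntax, e.g. which
 * delimiter ('*' or '_') introduced an emphasis node, assuming the
 * first range covers the text inside the delimiters. */
static char emph_delimiter(const char *source, const node_extent *ext) {
    if (ext->n_ranges == 0) return '\0';
    size_t open = ext->ranges[0].start;
    return open > 0 ? source[open - 1] : '\0';
}
```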

MathieuDuponchelle commented 7 years ago

Not necessarily. The reason I mentioned peg-markdown was that it seems to use a more "classical" approach to parsing, though now that I've read the source a bit more it doesn't seem to have a strictly separated lexing phase.

The sort of thing I have in mind would be a first phase where lexical tokens are extracted, from which the original text could still be exactly reproduced, and a grammar parsing phase, similar to what the clang_annotateTokens function does. It would then be possible to modify the AST and dump it again, and as far as I can see only the modified bits might have to be escaped.

As I said, that's pretty high level :)

jgm commented 7 years ago

+++ Mathieu Duponchelle [Nov 22 16 14:37 ]:

Not necessarily. The reason I mentioned peg-markdown was that it seems to use a more "classical" approach to parsing, though now that I've read the source a bit more it doesn't seem to have a strictly separated lexing phase.

The sort of thing I have in mind would be a first phase where lexical tokens are extracted, from which the original text could still be exactly reproduced, and a grammar parsing phase, similar to what the clang_annotateTokens function does. It would then be possible to modify the AST and dump it again, and as far as I can see only the modified bits might have to be escaped.

As I said, that's pretty high level :)

The problem is that with Markdown there's very little you can do by way of tokenizing prior to parsing the grammar, because all the characters that can be syntax delimiters can also just be plain characters, depending on context.

Specific suggestions welcome, of course.

MathieuDuponchelle commented 7 years ago

That's for sure, but there shouldn't be a problem with having these tokens constitute a subsequent Text AST node, right? And in the case where they are interpreted as having semantic meaning and constituting another kind of node, we'd still have the "source mapping" available.

jgm commented 7 years ago

+++ Mathieu Duponchelle [Nov 22 16 16:14 ]:

That's for sure, but there shouldn't be a problem with having these tokens constitute a subsequent Text AST node, right? And in the case where they are interpreted as having semantic meaning and constituting another kind of node, we'd still have the "source mapping" available.

Sorry, I'm a bit lost about what precisely you're proposing.

MathieuDuponchelle commented 7 years ago

Understandable, as I haven't been very specific. Seems I'm gonna have to be, at the risk of looking dumb, damn :P

So the way I would see this working is given this input:

[foo](bar)

the "tokenized representation" would be (pseudo enum):

LEFT_SQ_BRACKET,STR,RIGHT_SQ_BRACKET,LEFT_BRACKET,STR,RIGHT_BRACKET

The ast would obviously look like:

Document
    Paragraph
        Link
            Text

The commonmark writer would then proceed as follows:

If the API was used to update the link destination, there should be logic to update the token list, in that case STR, with a new set of tokens. That way, as I see it, formatting (including whitespace), could be preserved, thus making the commonmark writer appropriate for source to source transformation.

In the above example, the link node would be aware of the extent of the tokens that constitute its destination, for example given:

[foo](ba[r)

The token list would be:

LEFT_SQ_BRACKET,STR,RIGHT_SQ_BRACKET,LEFT_BRACKET,STR,LEFT_SQ_BRACKET,STR,RIGHT_BRACKET

and the destination's extents would be:

STR,LEFT_SQ_BRACKET,STR

Updating the destination would thus replace these three tokens with new ones.
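The key property of this proposal is that concatenating the token texts reproduces the input byte-for-byte, which is what would make source-to-source transformation lossless. A minimal sketch (the enum and tokenizer are illustrative, matching the pseudo-enum above, not a proposed API):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the proposed tokenization: each token keeps its exact
 * source text, so dumping the list reproduces the original input
 * unchanged unless a token was replaced. */
typedef enum {
    TOK_LEFT_SQ_BRACKET, TOK_RIGHT_SQ_BRACKET,
    TOK_LEFT_BRACKET, TOK_RIGHT_BRACKET, TOK_STR
} token_type;

typedef struct {
    token_type type;
    const char *text;  /* exact source bytes for this token */
} token;

static size_t dump_tokens(const token *toks, size_t n,
                          char *out, size_t cap) {
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t l = strlen(toks[i].text);
        if (len + l < cap) { memcpy(out + len, toks[i].text, l); len += l; }
    }
    out[len] = '\0';
    return len;
}
```

Updating the link destination would then amount to swapping the tokens in the destination's extent and dumping the list again.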

MathieuDuponchelle commented 7 years ago

Actually the correct approach for rendering back to commonmark, now that I think of it a bit more, would simply be to dump back the entire, potentially updated list of tokens. Bonus points if there's a notion of "dirtiness" on this list of tokens, to allow updating the source positions once after a set of changes.

Other example:

[foo]

bla bla

[foo]: bar

With the proposed approach, it wouldn't matter that the list of tokens making up the link destination are not contiguous with the rest of the link tokens, and reference links would be preserved.

@jgm, am I making sense here?

jgm commented 7 years ago

+++ Mathieu Duponchelle [Nov 24 16 16:07 ]:

Actually the correct approach for rendering back to commonmark, now that I think of it a bit more, would simply be to dump back the entire, potentially updated list of tokens.

You say "rendering back." But the commonmark renderer doesn't presuppose that the node tree it gets as input came from commonmark source. It might have been constructed programmatically, or parsed from LaTeX, for example.

MathieuDuponchelle commented 7 years ago

Hm, that's indeed the use case I've got in mind, I guess it shows :) However, as I see it, there would be no obstacle to constructing the entire AST programmatically with this approach. I'm not sure what you mean by "parsed from LaTeX", though; is there any intention to have cmark support multiple input formats?

jgm commented 7 years ago

+++ Mathieu Duponchelle [Nov 25 16 08:41 ]:

Hm, that's indeed the use case I've got in mind, I guess it shows :) However, as I see it, there would be no obstacle to constructing the entire AST programmatically with this approach. I'm not sure what you mean by "parsed from LaTeX", though; is there any intention to have cmark support multiple input formats?

I have in fact already used cmark with HTML input: https://github.com/jgm/html2cmark This goes via lua bindings to cmark. We use an HTML parser to parse HTML5, then go through the parse tree and construct a cmark node tree, then render it using the commonmark renderer.

MathieuDuponchelle commented 7 years ago

Right, so that's also programmatic building of the AST, and you wouldn't a priori care about using libcmark to do html -> html passthrough I guess ;)

MathieuDuponchelle commented 7 years ago

To be clear, nothing prevents having a commonmark output of a programmatically built AST using the approach I tried to lay out.

MathieuDuponchelle commented 7 years ago

btw

Then we could peek into the source to determine things like which emphasis character was used, whether a character was escaped, etc. Was that the sort of thing you had in mind?

This could very well work :)

As far as I can tell, we could not do a generic "offsets" field in the node structure, but would instead need node-specific code.

I think the first step would be to not strip and forget things during the block parsing phase (whitespace, newlines, or trailing hashes in heading lines). For example:

# foo ##

The heading node would hold two new fields:

heading.begin_marker = (0, 2) -> '# '
heading.raw_title = (2, 8) -> 'foo ##'
heading.end_marker = (8, 9) -> '\n'

heading.content could continue holding 'foo', at least for now.

The clear advantage of this approach is that it's not intrusive and can be implemented incrementally. I'll try that once I get the occasion :)
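The heading example above can be sketched directly (field names follow the example; they are hypothetical, not cmark's): the node keeps half-open byte ranges for its markers and raw title, so the exact original bytes can be re-emitted.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: a heading node keeps offset ranges into the source for
 * its markers and raw title, allowing byte-exact re-emission.
 * Hypothetical fields, matching the example in the thread. */
typedef struct { size_t start, end; } range;  /* half-open */

typedef struct {
    range begin_marker;  /* e.g. "# "     */
    range raw_title;     /* e.g. "foo ##" */
    range end_marker;    /* e.g. "\n"     */
} heading_extent;

static void heading_roundtrip(const char *src, const heading_extent *h,
                              char *out, size_t cap) {
    const range parts[3] = {h->begin_marker, h->raw_title, h->end_marker};
    size_t len = 0;
    for (int i = 0; i < 3; i++) {
        size_t l = parts[i].end - parts[i].start;
        if (len + l < cap) {
            memcpy(out + len, src + parts[i].start, l);
            len += l;
        }
    }
    out[len] = '\0';
}
```

heading.content could still hold the stripped 'foo' for the other renderers, so the change stays non-intrusive.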

davidcelis commented 1 year ago

I know that this issue is quite old at this point, but is there any chance we could renew attention to it? I recently ended up writing my own "Commonmark Re-renderer" because of the aggressive escaping that cmark performs when outputting a parsed markdown AST back into markdown proper. It would be great if ! were only escaped in the one case where it has meaning (which has already been called out: the ![]() image embed syntax).

Better yet, it would be great if there were a built-in way to output back to commonmark with no escaping. Essentially what I'd like is to be able to parse Commonmark with certain options (e.g. smart punctuation and unsafe HTML) and then be able to re-output markdown (with transformations from those options still applied) that can be re-parsed to produce the same AST, regardless of whether or not there are !s included in the text, without having to try to be clever about removing errant backslashes.

jgm commented 1 year ago

This commit should help with !, but it could still be improved; ! will still be escaped in some cases.

Unfortunately the current node-by-node rendering system doesn't make it easy to improve this further.

davidcelis commented 1 year ago

Dang, that's some fast turnaround! I appreciate any improvement, even if there's further to go. Thank you for taking a quick chip at it!