commonmark / cmark

CommonMark parsing and rendering library and program in C
Other
1.6k stars 532 forks source link

Rendering to CommonMark with a column width can create a heading #314

Open egrimley opened 4 years ago

egrimley commented 4 years ago

I see you've carefully avoided accidentally starting a list:

$ printf 'abcd x)' | cmark -t commonmark --width 4
abcd
x)
$ printf 'abcd 1)' | cmark -t commonmark --width 4
abcd 1)

But a heading can be accidentally created:

$ printf 'abcd =' | cmark -t commonmark --width 4 | cmark -t html
<h1>abcd</h1>
$ printf 'abcd -' | cmark -t commonmark --width 4 | cmark -t html
<h2>abcd</h2>

I think you can find this bug (and perhaps others) by writing a program that converts to HTML both directly and via CommonMark with width 4 and compares the two results and aborts if they're different, then fuzzing that program with AFL.

jgm commented 4 years ago

I see what the problem is; it's with the breaking algorithm in render.c. Essentially this keeps adding to a buffer until it's longer than the line length; then it goes back and finds a place to break the buffer. There is code to escape an = or - at the beginning of the line, but it never gets engaged because in this case the = gets laid down at the end of a line and only later relocated. No doubt a better algorithm can be designed; it's important to keep in mind, though, that it's possible for there to be no breaking spaces in several consecutive inlines.

egrimley commented 4 years ago

Would it be better to escape the ')' or '.' or whatever instead of not breaking the line?

That is, when the column width is 5, instead of

abc 1.
x

generate:

abc
1\. x

It might make things simpler because you need a mechanism to escape things at the start of a line anyway, but you can then avoid having a mechanism to avoid breaking lines in bad places.

There's another related bug which might be fixable in the same way. In this case it's the joining of two lines that creates a list where there wasn't one:

$ printf 'ab\n1.\nx' | cmark -t html
<p>ab
1.
x</p>
$ printf 'ab\n1.\nx' | cmark -t commonmark --width 4 | cmark -t html 
<p>ab</p>
<ol>
<li>x</li>
</ol>
egrimley commented 4 years ago

Perhaps the algorithm should be something like:

foreach (word) {
  if (!at_start_of_line) {
    x = escape_for_not_at_start_of_line(word);
    if (x fits on this line, taking account of a following
        LINEBREAK becoming "  " if there is one) {
      emit(x);
      continue;
    } else
      emit(newline);
  }
  // Here we would change "1." to "1\\.". We could decide
  // not to do that if we're at the end of a paragraph or
  // have a LINEBREAK next, but I'd say don't bother with
  // those special cases:
  x = escape_for_at_start_of_line(word);
  emit(x);
}
jgm commented 4 years ago

Would it be better to escape the ')' or '.' or whatever instead of not breaking the line?

Well, that is what the code was intended to do. I was explaining why it isn't, in fact, doing that. The check for "start of line" fails because the 1. is laid down prior to breaking the line, and so not at the start of the line; the current code goes back and inserts the line break retroactively but then it's too late for the escape.

foreach (word) {
  if (!at_start_of_line) {
    x = escape_for_not_at_start_of_line(word);
    if (x fits on this line, taking account of a following
        LINEBREAK becoming "  " if there is one) {
      emit(x);
      continue;
    } else
      emit(newline);
  }

The issue here is that we can't just emit a newline after the first element that doesn't fit on the line, since that would introduce breaks in places where there are no spaces. We can only break on a breakable space. And we might not encounter one til several more iterations through this loop.

egrimley commented 4 years ago

The issue here is that we can't just emit a newline after the first element that doesn't fit on the line, since that would introduce breaks in places where there are no spaces. We can only break on a breakable space.

In my pseudocode, word means an element surrounded by breakable spaces, or at the start or end of a paragraph.

However, the reference to escape_for_not_at_start_of_line probably doesn't make sense because that sort of escaping would have happened already: I was confused here about whether I was operating on strings of characters or bits of tree.

Two more cases perhaps worth mentioning:

  1. Part of a code span can turn into a thematic break:

    $ printf '`x *** x`' | cmark -t commonmark --width 5 | cmark -t html
    <p>`x</p>
    <hr />
    <p>x`</p>

    (You need to be careful about which spaces inside a code span are breakable, but I'd claim you should break the line earlier rather than go beyond the specified column width. So I'd suggest that in this case the "words" are `x *** and x`: inside a code span it's dangerous to break before a *; outside a code span it's safe to break because you would escape it after breakiing.)

  2. These two inputs seem to be equivalent but are rendered differently:

    $ printf 'aa 00' | cmark -t commonmark --width 4
    aa 00
    $ printf 'aa\n00' | cmark -t commonmark --width 4
    aa
    00
jgm commented 4 years ago

These two inputs seem to be equivalent but are rendered differently:

This is intentional. cmark will retain soft breaks in the input when it renders, unless you specify the --nobreaks option. (In that case these outputs will be the same.)

jgm commented 4 years ago

Currently we have this as the abstract renderer:

struct cmark_renderer {
  cmark_mem *mem;
  cmark_strbuf *buffer;
  cmark_strbuf *prefix;
  int column;
  int width;
  int need_cr;
  bufsize_t last_breakable;
  bool begin_line;
  bool begin_content;
  bool no_linebreaks;
  bool in_tight_list_item;
  void (*outc)(struct cmark_renderer *, cmark_escaping, int32_t, unsigned char);
  void (*cr)(struct cmark_renderer *);
  void (*blankline)(struct cmark_renderer *);
  void (*out)(struct cmark_renderer *, const char *, bool, cmark_escaping);
};

outc is provided by the calling module (e.g. commonmark.c) and it handles escaping. It can be sensitive to line position because this information is available in the renderer. But as noted above, this fails for wrapping because the character has already been laid down.

A cleaner option would be to have a separate buffer for characters, and only add them to the main buffer after we've inserted a break and ended the line. Then we'd know their true positions and could escape them appropriately. In this buffer we'd need to store both the character and the type of escaping -- thus, a pair of a cmark_escaping and an int32_t. (For efficiency, perhaps we could use the first 2 bits of the int32 to hold the cmark_escaping; then we could just have an array of int32_t.)