dbuenzli / cmarkit

CommonMark parser and renderer for OCaml
https://erratique.ch/software/cmarkit
ISC License
46 stars 8 forks source link

Disappearing strings when using code spans inside table cells #10

Closed jchavarri closed 1 year ago

jchavarri commented 1 year ago

Thanks for creating this library, we are experimenting with it to add Reason syntax code blocks (auto-generated from OCaml syntax blocks) into markdown documents, so far with promising results.

I found a strange behavior when dealing with table cells that contain code spans. The parser seems to remove part of the text coming after the first code span.

Here is a small repro, using the released version 0.2:

# let t = {|
|     Foo    |
|------------|
| `a` or `b` |
|};;
val t : string = "\n|     Foo    |\n|------------|\n| `a` or `b` |\n"

# let doc = Cmarkit.Doc.of_string ~layout:true ~strict:false t;;
val doc : Cmarkit.Doc.t = <abstr>

# let output = Cmarkit_commonmark.of_doc doc;;
val output : string = "\n|     Foo    |\n|------------|\n| `a``b`|\n"

# let o = Cmarkit_html.of_doc ~safe:true doc;;
val o : string =
  "<div role=\"region\"><table>\n<tr>\n<th>Foo</th>\n</tr>\n<tr>\n<td><code>a</code><code>b</code></td>\n</tr>\n</table></div>"

Note how or and the spaces around it are removed from the output when using either html or commonmark renderer. The issue seems to happen only when having two or more code spans. It also affects to the text between the spans, not any text at the beginning or the end of cell outside spans.

I understand that tables are not part of the official Commonmark specification, but is this expected behavior?

jchavarri commented 1 year ago

Another scenario I just ran into. Unsure if the root cause is the same but the symptoms are similar.

When parsing a table cell with some html, like <p>foo</p> the text inside will be erased when printing the ast: <p></p>.

dbuenzli commented 1 year ago

Could you please provide a repro, not sure exactly what you mean.

I nailed down the first bug: text is not being added to the list here. I'm just chasing a space missing between the last inline and last text node. This:

|  Foo                    |
|-------------------------|
| `a` or `b`              |
| before `a` or `b` after |

Now round trips to:

|  Foo                    |
|-------------------------|
| `a` or `b`             |
| before `a` or `b`after |
dbuenzli commented 1 year ago

Ah of course, next is next. The next + 1 should be next here. It's all very localized :-)

Now I'm also generating empty text nodes between two non-text inlines like in `a`*b* which is bad: Doc.of_string says you should :-) get normalized inlines.

dbuenzli commented 1 year ago

When parsing a table cell with some html, like <p>foo</p> the text inside will be erased when printing the ast: <p></p>.

Ah yes this is the same bug. Markdown sees that as raw html, text node, raw html and since those toplevel text nodes were being dropped.

I think everything is fixed now (see commits) so I'm closing. Please confirm if that works for you.

jchavarri commented 1 year ago

I can confirm that when using commit f37c8ea86fd0be8dba7a8babcee3682e0e047d91, both issues are gone. Thanks for the swift fix!

dbuenzli commented 1 year ago

Excellent thanks ! File issues while I still semi-have this unpleasant stuff in my mind.