executablebooks / mystmd

Command line tools for working with MyST Markdown.
https://mystmd.org/guide
MIT License

HTML parsing breaks easily with whitespace #1332

Closed: fwkoch closed this issue 1 week ago

fwkoch commented 2 weeks ago

Description

Given this simple HTML:

<table><tr><th>Heading</th></tr>

<tr><td>Data</td></tr></table>

MyST fails to parse this as a single table. Instead, it creates a table node for the first line and a tableRow node for the second line. MyST attempts to handle this with the reconstructHtml transform (https://github.com/executablebooks/mystmd/blob/main/packages/myst-transforms/src/html.ts#L243), but that function makes major assumptions about the shape of the HTML, and those assumptions are easily violated.

Proposed solution

We probably need an actual HTML parser in the reconstructHtml transform so we know if the block of HTML is entirely closed or if it still has open elements.
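To make "entirely closed" versus "still has open elements" concrete, here is a minimal sketch of a tag-balance check. It is not mystmd code and not a real HTML parser: openElements and VOID_ELEMENTS are hypothetical names, and the regex deliberately ignores comments, CDATA, and attribute values that contain angle brackets.

```typescript
// Hypothetical helper for illustration only; not part of mystmd.
// Elements that never take a closing tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Returns the names of elements that are opened but never closed in `html`.
// An empty array means the fragment is balanced.
function openElements(html: string): string[] {
  const stack: string[] = [];
  // Groups: 1 = "/" on a closing tag, 2 = tag name, 3 = "/" on a self-closing tag.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)(?:\s[^<>]*?)?(\/?)>/g;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === '/';
    if (VOID_ELEMENTS.has(name) || isSelfClosing) continue;
    if (isClosing) {
      // Close the nearest matching element along with anything nested inside it.
      // A closing tag with no matching opener is ignored.
      const index = stack.lastIndexOf(name);
      if (index !== -1) stack.splice(index);
    } else {
      stack.push(name);
    }
  }
  return stack;
}
```

On the example above, openElements('<table><tr><th>Heading</th></tr>') returns ['table'], which signals that the first html node is incomplete, while the two lines joined together return an empty array.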

fwkoch commented 2 weeks ago

There should be an easy first pass on this that just combines sequential HTML nodes together if they are only separated by whitespace...
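A rough sketch of that first pass, under simplifying assumptions: siblings are modeled as flat { type, value } objects, and whitespace between html nodes is assumed to show up either as nothing at all or as whitespace-only text nodes. SimpleNode, isWhitespaceText, and mergeSequentialHtml are hypothetical names; this is not the change that landed in mystmd.

```typescript
// Simplified stand-in for an mdast node; real nodes carry more fields.
type SimpleNode = { type: string; value?: string };

function isWhitespaceText(node: SimpleNode): boolean {
  return node.type === 'text' && (node.value ?? '').trim() === '';
}

// Combines runs of html nodes that are adjacent or separated only by
// whitespace-only text nodes. Input nodes are not mutated.
function mergeSequentialHtml(nodes: SimpleNode[]): SimpleNode[] {
  const result: SimpleNode[] = [];
  let gap: SimpleNode[] = []; // whitespace nodes held back after an html node
  for (const node of nodes) {
    const last = result[result.length - 1];
    if (last?.type === 'html' && isWhitespaceText(node)) {
      // Hold the whitespace until we know whether more html follows.
      gap.push(node);
    } else if (last?.type === 'html' && node.type === 'html') {
      // Join the two html nodes, keeping the held whitespace (or a newline).
      const between = gap.map((n) => n.value ?? '').join('') || '\n';
      result[result.length - 1] = {
        type: 'html',
        value: `${last.value ?? ''}${between}${node.value ?? ''}`,
      };
      gap = [];
    } else {
      // Not a merge: flush any held whitespace unchanged, then keep the node.
      result.push(...gap, { ...node });
      gap = [];
    }
  }
  result.push(...gap);
  return result;
}
```

With the table from the description split into two html nodes around a '\n\n' text node, this returns a single html node containing the whole table. A sequence such as html, non-blank text, html is left as three separate nodes.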

fwkoch commented 2 weeks ago

I can't actually reproduce this issue in any case other than blank whitespace interruptions, so maybe we don't need more complicated HTML parsing. Once #1333 lands, I'll consider this closed. We can always re-open if we encounter more issues!

fwkoch commented 2 weeks ago

Ok - it's still pretty easy to get html to fail, for example:

<b>blah

</b>

doesn't work: the bold tag is simply lost. However, fixing that requires a deeper change to the actual parsing and html node creation. This issue only covers correct HTML that is split across mdast nodes.

fwkoch commented 1 week ago

The fix to resolve the specific example in the original issue has been released, so I am going to consider this closed.

As my previous comment suggests, there is potentially more work to be done for smarter html parsing, but that can wait until more real examples come up.