Open rekado opened 6 years ago
One issue with supporting raw HTML as described by the CommonMark spec is that it requires passing malformed HTML through unchanged, and there is no way to represent malformed HTML in sxml. For example:
<foo>
<bar>
</foo>
</bar>
The spec requires the output to be exactly the same. We cannot support this in sxml, as we can only create valid XML.
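To make the mismatch concrete, here is a minimal sketch using the (sxml simple) module that ships with Guile. An SXML element is a nested list, so its closing tag is implied by where the list ends; the element names below are just the ones from the example above.

```scheme
(use-modules (sxml simple))

;; An SXML element is a list: (tag child ...).  The end of the list is
;; the end of the element, so nesting is always balanced.
(define tree '(foo (bar "text")))

;; Serializing always yields properly nested tags.
(sxml->xml tree)   ; prints <foo><bar>text</bar></foo>
(newline)

;; No SXML value serializes to <foo><bar></foo></bar>: the list for
;; bar would have to end after the list for foo that contains it.
```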
When I originally started the project, I recall the spec mentioning HTML blocks do not need to be supported when the output is something other than HTML. So I did not bother implementing HTML blocks or inline HTML.
So my options are either to stop using sxml, follow the CommonMark spec, and output HTML, or to go off the spec and only allow balanced HTML nodes in HTML blocks and inline HTML. I believe I want to pursue both options, but with more focus on the sxml output. It is on my todo list for guile-commonmark after I update this project to the latest version of the spec.
Hi @OrangeShark! Is there any progress on this? I think it would be nice to support raw html in some way or another. I guess you are not very keen on avoiding sxml altogether?
I suppose a simple way around the problem is to parse inline HTML and print it as text if it is invalid. This would make certain use cases for raw HTML blocks impossible (such as generating head and tail fragments to wrap some other content), but it seems like a small loss compared to not having any HTML block support.
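A rough sketch of that fallback, assuming Guile's (sxml simple) is used as the validity check. The helper name raw-html->sxml is made up for illustration and is not part of guile-commonmark. Note that xml->sxml is a strict XML parser, so perfectly valid HTML such as an unclosed <br> would also end up as text with this approach.

```scheme
(use-modules (sxml simple))

;; Hypothetical helper, not part of guile-commonmark: try to parse a
;; raw HTML fragment as XML; if parsing fails, keep it as plain text.
;; Returns a list of SXML nodes either way.
(define (raw-html->sxml str)
  (or (false-if-exception
       ;; xml->sxml returns (*TOP* node ...); drop the *TOP* wrapper.
       (cdr (xml->sxml str)))
      (list str)))

;; Balanced markup becomes a real element.
(raw-html->sxml "<em>hi</em>")

;; Unbalanced markup stays a string, so it is escaped on output and
;; shows up as literal text instead of as tags.
(sxml->xml `(p ,@(raw-html->sxml "<foo><bar></foo></bar>")))
(newline)
```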
guile-lib contains a "pragmatic" HTML parser, HtmlPrag. I wonder if it could be of help here. It "attempts to recover structure":
"The HtmlPrag parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error."
A point of doubt for me is this one:
"Note that valid XHTML input is of course better handled by a validating XML parser like [SSAX]."
I wonder if guile-commonmark could switch between one parser and the other depending on how well-formed the input is.
I just ran into this myself. @OrangeShark are there any plans to add support for raw HTML? I'm happy to get involved if it's just a matter of developer time.
Soooo I've been working through the problems here this past week and I think I am close to solutions.
Since the CommonMark format allows embedding arbitrary HTML, the resulting AST does not, in the general case, reflect the shape of the HTML node tree. So, as noted above, you cannot directly convert a CommonMark AST to SXML when block/inline HTML nodes are present. You have to serialize to HTML first and then parse that.
I propose the following:
1) Add support for parsing block and inline HTML
2) Allow conversion of CommonMark AST to HTML text by providing a commonmark->html procedure in a new (commonmark html) module
3) For compatibility reasons, convert raw HTML nodes to simple text nodes in commonmark->sxml
I think item 3 is particularly important because it will allow guile-commonmark to continue to work as it does today, without support for embedded HTML. The new commonmark->html interface will allow users to directly serialize to HTML (which is enough for many use-cases) or use their preferred HTML parser to convert it to SXML, such as guile-lib's (htmlprag) (which is what I'd want to do in Haunt). This avoids adding dependencies to guile-commonmark and punts on the complicated subject of HTML parsing (it's a user problem!).
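As a sketch of the "serialize, then parse" route: the snippet below takes the serializer as an argument because commonmark->html is only proposed here and is not in a released guile-commonmark. It assumes the serializer accepts Markdown text and returns a string of HTML, which may not match the final interface. The name markdown->sxml is made up for illustration, and (htmlprag) requires guile-lib.

```scheme
;; Requires guile-lib for (htmlprag).
(use-modules (htmlprag))

;; SERIALIZE stands in for the proposed commonmark->html; it is assumed
;; to take Markdown text and return a string of HTML.
(define (markdown->sxml str serialize)
  ;; HtmlPrag is permissive, so unbalanced raw HTML does not raise a
  ;; parse error; it recovers whatever structure it can.
  (html->sxml (serialize str)))
```

Once the new module exists, a call might look like (markdown->sxml text commonmark->html), with the HTML parser staying a choice made by the caller.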
I have a WIP branch that can parse block and inline HTML that is close-but-not-quite compliant with the spec. I also have a commonmark->html serializer. I'm working on adding a bunch of test cases from the CommonMark spec and tweaking code as I find issues. I hope to open a PR soon.
The CommonMark Spec recognizes HTML blocks, i.e. "a group of lines that is treated as raw HTML (and will not be escaped in HTML output)." See http://spec.commonmark.org/0.26/#html-blocks.
Guile-commonmark does not seem to support these kinds of blocks.