Closed darioteixeira closed 9 years ago
In TyXML, we need to replace the current HTML parser with a camlp4-free version (in order to have a nice ppx). I know that @Eyyub started something (https://github.com/Eyyub/xmltoty), but I don't know how complete it is.
It would clearly be great to share the same HTML parser.
EDIT: oh, I noticed you are talking about Lambxml, not raw HTML. My point stands nevertheless, so as to be able to embed HTML using the "side effect".
(It's the end of my internship, so I will work on xmltoty soon; it will now produce an OCaml AST rather than a string.)
Yes, it would be cool to have the same HTML parser.
We may be able to share the parser depending on its architecture. Will xmltoty be a generic XML parser with an XHTML5 layer on top, or built from the ground up to just parse HTML5? (The HTML5 accepted by browsers is very lenient...)
Actually, xmltoty (which is now ppx_tyxml) just uses Xmlm to parse HTML. It is not an XML parser itself: ppx_tyxml takes an XML string and outputs an OCaml AST containing TyXML function calls.
But as you said, there are some drawbacks to using an XML parser to parse HTML (e.g. an unclosed <br> is forbidden).
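To make the <br> problem concrete, here is a minimal sketch, assuming the xmlm library is installed. The helper name `check_well_formed` is hypothetical and is not part of ppx_tyxml or Xmlm; it only reports whether Xmlm accepts a string as a single well-formed document.

```ocaml
(* Sketch only: requires the xmlm library (e.g. the `xmlm` findlib package).
   [check_well_formed] is a hypothetical helper, not part of any library. *)
let check_well_formed (s : string) : (unit, string) result =
  let i = Xmlm.make_input (`String (0, s)) in
  try
    (* Read one whole document, discarding the tree that gets built. *)
    ignore (Xmlm.input_doc_tree ~el:(fun _ _ -> ()) ~data:(fun _ -> ()) i);
    Ok ()
  with Xmlm.Error (_pos, e) -> Error (Xmlm.error_message e)

let () =
  (* XML-style self-closing tag: accepted. *)
  assert (check_well_formed "<p>one<br/>two</p>" = Ok ());
  (* HTML-style unclosed tag: the </p> does not match the open <br>. *)
  match check_well_formed "<p>one<br>two</p>" with
  | Ok () -> print_endline "accepted"
  | Error msg -> print_endline ("rejected: " ^ msg)
```

Browsers accept the second input because HTML treats br as a void element; a strict XML parser has no such notion, which is the drawback mentioned above.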
Does Lambdoc need an HTML parser?
Not really. Though the intersection between Lambxml and HTML is significant, there are large enough differences that using a vanilla HTML parser for Lambxml would be problematic.
Anyway, I'll be using Xmlm for parsing Lambxml too. I intend to push that code soon.
Unfortunately, there's a serious snag with Xmlm: there's no clean way to copy an XML sub-tree verbatim, which is a requirement for Lambxml, since it accepts embedded MathML fragments that must not be touched. The migration to Xmlm must therefore be put indefinitely on hold.
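For illustration, the closest approximation with Xmlm's streaming API is probably to re-serialize the signals of a sub-tree, as in the sketch below. This is an assumption about how one might work around the snag, not Lambxml's code; `reserialize_subtree` and `extract_first` are hypothetical names, and the example deliberately avoids namespaces. The result is a re-encoding of the parsed fragment rather than the original bytes, so details such as attribute quoting or whitespace inside tags may come out normalized.

```ocaml
(* Sketch only: requires the xmlm library. Both functions are hypothetical. *)

(* Called right after [`El_start tag] was read from [i]: copies that element
   and all of its descendants into a fresh buffer, signal by signal. *)
let reserialize_subtree (i : Xmlm.input) (tag : Xmlm.tag) : string =
  let buf = Buffer.create 256 in
  let o = Xmlm.make_output ~decl:false (`Buffer buf) in
  Xmlm.output o (`Dtd None);        (* Xmlm expects a `Dtd signal first *)
  Xmlm.output o (`El_start tag);
  let rec copy depth =
    if depth > 0 then begin
      let s = Xmlm.input i in
      Xmlm.output o s;
      match s with
      | `El_start _ -> copy (depth + 1)
      | `El_end -> copy (depth - 1)
      | `Data _ | `Dtd _ -> copy depth
    end
  in
  copy 1;
  Buffer.contents buf

(* Scans [xml] for the first element whose local name is [local]. *)
let extract_first (local : string) (xml : string) : string option =
  let i = Xmlm.make_input (`String (0, xml)) in
  let rec scan () =
    if Xmlm.eoi i then None
    else
      match Xmlm.input i with
      | `El_start (((_, l), _) as tag) when l = local ->
          Some (reserialize_subtree i tag)
      | _ -> scan ()
  in
  scan ()

let () =
  let doc = "<doc><p>intro</p><math><mi>x</mi><mo>+</mo><mn>1</mn></math></doc>" in
  match extract_first "math" doc with
  | Some fragment -> print_endline fragment
  | None -> print_endline "no math element found"
```

Because the copy goes through the parser and the serializer, it only preserves the fragment up to XML equivalence, which is why it does not satisfy a strict "verbatim" requirement.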
Lambxml currently uses PXP, which is way too massive and complex for our needs. I think the Lambxml parser would be simpler, lighter, and more performant by switching to something like Xmlm.