Consider switching to a lighter-weight XML parser for Lambxml

darioteixeira / lambdoc

Lambdoc is a library providing support for semantically complex documents in Ocsigen web applications

GNU General Public License v2.0

Consider switching to a lighter-weight XML parser for Lambxml #27

Closed darioteixeira closed 9 years ago

darioteixeira commented 9 years ago

Lambxml currently uses PXP, which is far too massive and complex for our needs. I think switching to something like Xmlm would make the Lambxml parser simpler, lighter, and more performant.

Drup commented 9 years ago

In TyXML, we need to replace the current HTML parser with a camlp4-free version (in order to have a nice ppx). I know that @Eyyub started something (https://github.com/Eyyub/xmltoty), but I don't know how complete it is.

It would clearly be great to share the same HTML parser.

EDIT: oh, I noticed you are talking about Lambxml, not raw HTML. My point still stands, though, for being able to embed HTML using the "side effect".

eyyub commented 9 years ago

(It's the end of my internship, so I will work on xmltoty soon. It will now produce an OCaml AST rather than a string.)

Yes, it would be cool to share the same HTML parser.

darioteixeira commented 9 years ago

We may be able to share the parser, depending on its architecture. Will xmltoty be a generic XML parser with an XHTML5 layer on top, or built from the ground up to parse just HTML5? (The HTML5 accepted by browsers is very lenient...)

eyyub commented 9 years ago

Actually, xmltoty (which is now ppx_tyxml) just uses Xmlm to parse HTML; it is not an XML parser itself. ppx_tyxml takes an XML string and outputs an OCaml AST containing TyXML function calls. But as you said, there are some drawbacks to using an XML parser to parse HTML (e.g. <br> is forbidden). Does Lambdoc need an HTML parser?
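
To illustrate the `<br>` drawback mentioned above, here is a minimal sketch that feeds a string to Xmlm and reports whether it parses as well-formed XML. The helper name `well_formed` is hypothetical (it is not part of Xmlm, ppx_tyxml, or Lambdoc), and the snippet assumes the xmlm library is installed.

```ocaml
(* Requires the xmlm library, e.g.:
   ocamlfind ocamlopt -package xmlm -linkpkg example.ml *)

(* [well_formed s] is true if Xmlm can read [s] to the end of input
   without raising a parse error. Hypothetical helper for illustration. *)
let well_formed (s : string) : bool =
  let input = Xmlm.make_input (`String (0, s)) in
  let rec drain () =
    if not (Xmlm.eoi input) then begin
      ignore (Xmlm.input input);
      drain ()
    end
  in
  try drain (); true
  with Xmlm.Error _ -> false

let () =
  (* XML-style self-closing tag: accepted by an XML parser. *)
  Printf.printf "%b\n" (well_formed "<p>line one<br/>line two</p>");
  (* HTML-style void element: the <br> is never closed, so the
     closing </p> does not match and the parser reports an error. *)
  Printf.printf "%b\n" (well_formed "<p>line one<br>line two</p>")
```

Browsers accept the second form without complaint, which is why an XML parser alone is a poor fit for arbitrary HTML.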

darioteixeira commented 9 years ago

Does Lambdoc need an HTML parser?

Not really. Though the intersection between Lambxml and HTML is significant, there are large enough differences that using a vanilla HTML parser for Lambxml would be problematic.

Anyway, I'll be using Xmlm for parsing Lambxml too. I intend to push that code soon.

darioteixeira commented 9 years ago

Unfortunately, there's a serious snag with Xmlm: there's no clean way to copy an XML sub-tree verbatim, which is a requirement for Lambxml (it accepts embedded MathML fragments, which must not be touched). The migration to Xmlm must therefore be put on hold indefinitely.
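
For context, the closest approximation with Xmlm's streaming API is probably to forward the signals of a sub-tree from an input to an output that writes into a buffer. The sketch below assumes that approach; `copy_subtree` is a hypothetical helper, not something Xmlm or Lambdoc provides. Note that this re-serialises the sub-tree rather than copying it verbatim: the original quoting, whitespace inside tags, and entity references are not preserved, and namespace prefixes bound outside the sub-tree would need extra handling.

```ocaml
(* Requires the xmlm library, e.g.:
   ocamlfind ocamlopt -package xmlm -linkpkg example.ml *)

(* Re-serialise the element whose start tag is the next signal on [input].
   Hypothetical helper for illustration only. The result is equivalent
   XML, not a byte-for-byte copy of the source text. *)
let copy_subtree (input : Xmlm.input) : string =
  (match Xmlm.peek input with
   | `El_start _ -> ()
   | _ -> invalid_arg "copy_subtree: expected an element start");
  let buf = Buffer.create 256 in
  let output = Xmlm.make_output ~decl:false (`Buffer buf) in
  (* Xmlm.output expects a `Dtd signal before the root element. *)
  Xmlm.output output (`Dtd None);
  let rec loop depth =
    let signal = Xmlm.input input in
    Xmlm.output output signal;
    match signal with
    | `El_start _ -> loop (depth + 1)
    | `El_end -> if depth > 1 then loop (depth - 1)
    | `Data _ | `Dtd _ -> loop depth
  in
  loop 0;
  Buffer.contents buf

let () =
  let doc = "<doc><math><mi>x</mi></math></doc>" in
  let input = Xmlm.make_input (`String (0, doc)) in
  ignore (Xmlm.input input);  (* skip the `Dtd signal *)
  ignore (Xmlm.input input);  (* skip the <doc> start tag *)
  print_endline (copy_subtree input)
```

Because the fragment passes through Xmlm's parser and printer, any lexical detail the parser normalises is lost, which is the snag described above.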