aantron / markup.ml

Error-recovering streaming HTML5 and XML parsers
http://aantron.github.io/markup.ml
MIT License
146 stars 16 forks source link

Compile to JS and publish in NPM #22

Open aantron opened 7 years ago

aantron commented 7 years ago

It would be kind of cool if Markup.ml turned out to be the fastest HTML parser in JS – good PR for OCaml/ReasonML. But I don't know if this can be expected :) Maybe it's the slowest :)

Also, I didn't find an HTML5 parser for JS that was actually HTML5-compliant, in an earlier search. This is important because the actual HTML5 specification contains extensive error-recovery logic, which is the result of long discussions about best practices, held between the very well-informed people who created the HTML5 spec.

Not sure how huge the HTML parser would be when compiled to JS, though.

TheSpyder commented 7 years ago

Using js_of_ocaml compiles Markup.ml to JS out of the box, but that's what it's designed to be able to do. Unfortunately the file is 262k of minified JS. JSOO does tend to produce quite large files though due to the way it works.

Bucklescript is going to be tricky at the moment. The setup is easy; it compiles from source using npm-based dependencies, so you just add package.json and bsconfig.json files to the repo describing where the source files are. It wasn't too hard to set up a folder structure that had markup.ml checked out as a dependency of a test project, and uutf checked out as a dependency of markup.

Unfortunately Bucklescript is still based on OCaml 4.02.3, so it's missing Uchar which was added in 4.03. That probably rules out compiling Markup.ml until Bucklescript is upgraded.

I don't know whether it's HTML5 compliant, but I've compiled a cut down version of nethtml (in the ocamlnet library) to JS using Bucklescript and it's 57k minified.

TheSpyder commented 7 years ago

hmm, I hadn't actually looked at the Makefile - I can see pre-4.03 compilation options in there and I just compiled it on 4.02.3 😂 Maybe it can work with bucklescript today. We'll need to figure out how to get uutf working as well though.

I don't know much about how the different compilation targets work but maybe there will need to be a pre-processing script that lets bucklescript compile it?

aantron commented 7 years ago

Markup.ml definitely supports OCaml < 4.03, as you saw :) Indeed, it builds down to 4.01, and its own codebase inherently (and deliberately) builds down to 3.12, uutf just forces 4.01. I don't know if anyone cares about versions that old, but I figured I don't need any recent features.

Let me know if you want any assistance looking into this. I'm not sure about what preprocessing would be necessary – just haven't looked into yet. My hope would be that none is necessary, as none is necessary on OCaml...

On the subject of uutf, I actually have a branch that eliminates it in favor of some custom CPS-based code. It gives a pretty big performance boost on native code (30% IIRC) by being customized for exactly what Markup.ml needs (it's not for general use, though). I don't remember the details of why the speedup right now, though. The performance might not translate to JS. However, if you get stuck on uutf somehow, we could look into finalizing that branch.

Regarding Nethtml, it's definitely not HTML5 compliant. I actually am not sure if it's compliant with anything, but there wasn't much of an HTML spec before HTML5.

In fact, I started with also a cut-down Nethtml in Lambda Soup, and then replaced it exactly for that reason. Besides not truly parsing HTML and not having error recovery, Nethtml is (or was at the time, still is AFAIK) limited to ASCII only (or Latin-1?), which was a real blocker. It also interpreted some attributes incorrectly, did not support namespaces, and did not do well on embedded SVG. So, I wrote Markup.ml to replace it in Lambda Soup.

Because it doesn't implement the giant state machines of HTML, though, Nethtml is a much smaller parser, and it will probably always be smaller when compiled to JS. However, I have a long-standing task to optimize Markup.ml for both speed and size, so I'd be happy to work on that. Part of the reason for writing a thorough test suite was to be able to change the internal representation as needed.

TheSpyder commented 7 years ago

I'm not sure if preprocessing is required either. At this point I probably just need to understand more about how Markup.ml compiles, specifically when Uchar is not in the stdlib. I can see that uutf leverages on the uchar compatibility library, but I'm not sure how it's working with Markup.ml unless it's picking that same library up transitively.

As a starting point I need to constrain which files Bucklescript deals with; by default it grabs everything in the source folder which means it's also failing on translate_entities.ml due to the lack of Yojson. It shouldn't be trying to compile that.

I'll see what I can come up with after digging through the build processes. Eliminating uutf would certainly make things easier :)

Nethtml doesn't explicitly support unicode as far as I can tell but I'm finding that it doesn't corrupt the characters either. It's very lenient with HTML which can be annoying but I'm working with fairly fixed source document structure.

Size is definitely a positive for nethtml. Even with JSOO the 57k of bucklescript code I mentioned earlier only bloats out to 80k.

aantron commented 7 years ago

Yes, Uchar is picked up transitively. Maybe I should add it to opam in the future. I recommend deleting translate_entities.ml, entities.json, and all the *lwt* files. You can also get rid of the test cases, of course.

The structure of Markup is a bunch of implementation modules Markup_*, all "hidden" behind one module, Markup, that collects them all and provides the API. The *Lwt* modules make up their own top-level modules for now, for packaging reasons, but as mentioned, you should omit them.

I imagine, modulo the Uchar problem, this structure should be relatively easy to compile. Markup_stream_io has some dependencies on I/O functions in Pervasives, but everything else is basically pure with respect to I/O.

This may also be somehow useful.

aantron commented 7 years ago

Concerning size, it would take work, but it's possible to replace the implementation with all sorts of things, like tables, etc., but of course it's better to see what the starting point for size is first :)

You can also save on size by omitting the XML parser and writer, so Markup_xml_tokenizer, Markup_xml_parser, Markup_xml_writer. Likewise with the HTML writer. The main value in Markup.ml is the HTML parser.

I had a thought previously to break up Markup.ml into a bunch of packages, so you can just pick and choose what you want.

aantron commented 7 years ago

(I "love" reading my own docs, now I have like ten TODOs of things to improve in CONTRIBUTING.md).

aantron commented 7 years ago

In case you already followed the link to CONTRIBUTING.md, I just cleaned it up a bit.

TheSpyder commented 7 years ago

OK, I've made decent progress. It was all fairly obvious once I realised uchar was a transitive dependency. I checked out markup.ml, uutf and uchar in the right places and added bsconfig.json files to make it all hook together. I now have it compiling without errors 👍

Whether I delete the xml references or not doesn't make any difference to the final size; I'm using rollup.js to bundle the files together and it does tree-shaking. None of the _xml references appear in the bundled JS at all.

My test app was simply:

open Markup

let test = string "<html><body><p>hi</p></body></html>" |> parse_html |> signals |> pretty_print |> write_html |> to_string

let () = print_endline test

The size difference over JSOO was less than I expected, only about 15% (268483 vs 228259 bytes). And Bucklescript is getting something wrong, there's a JavaScript error thrown in the HTML writer when the queue (let rec queue = ref next_signal) ends up with a JS undefined as the value and then the Bucklescript curry implementation falls over.

I'm not sure where to go from here. Due to the slapped-together nature of what I've done I can't easily make a branch to show any of this, although maybe I could do it with git submodules? 🤔

TheSpyder commented 7 years ago

Didn't need submodules in the end, I just added basic package.json files to my github forks along with the bsconfig.json. At that point npm install works ;)

https://github.com/TheSpyder/bs-markup-test

aantron commented 7 years ago

A branch in a fork would have worked, but a repo is fine :)

Let me know what you think is needed to get this into master. It's fine to have parallel build systems for a while, if needed. If you want me to do anything to help, like untangle the Makefile, I can do that.

TheSpyder commented 7 years ago

I did make branches for the extra files needed to work with bucklescript: https://github.com/TheSpyder/markup.ml/tree/bucklescript https://github.com/TheSpyder/uutf/tree/bucklescript https://github.com/TheSpyder/uchar/tree/bucklescript

My repo is just a test app linking to them. This is where removing uutf would help so we wouldn't need to get a PR approved on that repository ;)

Before we think about merging this though we need to report the bug in the bucklescript compiler. I mentioned that I'd hit a problem on discord and (as expected) was told it would really help if a smaller reproduction case can be found.

I'm not sure how to even figure out what the problem is though much less build a simple case. Maybe if we put our heads together on it - I can definitely translate the JS stack into ocaml file references, and probably even step through the code to see what the surrounding conditions are before the error.

TheSpyder commented 7 years ago

The developer of bucklescript picked this up and found the issue without us needing to build a simple test case. It's an edge case in recursion, apparently. There should be a fix available soon!

aantron commented 7 years ago

Great!

Is there a PR/issue/discussion somewhere (Discord? haven't looked in a few days :/). I'm curious to know what the edge case is.

Armael commented 7 years ago

Hi. I'm not sure what the state of this discussion is, but I can report on my recent experience of using markup with js_of_ocaml (wrt bucklescript, I would very much like to use jsoo and not BS, and do not really care for the size of the produced js file).

Essentially it is very easy to run into a stack overflow, when using markup compiled to js. I suspect this is due to the CPS used in the implementation, which I assume is hard for jsoo to optimize into tail-recursion.

One solution I see would be to manually trampoline in markup, which is a somewhat invasive change...

Repro case: https://paste.isomorphis.me/O8j

aantron commented 7 years ago

Well, Markup already abuses some references internally. The long-term roadmap for the library was to get it really really well-tested, then replace the innards with fully imperative code, or whatever is necessary to get desired performance characteristics (or non-stack-busting characteristics, as the case may be). Just, nobody has demanded this to date :)

Taking a look at the repro case...

aantron commented 7 years ago

Alright, so it's confirmed related almost certainly to jsoo. I expanded the lipsum text to about 11000 lines, and that ran just fine when compiled as bytecode and run natively. By comparison, the original few lines, run through jsoo, do break the stack in Node.js.

aantron commented 7 years ago

I've opened an issue for this, #26. I don't know that much about how js_of_ocaml would interact with CPS, but I guess it's time to learn.