aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

Failure when parsing invalid HTML #10

Closed: yannham closed 7 years ago

yannham commented 7 years ago

I tried to parse the GitHub page of my project (https://github.com/yannham/mechaml/) using Lambdasoup, but I got an unexpected error from the underlying Markup.ml. When I type the following in a REPL (utop)

Soup.read_file "github.html" |> Soup.parse

where github.html is a dump of the GitHub page linked above, I get

Exception: Failure "require_current_element: None"

I expected Lambdasoup and Markup.ml to fail quietly on invalid HTML5, or at least not to fail with an uncaught exception.

Here is a snapshot of the source of the offending version of the page

aantron commented 7 years ago

Thanks. This is an internal error in Markup.ml that needs to be fixed.

This is due to wrong handling of an unmatched </form> tag in the (ill-formed) HTML input.

I want to note that Markup.ml should not exactly fail quietly, more like report the bad tag to ~report and then recover in a certain way – there is a specific behavior required by HTML5 (see 'An end tag whose tag name is "form"'), so I hesitate to call the correct behavior a failure.
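To make the "report, then recover" idea concrete, here is a toy sketch using only the OCaml standard library. It is not Markup.ml's implementation or API, and it does not follow the HTML5 rules for `</form>`: the `token` and `error` types and the `check` function are hypothetical names invented for illustration. It only shows the general shape, where an unmatched end tag is handed to a `~report` callback and then skipped instead of raising an exception.

```ocaml
(* Toy model only: NOT Markup.ml's code. All names here are hypothetical. *)
type token =
  | Start of string  (* <name> *)
  | End of string    (* </name> *)
  | Text of string

type error = Unmatched_end_tag of string

(* Walk the tokens, tracking open elements on a stack. An end tag with no
   matching open element is passed to [report] and then ignored, rather
   than causing an exception. Returns the element names still open at the
   end of input, innermost first. *)
let check ?(report = fun (_ : error) -> ()) tokens =
  let rec go stack = function
    | [] -> stack
    | Start name :: rest -> go (name :: stack) rest
    | Text _ :: rest -> go stack rest
    | End name :: rest ->
      if List.mem name stack then
        (* Pop up to and including the matching open element. *)
        let rec pop = function
          | [] -> []
          | top :: below -> if top = name then below else pop below
        in
        go (pop stack) rest
      else begin
        (* Recovery: report the stray end tag and carry on. *)
        report (Unmatched_end_tag name);
        go stack rest
      end
  in
  go [] tokens

(* Usage: print each reported error; parsing continues past the stray tag. *)
let () =
  let report (Unmatched_end_tag name) =
    Printf.printf "unmatched end tag: </%s>\n" name
  in
  ignore (check ~report [Start "div"; Text "hi"; End "form"; End "div"])
```

The caller decides what to do with reported errors (log them, collect them, or ignore them), while the traversal itself always runs to completion.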

aantron commented 7 years ago

This should be fixed now (in Markup.ml master). Sorry about the delay – I actually wrote most of this commit back in March, but then I faced making a slightly ugly tradeoff due to the specification, which assumes a DOM-building parser, not being fully compatible with streaming parsing. While thinking about how to resolve that, I eventually got swamped by other work. See the commit message for some detail on what I chose – but it's ultimately just some comments on esoteric HTML error recovery behavior.