Closed. yannham closed this 7 years ago.
Thanks. This is an internal error in Markup.ml that needs to be fixed.
This is due to incorrect handling of an unmatched </form> tag in the (ill-formed) HTML input.
I want to note that Markup.ml should not exactly fail quietly. Rather, it should report the bad tag to ~report and then recover in a specific way: HTML5 requires a particular behavior here (see 'An end tag whose tag name is "form"'), so I hesitate to call the correct behavior a failure.
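To illustrate the ~report callback mentioned above, here is a minimal sketch of collecting parse errors while letting Markup.ml recover and continue. The function name parse_collecting_errors and the sample input are hypothetical, and which errors get reported for a given input depends on the Markup.ml version in use.

```ocaml
(* Requires the markup library, e.g. `#require "markup";;` in utop. *)

(* Hypothetical helper: parse an HTML string, record every error passed
   to ~report as a human-readable message, and return the messages in
   the order they were reported. *)
let parse_collecting_errors (html : string) : string list =
  let errors = ref [] in
  (* The callback receives a (line, column) location and the error. *)
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  let parser = Markup.parse_html ~report (Markup.string html) in
  (* Errors are only reported as the signal stream is consumed, so
     drain it to force the parser through the whole input. *)
  Markup.drain (Markup.signals parser);
  List.rev !errors

let () =
  (* Sample input with a stray </form>, assumed for illustration only. *)
  parse_collecting_errors "<p>text</form></p>"
  |> List.iter prerr_endline
```

With this pattern the caller decides what to do with the reported errors (log them, ignore them, or abort), while the parser itself keeps going.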
This should be fixed now (in Markup.ml master). Sorry about the delay. I actually wrote most of this commit back in March, but then I faced a slightly ugly tradeoff: the specification assumes a DOM-building parser, which is not fully compatible with streaming parsing. While thinking about how to resolve that, I got swamped by other work. See the commit message for some detail on what I chose, but it's ultimately just some comments on esoteric HTML error recovery behavior.
I tried to parse the GitHub page of my project (https://github.com/yannham/mechaml/) using Lambdasoup, but I got an unexpected error from the underlying Markup.ml. When I type the following in a REPL (utop):
Soup.read_file "github.html" |> Soup.parse
where github.html is a dump of the previously given github page, I get
I expected Lambdasoup and Markup.ml to fail quietly on invalid HTML5, or at least not to fail with an uncaught exception.
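Until the underlying problem is fixed, a caller could guard against the uncaught exception. This is only a sketch: parse_result is a hypothetical helper name, and catching every exception is a blunt workaround rather than anything the libraries prescribe.

```ocaml
(* Requires the lambdasoup library, e.g. `#require "lambdasoup";;` in utop. *)

(* Hypothetical wrapper: turn any exception raised while parsing into an
   Error value carrying the exception's printed form. *)
let parse_result (html : string) =
  try Ok (Soup.parse html)
  with exn -> Error (Printexc.to_string exn)

let () =
  match parse_result "<p>hello</p>" with
  | Ok _ -> print_endline "parsed"
  | Error msg -> prerr_endline ("parse failed: " ^ msg)
```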
Here is a snapshot of the source of the offending version of the page