aantron / markup.ml

Error-recovering streaming HTML5 and XML parsers
http://aantron.github.io/markup.ml
MIT License
146 stars 16 forks source link

Displaced whitespaces for html/body. #32

Closed Drup closed 6 years ago

Drup commented 6 years ago

I discovered that one because it made tyxml's tests fail:

# open Markup 
# to_string @@ write_html @@ signals @@ parse_html @@ string 
    "  <html><head><title>foo</title></head><body></body></html>  "  ;;
- : string = "<html><head><title>foo</title></head><body>  </body></html>"

The whitespaces after html (which, according to the spec, are not relevant for parsing) are moved to inside the body (where, iirc, they are relevant).

Drup commented 6 years ago

Another similar one from the tyxml tests: "<html><head><title>foo</title></head> </html>" used to be parsed "<html><head><title>foo</title></head><body></body></html>" but is now parsed "<html><head><title>foo</title></head><body> </body></html>".

aantron commented 6 years ago

I believe Markup.ml is parsing correctly in the first example. The old, at first glance more reasonable behavior was fixed in #28 and #30.

For comparison, see the output of parse5: https://astexplorer.net/#/gist/a57dce947fc085c1c2a77f6f6190dd74/187cdfb9d84a6b227a9fa445c01c24e126665404

Assuming we are starting in between <body> and </body>, the parser mode will be "in body." The </body> token will change it to "after body," without popping the <body> element, and the </html> token to "after after body." At this point, the parser sees the whitespace, and it is supposed to process it using the rules of "in body," which eventually means adding the whitespace to the current node, which is still the <body> element because it was never popped.

Markup.ml looks wrong about the second example, looking into it.

Drup commented 6 years ago

Assuming we are starting in between and , the parser mode will be "in body." The token will change it to "after body," without popping the element, and the token to "after after body." At this point, the parser sees the whitespace, and it is supposed to process it using the rules of "in body," which eventually means adding the whitespace to the current node, which is still the element because it was never popped.

I admire your faithfulness to the specification, but that's literally insane.

I guess I'll have to change my tests. =')

aantron commented 6 years ago

Yeah, this is pretty much crazy. It looks like the HTML5 authors eventually gave up and decided to pretty much ignore </body> and </html>, i.e. the <body> continues until the end of document, and these "after body" modes exist only to tell people that their HTML is malformed before resuming processing the "body."

The only thing that actually pops <body> and <html> elements is the end of input, IIRC.

aantron commented 6 years ago

@Drup, I can't reproduce the other issue in master. I added it as a test (see above commit), but Markup.ml generates

<html><head><title>foo</title></head> <body></body></html>

which is correct w.r.t. the spec AFAICT.

Drup commented 6 years ago

hmm, I might have screwed up something on the way. It seems these bugs are mostly feature, and tyxml should respect them, so let's close!