Closed (Drup closed this 6 years ago)
Another similar one from the tyxml tests: "<html><head><title>foo</title></head> </html>"
used to be parsed as "<html><head><title>foo</title></head><body></body></html>"
but is now parsed as "<html><head><title>foo</title></head><body> </body></html>".
I believe Markup.ml is parsing the first example correctly. The old behavior, which looks more reasonable at first glance, was fixed in #28 and #30.
For comparison, see the output of parse5: https://astexplorer.net/#/gist/a57dce947fc085c1c2a77f6f6190dd74/187cdfb9d84a6b227a9fa445c01c24e126665404
Assuming we are starting in between <body> and </body>, the parser mode will be "in body." The </body> token will change it to "after body," without popping the <body> element, and the </html> token will change it to "after after body." At this point, the parser sees the whitespace, and it is supposed to process it using the rules of "in body," which eventually means adding the whitespace to the current node, which is still the <body> element because it was never popped.
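The mode transitions described above can be sketched as a toy state machine. This is only an illustration of the reasoning, not Markup.ml's or parse5's implementation: the function name `simulate`, the string tokens, and the dictionary of per-element text are all hypothetical, and only the three insertion modes mentioned above are modeled.

```python
def simulate(tokens):
    """Toy model of the "in body", "after body" and "after after body" modes.

    Starts between <body> and </body>, with <html> and <body> already open.
    """
    stack = ["html", "body"]          # stack of open elements
    text = {"html": "", "body": ""}   # text appended to each element
    mode = "in body"

    for tok in tokens:
        if tok == "</body>" and mode == "in body":
            # Switch modes, but do NOT pop <body> from the stack.
            mode = "after body"
        elif tok == "</html>" and mode == "after body":
            mode = "after after body"
        elif tok.isspace():
            # In all three modes, whitespace is processed with the
            # "in body" rules: it is appended to the current node.
            text[stack[-1]] += tok
        else:
            raise ValueError(f"token {tok!r} not modeled in mode {mode!r}")

    return mode, stack, text


mode, stack, text = simulate(["</body>", "</html>", " "])
print(mode)   # after after body
print(stack)  # ['html', 'body']
print(text)   # {'html': '', 'body': ' '}
```

Because `body` is never removed from the stack, the trailing whitespace ends up inside it, which matches the parse5 output linked above.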
Markup.ml looks wrong about the second example; looking into it.
I discovered that one because it made tyxml's tests fail:
The whitespace after the html element (which, according to the spec, is not relevant for parsing) is moved inside the body (where, iirc, it is relevant).