Open markijbema opened 10 years ago
See also #7
So, the html posted above is not a valid HTML document. It's an incomplete fragment (no <body>
tag set, no enclosing </html>
). How should this be treated? Should it become a valid document, or be left as a fragment?
To clarify, if we tell nokogiri to treat it as a fragment (after telling it to parse as HTML instead of XML), we get the following:
<title>Hoi</title><script async defer src="http://foo.com/bar.js"></script><h1>yo</h1>
However, if we tell nokogiri to treat it as a document, we get this:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hoi</title>
<script async defer src="http://foo.com/bar.js"></script>
</head>
<body><h1>yo</h1></body>
</html>
Which is correct?
I would say the latter. If we care about the exact bytes/chars, we can still use text. If we care about content/structure of html, the latter will result in more useful diffs (which is very useful in pullrequests)
I simplified the example I think. I'm not sure whether actual validity was an issue, and removing the html opening tag would make it valid html5
It's important to NOT change the output.
If I have an HTML fragment, I want to have standardised formatting (so :text is not useful in that case), which is why I introduced nokogiri in the first place (I tried using Tidy, but couldn't figure out how to get it running). This avoids spurious whitespace failures.
(Sorry it took so long to respond. I'm at a conference, and I find it very hard to keep track of issues/pull requests during conferences).
The first half of your comment contradicts the point about the spacing though. Spaces can be very significant in html, and changing it can break your app. Those changes would go undetected, but others wouldn't. I think we agree the only use of a formatter is to change the output. The question is the amount of change it should apply. To me, allowing changing spacing, but not other stuff seems a bit arbitrary.
I don't think it's wise to use the same formatter for html documents, and html fragments. I never assumed this formatter was aso used for html fragments. I'm not sure about the semantics of the XML parsers, but html fragments don't need to be valid xml documents (for instance, they don't need a root).
Regarding invalid documents there are still multiple strategies. Do you think this is an invalid strategie for valid documents? For invalid documents we could (for instance)
Spaces can be very significant in html, and changing it can break your app.
Ah, see this is where I fall down because I don't know front-end stuff! OK, cool. Thanks so much for clarifying.
I think that either of these strategies could work. @hotgazpacho suggested just outputting a warning. That might be the best thing to do since it's really up to the user what the real thing they're trying to verify is.
the following (valid) html:
becomes this: