fgnass / domino

Server-side DOM implementation based on Mozilla's dom.js
BSD 2-Clause "Simplified" License
769 stars 120 forks source link

HTMLParser doesn’t preserve all whitespace. #49

Open michaelrhodes opened 10 years ago

michaelrhodes commented 10 years ago

First off, thank you so much for putting this module together; it’s exactly what I wanted jsdom to be! My one pedantic little issue is that when given a full document, the parser seems to discard the whitespace of its outermost elements. It also replaces the doctype declaration.

For example, if you pass this document…

<!doctype html>
<html>
  <head>
    <title>A title</title>
  </head>
  <body>
    <h1>A heading</h1>
  </body>
</html>

…into domino.createDocument…

var read = require('fs').readFileSync
var document = require('domino').createDocument
var html = read('./input.html')

console.log(document(html).outerHTML)

…the output is:

<!DOCTYPE html><html><head>
    <title>A title</title>
  </head>
  <body>
    <h1>A heading</h1>

</body></html>

I think it would be ideal if the output matched the input.

cscott commented 10 years ago

So, a couple of standards-related issues here:

1) document.outerHTML isn't an official interface. It's a domino-specific extension, because it's useful. But keep that in mind. document.documentElement.outerHTML is an standard API, but that only gives you the <html> element.

2) I need to check that the standard actually allows DOM Text Nodes in the places you want. My experiments seem to indicate that document.documentElement.outerHTML in browsers deletes the whitespace around the <body> and <head> elements. Can you find a browser where this preserves whitespace?

3) In your example it looks like there is extra whitespace at the end of the </body>, after the </h1>. Is that a typo in your report, or is that really happening? That's a separate bug if so.

michaelrhodes commented 10 years ago

Yeah, you’re right, Chrome 37, Firefox 30, Safari 6.1, and Opera 12.16 all remove the whitespace around <head> and <body>.

As for the extra whitespace before </body>, that is really happening, but it also seems to be the behaviour of browsers (at least Chrome). Try comparing the source of github.com with document.documentElement.outerHTML’s output for example.

cscott commented 10 years ago

So, all that said, lots of people use domino to parse, mutate, then serialize, so I wouldn't object to adding a new domino-specific api (maybe even tweaking document.outerHTML) to use hidden domino-specific properties to preserve more whitespace. But I think we need to keep the standard DOM APIs browser-compatible.