causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License

Malformed HTML parsed differently from browsers #147

Open demurgos opened 9 months ago

demurgos commented 9 months ago

I have the following input HTML file:

<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>

Notice the unclosed <a tag. This is a minimal repro; in my case the input comes from an accidentally truncated DB value.

If I open it in a browser (Firefox/Chrome) and print its DOM with document.getElementsByTagName("html")[0].outerHTML, I get:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>

With scraper, if I parse it with Html::parse_document and print it with doc.root_element().html(), I get:

<html><head></head><body><div><a hr<="" div=""></a><div><a hr<="" div=""><div></div>
</div>
</div></body></html>

Notice that the anchor tag with text bar is missing!

Running this input through html5ever's example sinks, I get output close to what browsers produce (but still not identical, see https://github.com/servo/html5ever/issues/512).

This seems to indicate that the issue is in scraper's TreeSink implementation.