Open demurgos opened 9 months ago
I have the following input HTML file:
<html><body><div><a hr</div><div><div></div> <div><a href="/">bar</a></div></div></body></html>
Notice the unclosed <a tag (this is a minimal repro; in my case it comes from an accidentally truncated DB value).
If I open it in a browser (Firefox/Chrome) and print its DOM with document.getElementsByTagName("html")[0].outerHTML, I get:
<html><head></head><body> <div id="div0"> <a hr="" <="" div=""> </a><div id="div1"><a hr="" <="" div=""> <div id="div2"></div> </a><div id="div3"><a hr="" <="" div=""> </a><a href="/">bar</a> </div> </div> </body></html>
With scraper, if I parse it with Html::parse_document and print it with doc.root_element().html(), I get:
<html><head></head><body><div><a hr<="" div=""></a><div><a hr<="" div=""><div></div> </div> </div></body></html>
Notice that the anchor tag with text bar is missing!
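To pin this down in a regression test, a naive string check for the anchor may be enough. The sketch below is stdlib-only and makes a few assumptions: the helper name `has_anchor_with_text` is hypothetical, the matching is plain string search rather than real HTML parsing, and the three strings are copied verbatim from this report. In an actual test, the scraper string would instead come from Html::parse_document followed by doc.root_element().html().

```rust
// Input and serialized outputs, copied verbatim from this report.
const INPUT: &str = r#"<html><body><div><a hr</div><div><div></div> <div><a href="/">bar</a></div></div></body></html>"#;
const BROWSER_OUTPUT: &str = r#"<html><head></head><body> <div id="div0"> <a hr="" <="" div=""> </a><div id="div1"><a hr="" <="" div=""> <div id="div2"></div> </a><div id="div3"><a hr="" <="" div=""> </a><a href="/">bar</a> </div> </div> </body></html>"#;
const SCRAPER_OUTPUT: &str = r#"<html><head></head><body><div><a hr<="" div=""></a><div><a hr<="" div=""><div></div> </div> </div></body></html>"#;

/// Hypothetical helper: returns true if `serialized` contains an `<a ...>`
/// element whose text content is exactly `text`.
/// This is naive string matching, not an HTML parser.
fn has_anchor_with_text(serialized: &str, text: &str) -> bool {
    let closing = format!(">{}</a>", text);
    // For every occurrence of `>text</a>`, check that the tag opened
    // immediately before it is an `<a` start tag.
    serialized.match_indices(closing.as_str()).any(|(idx, _)| {
        serialized[..idx].rfind('<').map_or(false, |start| {
            let tag = &serialized[start..idx];
            tag == "<a" || tag.starts_with("<a ")
        })
    })
}

fn main() {
    println!("input:   {}", has_anchor_with_text(INPUT, "bar"));
    println!("browser: {}", has_anchor_with_text(BROWSER_OUTPUT, "bar"));
    println!("scraper: {}", has_anchor_with_text(SCRAPER_OUTPUT, "bar"));
}
```

With the strings above, the check passes for the input and the browser output and fails for the scraper output, which matches what I see.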
Running this input with html5ever's example sinks, I get an output close to what browsers produce (but still not the same, see https://github.com/servo/html5ever/issues/512).
This suggests that the issue is in scraper's TreeSink implementation.