Closed smatei closed 4 years ago
Added doctypes to page source; this will be in the next release.
Thanks for pointing this out! We usually get the page source via the outerHtml method on the html DOM node, which doesn't include the doctype. In cases where outerHtml isn't available (maybe it's not an HTML page, or the page is too large and times out, or some other failure scenario I'm not remembering), we have a couple of fallback ways to get the page source. Those fallbacks probably include the doctype.
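As a rough illustration of why the doctype goes missing and how it could be put back: the outer HTML of the html element serializes only that element, while the doctype lives in a separate DOM DocumentType node with `name`, `publicId`, and `systemId` fields. The sketch below is not jbrowserdriver's actual fix. The class and method names (`DoctypeSketch`, `serializeDoctype`, `withDoctype`) are hypothetical, and it assumes the three fields have already been read from the page by some other means.

```java
// Hypothetical helper, not part of jbrowserdriver.
final class DoctypeSketch {
    private DoctypeSketch() {}

    /**
     * Builds a doctype declaration from the name, publicId and systemId
     * of a DOM DocumentType node. Empty or null ids are treated as absent.
     */
    static String serializeDoctype(String name, String publicId, String systemId) {
        boolean hasPublic = publicId != null && !publicId.isEmpty();
        boolean hasSystem = systemId != null && !systemId.isEmpty();
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(name);
        if (hasPublic) {
            sb.append(" PUBLIC \"").append(publicId).append('"');
        } else if (hasSystem) {
            // A system id without a public id needs the SYSTEM keyword.
            sb.append(" SYSTEM");
        }
        if (hasSystem) {
            sb.append(" \"").append(systemId).append('"');
        }
        return sb.append('>').toString();
    }

    /** Prepends the doctype to the element markup; returns the markup unchanged if there is none. */
    static String withDoctype(String doctype, String outerHtml) {
        if (doctype == null || doctype.isEmpty()) {
            return outerHtml;
        }
        return doctype + "\n" + outerHtml;
    }
}
```

For an HTML5 page, `serializeDoctype("html", "", "")` gives `<!DOCTYPE html>`, which `withDoctype` then places on its own line ahead of the element markup.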
Released in v1.1.0-RC2, available now on Maven Central.
I used jbrowserdriver for this study
https://www.advancedwebranking.com/html/
And the doctype was missing in action, so I got it with a static HTTP client library instead. While scraping with jbrowserdriver I also noticed memory leaks when reusing the browser instance, so I gave up on reusing it. I had to build a new instance for each page and then close it, which cost me some time.
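For reference, pulling the doctype out of raw HTML fetched by a plain HTTP client can be done with a small regular expression. This is only a sketch of that kind of workaround, not the code used for the study. `DoctypeExtractor` and `extractDoctype` are hypothetical names, and it assumes the raw response body is already available as a string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class DoctypeExtractor {
    // Allows an optional byte order mark, whitespace and HTML comments
    // before the declaration, then captures the declaration itself.
    private static final Pattern DOCTYPE = Pattern.compile(
            "\\A\\uFEFF?\\s*(?:<!--.*?-->\\s*)*(<!DOCTYPE[^>]*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private DoctypeExtractor() {}

    /** Returns the leading doctype declaration of the raw HTML, if there is one. */
    static Optional<String> extractDoctype(String rawHtml) {
        Matcher m = DOCTYPE.matcher(rawHtml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

The result is empty both when the page has no doctype and when other markup precedes it, so the two cases are not distinguished here.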
Hi,
I have noticed that for most of the webpages (I have scraped 8 million), the <!DOCTYPE> declaration is not returned in the page source.
code:
response
Is this on purpose? Is this a bug?