MachinePublishers / jBrowserDriver

A programmable, embeddable web browser driver compatible with the Selenium WebDriver spec -- headless, WebKit-based, pure Java

<!DOCTYPE html> missing in response code. #345

Closed smatei closed 4 years ago

smatei commented 5 years ago

Hi,

I have noticed that for most web pages (I have scraped 8 million of them), the <!DOCTYPE> declaration is not included in the returned page source.

Code:

```java
JBrowserDriver driver = new JBrowserDriver();
driver.navigate().to("https://www.bing.com/");
System.out.println(driver.getPageSource());
```

Response:

<html ...........

Is this on purpose? Is this a bug?

hollingsworthd commented 4 years ago

Added doctypes to the page source; the change will be in the next release.

Thanks for pointing this out! We usually get the page source via the outerHtml method on the html DOM node, which doesn't include the doctype. In cases where outerHtml isn't available (maybe it's not an HTML page, or it's too large and times out, or some other failure scenario I'm not remembering), we have a couple of fallback ways to get the page source. Those fallbacks probably do include the doctype.
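For anyone on an older release, one possible workaround is to rebuild the declaration and prepend it to the serialized html element. The sketch below is only an illustration and is not necessarily how the fix was implemented in jBrowserDriver. It assumes you have already read the three fields a DOM DocumentType node exposes (name, publicId, systemId), for example through a script call; the class and method names (`DoctypeSketch`, `doctypeDeclaration`, `withDoctype`) are hypothetical.

```java
/** Hypothetical helper, not part of jBrowserDriver. */
final class DoctypeSketch {

    /** Builds a declaration from a DocumentType node's name, publicId and systemId. */
    static String doctypeDeclaration(String name, String publicId, String systemId) {
        boolean hasPublic = publicId != null && !publicId.isEmpty();
        boolean hasSystem = systemId != null && !systemId.isEmpty();
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(name);
        if (hasPublic) {
            sb.append(" PUBLIC \"").append(publicId).append('"');
        } else if (hasSystem) {
            // A system identifier without a public one needs the SYSTEM keyword.
            sb.append(" SYSTEM");
        }
        if (hasSystem) {
            sb.append(" \"").append(systemId).append('"');
        }
        return sb.append('>').toString();
    }

    /** Prepends the declaration unless the markup already starts with a doctype. */
    static String withDoctype(String outerHtml, String declaration) {
        String trimmed = outerHtml.trim();
        // Case-insensitive prefix check: "<!doctype" is 9 characters long.
        if (trimmed.regionMatches(true, 0, "<!doctype", 0, 9)) {
            return outerHtml;
        }
        return declaration + "\n" + outerHtml;
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body></body></html>";
        System.out.println(withDoctype(html, doctypeDeclaration("html", "", "")));
    }
}
```

Pages with no doctype at all have no DocumentType node to read from, so a caller would need to skip the prepend step in that case.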

hollingsworthd commented 4 years ago

Released in v1.1.0-RC2, available now on Maven Central.

smatei commented 4 years ago

I used jBrowserDriver for this study:

https://www.advancedwebranking.com/html/

The doctype was missing there, so I retrieved it separately with a plain HTTP client library. While scraping with jBrowserDriver I also noticed memory leaks when reusing the browser instance, so I gave up on reuse. Instead I built a new instance for each page and closed it afterwards, which cost some extra time.
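The one-instance-per-page workaround can be sketched roughly as follows. This is a generic illustration rather than the code used in the study: `FreshInstancePerPage` and `fetchAll` are hypothetical names, and the browser type is left as a type parameter so the pattern can be tried with a stub. The `finally` block is the important part, since it closes the instance even when a page fails to load.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper: one short-lived browser instance per URL. */
final class FreshInstancePerPage {

    static <B> List<String> fetchAll(List<String> urls,
                                     Supplier<B> newBrowser,
                                     BiFunction<B, String, String> fetch,
                                     Consumer<B> close) {
        List<String> sources = new ArrayList<>();
        for (String url : urls) {
            B browser = newBrowser.get();   // fresh instance, nothing carried over
            try {
                sources.add(fetch.apply(browser, url));
            } finally {
                close.accept(browser);      // always release, even if fetch throws
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        // Stub browser for demonstration. With jBrowserDriver the arguments would
        // presumably be JBrowserDriver::new, a lambda calling get(url) and
        // getPageSource(), and JBrowserDriver::quit (an assumption, not tested here).
        List<String> result = fetchAll(
                Arrays.asList("https://example.com/a", "https://example.com/b"),
                Object::new,
                (browser, url) -> "<html><!-- " + url + " --></html>",
                browser -> { /* nothing to release for the stub */ });
        System.out.println(result.size());
    }
}
```

The trade-off is the one described above: startup cost is paid on every page in exchange for not accumulating state in a long-lived instance.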