MachinePublishers / jBrowserDriver

A programmable, embeddable web browser driver compatible with the Selenium WebDriver spec -- headless, WebKit-based, pure Java

Crawling data: get call takes 15 secs #282

Open Ipseeta opened 7 years ago

Ipseeta commented 7 years ago

I have made the changes as per the documentation.

val driver = JBrowserDriver(Settings.builder().timezone(Timezone.AMERICA_NEWYORK).userAgent(UserAgent.CHROME).build()) // takes 8-9 secs
driver.get(url) // takes 6-7 secs

The data is crawled, but creating the driver and calling get take about 15 secs in total. This happens with any URL, not just one in particular. Is there any way to reduce it? Please help me with this.

hollingsworthd commented 7 years ago

Thanks for the info. The 8-9 seconds is concerning. It should take more like 1-2 seconds at most (this is the browser initializing). The get URL taking 6-7 seconds is probably normal. If you are crawling and don't care about web pages executing JavaScript and doing AJAX-type things (i.e., you just want the static HTML content), then try Settings.builder().javascript(false).ajaxWait(0)
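To see which phase dominates, the two steps can be timed separately. The following is a minimal sketch using only the Java standard library; `PhaseTimer` is a hypothetical helper introduced for illustration and is not part of jBrowserDriver. The `main` method uses sleeping stand-ins in place of the real browser calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: records how long each named phase takes, in milliseconds. */
final class PhaseTimer {
    private final Map<String, Long> durations = new LinkedHashMap<>();

    /** Runs the work, stores its wall-clock duration under the label, and returns its result. */
    <T> T time(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            durations.put(label, (System.nanoTime() - start) / 1_000_000);
        }
    }

    /** Returns the recorded duration in milliseconds, or -1 if the phase never ran. */
    long elapsedMs(String label) {
        return durations.getOrDefault(label, -1L);
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Stand-ins for the real phases: driver creation, then driver.get(url).
        String driver = timer.time("init", () -> { pause(50); return "driver"; });
        timer.time("get", () -> { pause(20); return driver + " loaded page"; });
        System.out.println("init ms: " + timer.elapsedMs("init"));
        System.out.println("get ms: " + timer.elapsedMs("get"));
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With the library on the classpath, the stand-ins would be replaced by something like `timer.time("init", () -> new JBrowserDriver(Settings.builder().javascript(false).ajaxWait(0).build()))` for the first phase and a call to `driver.get(url)` for the second, so the initialization time and the page-load time are reported independently.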

What language are you using? Scala, Kotlin... ? That might be relevant for the 8-9 second delay.

Ipseeta commented 7 years ago

@hollingsworthd This is Kotlin. But I also tried it using Java, the delay is same.

hollingsworthd commented 7 years ago

When a new instance of JBrowserDriver is created, it will start up another Java process so that if Java's browser crashes it does not bring down the main process. As part of this there are two main things which might contribute to excessive delays: (1) there are other instances of JBrowserDriver running and the next instance is waiting for them to complete and (2) inspecting the classpath and setting up dependencies for the child process to run.

Regarding (1) there is a configurable limit on the number of concurrently running instances. See http://machinepublishers.github.io/jBrowserDriver/com/machinepublishers/jbrowserdriver/Settings.Builder.html#processes-int- but it is set to a default of 2x the number of CPUs which is probably the most anyone would want. Instances of JBrowserDriver should run JBrowserDriver.quit() as soon as they are done working so that they free up the queue of instances waiting for CPU time.
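Why skipping quit() can stall new instances is easier to see with a toy model. The sketch below is an assumption-laden illustration, not jBrowserDriver's implementation: `InstanceLimitModel` is a hypothetical class that treats the process limit as a fixed number of slots, where creating an instance takes a slot and quit() gives it back. Unlike the real driver, which is described above as waiting, the model simply reports whether a slot was free.

```java
import java.util.concurrent.Semaphore;

/** Toy model of a capped pool of browser processes. Not jBrowserDriver code. */
final class InstanceLimitModel {
    private final Semaphore slots;

    InstanceLimitModel(int maxProcesses) {
        this.slots = new Semaphore(maxProcesses);
    }

    /** The default described in the thread: twice the number of CPUs. */
    static int defaultLimit() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    /** Models creating an instance: succeeds only if a slot is free. */
    boolean tryStart() {
        return slots.tryAcquire();
    }

    /** Models quit(): hands the slot back so another instance can start. */
    void quit() {
        slots.release();
    }

    public static void main(String[] args) {
        InstanceLimitModel pool = new InstanceLimitModel(2);
        System.out.println(pool.tryStart()); // true
        System.out.println(pool.tryStart()); // true
        System.out.println(pool.tryStart()); // false: both slots held, nothing has quit
        pool.quit();
        System.out.println(pool.tryStart()); // true: quit() freed a slot
    }
}
```

In real code the equivalent habit would be to call driver.quit() in a finally block, so the slot is returned even when the crawl throws.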

Regarding (2) it is possible to work around or debug these issues by using http://machinepublishers.github.io/jBrowserDriver/com/machinepublishers/jbrowserdriver/Settings.Builder.html#javaOptions-java.lang.String...- to specify the classpath manually, and you could point to some sort of uber jar. There are some older and maybe closed issues for this project where this is discussed in depth. I can link to those later if needed.

hollingsworthd commented 7 years ago

Also regarding (2) you could specify the path to Java itself using http://machinepublishers.github.io/jBrowserDriver/com/machinepublishers/jbrowserdriver/Settings.Builder.html#javaBinary-java.lang.String-
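As a starting point for trying these two settings, the parent JVM can report its own launcher path and classpath. This is a rough sketch using standard system properties only; `ChildJvmInputs` is a hypothetical helper, and the exact form in which these values should be handed to javaBinary(...) and javaOptions(...) is not confirmed here and should be checked against the linked documentation.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: collects values that could be passed on to a child JVM. */
final class ChildJvmInputs {

    /** Path to the java launcher of the JVM running this code. */
    static Path currentJavaBinary() {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        return Paths.get(System.getProperty("java.home"), "bin", windows ? "java.exe" : "java");
    }

    /** Entries of the parent classpath, split on the platform path separator. */
    static String[] classpathEntries() {
        return System.getProperty("java.class.path").split(File.pathSeparator);
    }

    public static void main(String[] args) {
        System.out.println("java binary: " + currentJavaBinary());
        for (String entry : classpathEntries()) {
            System.out.println("classpath entry: " + entry);
        }
    }
}
```

Printing these values first also shows how large the classpath is, which may hint at whether classpath inspection is a plausible source of the startup delay.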

This is useful stuff for debugging. It might point to the cause of this which could be addressed. Interested in hearing how this goes and figuring this out.

Ipseeta commented 7 years ago

Tried the first one. I was using driver.close() instead of driver.quit(). Now changed to driver.quit() and killed all the Java instances.

17:09:36.382 [http-nio-8080-exec-1] INFO extract.ExtractServiceImpl - Elapsed time for getting url using jbrowser in milliseconds: 10828
17:09:37.493 [http-nio-8080-exec-1] INFO extract.ExtractServiceImpl - Elapsed time for building logic using jbrowser in milliseconds: 1110
17:09:37.494 [http-nio-8080-exec-1] INFO extract.ExtractController - Final Elapsed time using jbrowser for url https://github.com/Ipseeta in milliseconds: 17970

Also changed to this: Settings.builder().javascript(false).ajaxWait(0). Will keep you posted after trying (2).

Ipseeta commented 7 years ago

Please find the log for (2) attached: log.txt

Let me know what the issue could be.

Ipseeta commented 6 years ago

Any luck?