MachinePublishers / jBrowserDriver

A programmable, embeddable web browser driver compatible with the Selenium WebDriver spec -- headless, WebKit-based, pure Java

Rendering saved HTML #192

Closed smadha closed 7 years ago

smadha commented 7 years ago

Hi, we are trying to build a crawler plugin that saves a page's HTML after its JavaScript has executed. I tried jBrowserDriver and it works well when I provide a URL to crawl. We also want to render previously saved HTML in a headless browser. Is that possible with jBrowserDriver?

I tried passing a filesystem path to the HTML file, but that didn't work. The best option for us would be to pass the saved HTML along with the URL in the JBrowserDriver.get(String url) function:

driver.get("http://www.example.com","<HTML>saved html content</HTML>");

This seems to be a crawler-specific feature, but I think it could be very useful for mocking web pages too. If there is no workaround, I am willing to work on this given a little guidance.

smadha commented 7 years ago

If we can override JBrowserDriverServer.get(final String url) and use javafx.scene.web.WebEngine.loadContent(String content) instead of javafx.scene.web.WebEngine.load(String url) on line L296 it could be a great start.

hollingsworthd commented 7 years ago

Interesting. This isn't part of the Selenium APIs really. Not sure about adding such functionality to this project. Perhaps executeScript could be used to write HTML via JavaScript, after visiting about:blank.
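A minimal sketch of the about:blank idea, written so that it compiles without Selenium on the classpath. ScriptRunner and SavedHtmlLoader are hypothetical names introduced here for illustration; ScriptRunner only mirrors the shape of Selenium's JavascriptExecutor.executeScript(String, Object...). The sketch assumes the driver passes script arguments through as arguments[0], which is how the WebDriver API describes it, and has not been verified against jBrowserDriver itself.

```java
import java.util.Objects;

/** Hypothetical stand-in for Selenium's JavascriptExecutor (same method shape). */
@FunctionalInterface
interface ScriptRunner {
    Object executeScript(String script, Object... args);
}

final class SavedHtmlLoader {
    // Replaces the current document with the markup given as the first script
    // argument. Passing the HTML as an argument avoids escaping it into a
    // JavaScript string literal.
    static final String WRITE_SCRIPT =
        "document.open(); document.write(arguments[0]); document.close();";

    /** Writes savedHtml into whatever page the runner currently has open. */
    static void load(ScriptRunner runner, String savedHtml) {
        Objects.requireNonNull(runner, "runner");
        Objects.requireNonNull(savedHtml, "savedHtml");
        runner.executeScript(WRITE_SCRIPT, savedHtml);
    }
}
```

With a real driver, the call would presumably look like driver.get("about:blank") followed by SavedHtmlLoader.load(driver::executeScript, html). The page's URL stays about:blank, so relative links in the saved HTML are not resolved against the original site.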

Also, there are some nice lightweight ways to run a web server. Look into this Python command: python -m SimpleHTTPServer 8000. It can turn any local directory into a website.

smadha commented 7 years ago

@hollingsworthd - Thanks for replying

I like the idea of visiting about:blank and setting the HTML through JavaScript. The one thing missing would be setting the correct URL in the browser so that relative paths to resources can be completed. For example, a tag might have a relative path that needs to be resolved:

<link href="css/app.min.css" rel="stylesheet">
Resolved to -> http://example.com/css/app.min.css
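The resolution step in this example can be sketched with the JDK's java.net.URI. RelativeLinks is a hypothetical helper name, and the page URL http://example.com/index.html is an assumed location for the saved page. The base URL should include a path (at least a trailing slash), since resolution is relative to the base's path.

```java
import java.net.URI;

final class RelativeLinks {
    /** Resolves href against the URL the saved page was originally fetched from. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // prints http://example.com/css/app.min.css
        System.out.println(resolve("http://example.com/index.html", "css/app.min.css"));
    }
}
```

Absolute hrefs pass through unchanged, so the same helper could be applied to every link in a saved document.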

We could try editing the HTML and resolving all the relative paths before passing it to jBrowserDriver, but there could still be a problem loading cross-domain AJAX calls. Please correct me if I am wrong about this and jBrowserDriver does not distinguish cross-domain AJAX. Is there a way I can manipulate the URL through code?

I thought of hosting static web pages, but that's hard to scale for a crawler. I could try creating a mock server and injecting it somehow into jBrowserDriver, but even then it's hard to mimic exact browser behavior, and I could not find a way to inject it. That's a good idea, though.

hollingsworthd commented 7 years ago

The website domains would indeed be an issue. Not sure of an easy way around that aside from a hosts file entry. There may be XSS protections built into the browser--not sure off the top of my head.

I think enhancing jBrowserDriver to support this is probably beyond the scope of the project, and it would be easier in the end to have a separate component that mocks a server. Aside from the Python server, if you want a way to programmatically do it in Java, maybe try something like http://sparkjava.com/ in conjunction with writing to the OS hosts file.
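As a rough illustration of the separate mock-server component, here is a sketch that uses the JDK's built-in com.sun.net.httpserver.HttpServer rather than Spark, so it needs no extra dependency. SavedPageServer and its put method are hypothetical names. It serves saved HTML from memory on a loopback port and does nothing about the domain problem, which would still need something like the hosts file entry mentioned above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SavedPageServer implements AutoCloseable {
    private final Map<String, String> pages = new ConcurrentHashMap<>();
    private final HttpServer server;

    SavedPageServer() {
        try {
            // Port 0 asks the OS for any free port on the loopback interface.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/", exchange -> {
            String html = pages.get(exchange.getRequestURI().getPath());
            byte[] body = (html == null ? "not found" : html).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(html == null ? 404 : 200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    /** Registers saved HTML under a path and returns the URL a driver could visit. */
    String put(String path, String html) {
        pages.put(path, html);
        return "http://127.0.0.1:" + server.getAddress().getPort() + path;
    }

    @Override
    public void close() {
        server.stop(0);
    }
}
```

A crawler could register each saved page with put and pass the returned URL to driver.get, then close the server once rendering is done.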

smadha commented 7 years ago

@hollingsworthd - I have been playing around with jBrowserDriver a little, and so far it serves the purpose, except that it's a little slow. I tried enabling quickRender and headless, and I also use ajaxResourceTimeout and ajaxWait. Is there a way to make it even faster? All I need is the rendered HTML, and I can skip styling too. Can you advise on the points below?

  1. How can I parallelize loading multiple pages quickly? As of now, I create a new instance of JBrowserDriver for every page and quit it after I get the source HTML.
  2. Can we add a custom filter list in settings and skip all URLs that match it, as we do for media in com.machinepublishers.jbrowserdriver.ResponseHandler.getBody? (I can work on a PR.)
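For the first question, one possible shape is a fixed thread pool where each task renders a single page. This is only a sketch: ParallelRenderer and renderAll are hypothetical names, and the renderOne function is a placeholder for whatever creates a driver, loads the URL, reads the page source, and quits. Whether several JBrowserDriver instances run well side by side has not been verified here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelRenderer {
    /**
     * Renders every URL on a pool of the given size and returns url -> HTML
     * in input order. A page whose rendering throws maps to null.
     */
    static Map<String, String> renderAll(List<String> urls, int threads,
                                         Function<String, String> renderOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the pages render concurrently.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                Callable<String> task = () -> renderOne.apply(url);
                pending.put(url, pool.submit(task));
            }
            // Then collect the results in the original order.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException e) {
                    results.put(entry.getKey(), null);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while rendering", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With jBrowserDriver, renderOne would presumably create a driver, call get(url) and getPageSource(), and call quit() in a finally block so that a failed page does not leak a browser process.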

Thanks a lot for helping us out.