Closed by jnioche 7 years ago
Use jBrowserDriver? 100% Java and headless.
I think jBrowserDriver requires Java 8 - would that be an issue?
Also, in the past we used HTMLUnit, though not without challenges.
@kkrugler we could put that in a separate repo so that Java 8 does not become a requirement for core and the other modules.
Nutch has an HTMLUnit-based protocol implementation, I think, but I'm not sure it has been used much yet and I haven't heard much about it. There's also a Selenium one.
Maybe Geb?
It's very easy to use and is based on Selenium WebDriver, which means it supports all browsers that have a driver implementation. Users could theoretically decide whether to go headless (e.g. HtmlUnitDriver, PhantomJSDriver), use a real browser, or use Selenium Grid with a variety of different browsers.
I did some very intensive integration testing with Geb (including waiting for AJAX responses etc.) and it is absolutely awesome. It would be easy to let the user provide Groovy/Geb scripts that are executed against the page context currently being crawled, but I have no idea how this could work with the Protocol interface.
Hi @jnioche - curious to know if the current version of stormcrawler supports this Ajax/Dynamic content parsing?
Thanks Raj
Hi @raaz1234, see branch https://github.com/DigitalPebble/storm-crawler/tree/jBrowserDriver. It's not merged yet, but please give it a try.
Hi @jnioche - is it just a case of configuring http.protocol.implementation to use the JBrowserProtocol? Or is more needed to make this work?
Hi @owenrh (am sitting at your desk, will try not to leave crumbs). Yes, should be just that indeed!
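For anyone following along, here is a rough sketch of what that override could look like when the configuration is built programmatically. Only the key `http.protocol.implementation` and the short name JBrowserProtocol come from this thread; the fully qualified class name below is a hypothetical placeholder (check the branch for the real package), and the helper `withJBrowserProtocol` is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ProtocolConfigSketch {

    // Key named in the thread above.
    static final String PROTOCOL_KEY = "http.protocol.implementation";

    // ASSUMPTION: hypothetical placeholder, not the real package of
    // JBrowserProtocol. Look it up in the jBrowserDriver branch.
    static final String JBROWSER_PROTOCOL_CLASS =
            "org.example.protocol.JBrowserProtocol";

    // Returns a copy of the base configuration with the protocol
    // implementation overridden; the base map is left untouched.
    static Map<String, Object> withJBrowserProtocol(Map<String, Object> base) {
        Map<String, Object> conf = new HashMap<>(base);
        conf.put(PROTOCOL_KEY, JBROWSER_PROTOCOL_CLASS);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = withJBrowserProtocol(new HashMap<>());
        System.out.println(PROTOCOL_KEY + " = " + conf.get(PROTOCOL_KEY));
    }
}
```

The same single key/value pair could equally go into the crawler's configuration file instead of code; copying the map just keeps the default configuration reusable for topologies that don't need a browser.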
@owenrh please have a look at #457
@jnioche ha, thanks for the msgs, had an error on my inbox filters so I missed them. Will check it out, ta.
This should allow us to deal with dynamic content. See discussion #142. Ideally we'd want to be able to define actions/navigations either programmatically or via configuration.
We could use: