Selenium-based protocol implementation

apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

https://stormcrawler.apache.org/

Apache License 2.0


Selenium-based protocol implementation #144

Closed jnioche closed 7 years ago

jnioche commented 9 years ago

This should allow us to deal with dynamic content. See discussion #142. Ideally we'd want to be able to define actions/navigations either programmatically or via configuration.

We could use:

selenium directly
or via crawljax
ghostDriver with PhantomJS
chromedriver via the Selenium API

jnioche commented 8 years ago

Use jBrowserDriver? 100% Java and headless.

kkrugler commented 8 years ago

I think jBrowserDriver requires Java 8 - would that be an issue?

Also, in the past we used HTMLUnit, though not without challenges.

jnioche commented 8 years ago

@kkrugler we could put that in a separate repo so that the Java 8 requirement does not extend to core and the other modules.

Nutch has an HTMLUnit-based protocol implementation, I think, but I'm not sure it has been used much yet and I haven't heard much about it. There's also a Selenium one.

jnioche commented 7 years ago

Example of JS content

rkrombho commented 7 years ago

Maybe Geb?

It's very easy to use and based on Selenium WebDriver, which means it supports all browsers that have a driver implementation. Users could theoretically decide whether they want to run headless (e.g. HtmlUnitDriver, PhantomJSDriver), go with a real browser, or use Selenium Grid with a variety of different browsers.

I did some very intensive integration testing with Geb (including waiting for AJAX responses etc.) and it is absolutely awesome. It would be easy to let users provide Groovy/Geb scripts that are executed against the page currently being crawled, but I have no idea how this could work with the Protocol interface.
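To make "waiting for AJAX responses" concrete, here is a minimal sketch of the underlying idea: poll a condition until it holds or a timeout expires. It uses only the Java standard library and is not Geb's or Selenium's API (both ship their own wait utilities); the class name `PollingWaitSketch`, the method `waitUntil` and the simulated condition are all made up for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Geb, Selenium or StormCrawler.
class PollingWaitSketch {

    /**
     * Polls the condition until it returns true or the timeout expires.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition holding
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the AJAX-loaded element is now present":
        // the condition becomes true on the third check.
        int[] checks = {0};
        boolean loaded = waitUntil(() -> ++checks[0] >= 3, 1000, 10);
        System.out.println("content loaded: " + loaded);
    }
}
```

In a browser-backed protocol the condition would presumably query the driver for an element or for a page-ready flag, and the timeout would bound how long a single fetch can block.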

iRajashekharC commented 7 years ago

Hi @jnioche - curious to know if the current version of stormcrawler supports this Ajax/Dynamic content parsing?

Thanks Raj

jnioche commented 7 years ago

Hi @raaz1234, see branch https://github.com/DigitalPebble/storm-crawler/tree/jBrowserDriver. Not yet merged but please give it a try

owenrh commented 7 years ago

Hi @jnioche - is it just a case of configuring http.protocol.implementation to use the JBrowserProtocol? Or is more needed to make this work?

jnioche commented 7 years ago

Hi @owenrh (am sitting at your desk, will try not to leave crumbs). Yes, should be just that indeed!
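For readers following along, the exchange above amounts to overriding a single configuration entry. The sketch below shows this on a plain Java map, assuming the crawler configuration can be treated as a key/value map. Only the key `http.protocol.implementation` comes from the thread; the fully qualified class name is a placeholder, since the thread does not give the real package of JBrowserProtocol, and `ProtocolConfigSketch` and `withBrowserProtocol` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not code from the StormCrawler repository.
class ProtocolConfigSketch {

    // Configuration key mentioned in the thread.
    static final String HTTP_IMPL_KEY = "http.protocol.implementation";

    // PLACEHOLDER: the real package of JBrowserProtocol is not stated in the thread.
    static final String JBROWSER_PROTOCOL_CLASS = "com.example.protocol.JBrowserProtocol";

    /** Returns a copy of the base configuration with the HTTP protocol implementation overridden. */
    static Map<String, Object> withBrowserProtocol(Map<String, Object> base) {
        Map<String, Object> conf = new HashMap<>(base);
        conf.put(HTTP_IMPL_KEY, JBROWSER_PROTOCOL_CLASS);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = withBrowserProtocol(new HashMap<>());
        conf.forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

In practice the same override would more likely live in the crawler's configuration file than in code, and https URLs would presumably need an equivalent setting.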

jnioche commented 7 years ago

@owenrh please have a look at #457

owenrh commented 7 years ago

@jnioche ha, thanks for the msgs, had an error on my inbox filters so I missed them. Will check it out, ta.

jnioche commented 7 years ago