USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars, 143 forks

Support JavaScript execution engine for web pages #11

Closed: thammegowda closed this issue 8 years ago

smadha commented 8 years ago

Are there any thoughts on this? Are we planning to use Splash, Selenium, or PhantomJS?

https://github.com/scrapinghub/splash

thammegowda commented 8 years ago

My opinion: Priority(basic things) > Priority(fancy features)

My definition of basics:

In other words, I would love to get a JS engine working within the JVM, so that we can distribute the load across all the nodes in the cluster. If that is not possible, the second option would be to proxy to an HTTP API (assuming the JS execution is still distributable behind the proxy).

@smadha Come on board, research the options, and get us a working piece of code. Thanks!
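
To make the second option above (proxying to an HTTP API) concrete, here is a minimal sketch in Java. It assumes a Splash instance is reachable at some base URL and uses Splash's `render.html` endpoint with its `url` and `wait` parameters. The class name `SplashRenderProxy`, the method names, and the 0.5 second wait are hypothetical choices for illustration, not anything from the Sparkler code base, and this is not necessarily the approach taken in the eventual fix.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches JS-rendered HTML through a Splash HTTP API.
class SplashRenderProxy {
    private final String baseUrl;  // assumed, e.g. "http://localhost:8050"
    private final HttpClient client = HttpClient.newHttpClient();

    SplashRenderProxy(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Builds the render.html request URI; the target page URL is percent-encoded.
    static URI buildRenderUri(String baseUrl, String pageUrl, double waitSeconds) {
        String encoded = URLEncoder.encode(pageUrl, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/render.html?url=" + encoded + "&wait=" + waitSeconds);
    }

    // Asks the rendering service to load the page, run its scripts, and return the HTML.
    String fetchRendered(String pageUrl) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(buildRenderUri(baseUrl, pageUrl, 0.5))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Render request failed with status " + response.statusCode());
        }
        return response.body();
    }
}
```

Each crawler task could hold its own `SplashRenderProxy`, so the fetch calls are spread across the cluster nodes. Whether the rendering itself scales then depends on how many service instances sit behind the base URL.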

smadha commented 8 years ago

#37

thammegowda commented 8 years ago

Resolved by #37.