USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Changes so sparkler can be launched inside of a Databricks cluster #205

mattvryan-github closed this pull request 3 years ago

mattvryan-github commented 3 years ago

What changes were proposed in this pull request?

Changes so that Sparkler can optionally be configured to run in a Databricks Spark environment.

Is this related to an already existing issue on sparkler?

#204

Will it close an existing issue?

#204

How was this patch tested?

The resulting fat jar was zipped up with the conf and plugin directories and copied to the Databricks File System (DBFS). A script then pulled the archive onto the master node of a cluster, unzipped it, and executed it. Sample crawls and scrapes were performed that persisted their results in a standalone Solr server running on EC2. The results were then pulled from Solr via its REST API.
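For readers who want to reproduce a similar test, the packaging and read-back steps could be sketched roughly as below. This is a minimal illustration, not the script used for this PR: the function names, the file and directory names, the Solr host, and the core name `crawldb` are all assumptions made for the example. The upload to DBFS and the launch on the master node are left out because they depend on the cluster setup.

```python
from __future__ import annotations

import zipfile
from pathlib import Path
from urllib.parse import urlencode


def bundle_for_upload(jar: Path, extra_dirs: list[Path], out_zip: Path) -> list[str]:
    """Zip a fat jar together with whole directories (e.g. conf and plugins).

    Returns the archive member names so the caller can inspect the bundle.
    All paths are supplied by the caller; none are Sparkler defaults.
    """
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        # The jar sits at the archive root.
        zf.write(jar, arcname=jar.name)
        for directory in extra_dirs:
            # Keep each directory's own name as its top-level folder in the zip.
            for path in sorted(directory.rglob("*")):
                if path.is_file():
                    arcname = Path(directory.name) / path.relative_to(directory)
                    zf.write(path, arcname=arcname.as_posix())
        return zf.namelist()


def solr_select_url(base_url: str, core: str, query: str = "*:*", rows: int = 10) -> str:
    """Build a Solr /select URL for reading crawl results back over REST.

    The core name is an assumption; use whichever core the crawl wrote to.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"
```

A call such as `bundle_for_upload(Path("sparkler-app.jar"), [Path("conf"), Path("plugins")], Path("sparkler-bundle.zip"))` would produce a single archive to copy to DBFS, and `solr_select_url("http://solr-host:8983", "crawldb")` would give a URL to fetch with any HTTP client once a crawl has finished. Both the file names and the host here are placeholders.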

Please review https://github.com/USCDataScience/sparkler/blob/master/.github/CONTRIBUTING.md before opening a pull request.

mattvryan-github commented 3 years ago

Documentation to follow

buggtb commented 3 years ago

Epic, thanks @mattvryan-github!