USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Sparkler cannot be executed on Databricks because sparkContext not pulled from sparkSession #204

Closed mattvryan-github closed 3 years ago

mattvryan-github commented 3 years ago

Issue Description

When trying to run Sparkler on a Databricks cluster, it fails to see the worker nodes. This is because of the way the Databricks image sets up the Spark environment: a sparkSession already exists, so the sparkContext must be pulled from that sparkSession rather than constructed directly.
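A minimal sketch of that pattern, assuming Sparkler currently builds its own SparkContext from a SparkConf (the object name and appName below are illustrative, not Sparkler's actual API):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Sketch: reuse the session the Databricks runtime provides instead of
// constructing a new SparkContext, which leaves the driver unaware of
// the cluster's workers.
object ContextFromSession {
  def main(args: Array[String]): Unit = {
    // getOrCreate() returns the pre-configured session on Databricks;
    // on a plain Spark deployment it creates a fresh one.
    val spark: SparkSession = SparkSession.builder()
      .appName("sparkler-crawl") // illustrative name
      .getOrCreate()

    // Pull the context from the session rather than `new SparkContext(conf)`.
    val sc: SparkContext = spark.sparkContext
    println(s"Default parallelism: ${sc.defaultParallelism}")
  }
}
```

On Databricks, getOrCreate() returns the session the runtime has already wired to the cluster manager, so the context it exposes can see the workers; a SparkContext built from scratch cannot.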

How to reproduce it

Put the sparkler fat jar, conf, and plugin directories on the master node of a Databricks cluster and try to crawl. You will get messages like:

    2020-10-05 22:50:43 INFO Injector$:97 - Injecting 1 seeds
    2020-10-05 22:50:47 WARN SparkContext:69 - Please ensure that the number of slots available on your executors is limited by the number of cores to task cpus and not another custom resource. If cores is not the limiting resource then dynamic allocation will not work properly!
    2020-10-05 22:51:04 WARN TaskSchedulerImpl:69 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
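One way to confirm whether the driver actually sees any executors is Spark's public status-tracker API; this check is a debugging suggestion, not part of Sparkler:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: list the executors the driver knows about. When the context was
// not obtained from the Databricks-provided session, this typically lists
// only the driver itself, which matches the "Initial job has not accepted
// any resources" warning above.
object ExecutorCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val executors = spark.sparkContext.statusTracker.getExecutorInfos
    executors.foreach(e => println(s"executor host: ${e.host}, port: ${e.port}"))
  }
}
```

On a healthy cluster this prints one line per registered executor in addition to the driver.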

Environment and Version Information

Please indicate relevant versions where applicable.

External links for reference

https://docs.databricks.com/jobs.html

Contributing

If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it! A pull request is in progress.