databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0
1.99k stars 494 forks

Use sparkdl on jupyter notebook, without web connection #214

Open christophelebrun opened 4 years ago

christophelebrun commented 4 years ago

Hello,

I am running a Jupyter notebook on an EMR instance without access to the web. I have downloaded the sparkdl .jar file to an S3 bucket.

I tried:

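# Creating SparkSession
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # spark.jars adds the jar to the driver and executor classpaths
    .config("spark.jars", "s3://my_bucket/libs/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar")
    .getOrCreate()
)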

This cell runs without error.

But I get an error when running from sparkdl import DeepImageFeaturizer:

ModuleNotFoundError: No module named 'sparkdl'

Any idea how to fix that?

spark-water commented 4 years ago

Use spark.jars.packages instead of spark.jars. I also had no success using a local package (in your case, the one you put in an S3 bucket) because its parent dependencies are missing. You should pull the package from the Databricks Spark Packages site. I know this has limitations, but so far I have not been able to find another solution.
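A minimal sketch of that suggestion, under a few assumptions: the package coordinate below is inferred from the jar file name in the question (the "databricks" group name is a guess and should be checked against the Spark Packages listing), and the helper names sparkdl_conf and create_session are hypothetical, introduced only for illustration. Resolving spark.jars.packages needs network access to the package repository, so this does not by itself solve the offline case.

```python
# Coordinate in Spark's "group:artifact:version" form. The version string is
# taken from the jar file name in the question; the "databricks" group is an
# assumption, not something confirmed in this thread.
SPARKDL_PACKAGE = "databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11"


def sparkdl_conf(package=SPARKDL_PACKAGE):
    """Return the Spark settings that request sparkdl as a package."""
    return {"spark.jars.packages": package}


def create_session(conf):
    """Build (or reuse) a SparkSession with the given settings applied."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Typical use in a notebook cell (needs pyspark and repository access):
# spark = create_session(sparkdl_conf())
# from sparkdl import DeepImageFeaturizer
```

Note that spark.jars.packages is read when the Spark JVM starts, so if a session already exists in the notebook, getOrCreate() will return it without resolving the package. In that case the setting has to be supplied before the first session is created.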