databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
Apache License 2.0

Error while importing sparkdl in google colab #226

Open jai-dewani opened 4 years ago

jai-dewani commented 4 years ago

Here is the error traceback I get when importing sparkdl:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-4a9be7b8a3d0> in <module>()
----> 1 import sparkdl

1 frames
/usr/local/lib/python3.6/dist-packages/sparkdl/image/ in <module>()
     24 # pyspark
---> 25 from pyspark import Row
     26 from pyspark import SparkContext
     27 from pyspark.sql.types import (BinaryType, IntegerType, StringType, StructField, StructType)

ModuleNotFoundError: No module named 'pyspark'

sparkdl version -> sparkdl-0.2.2

scook12 commented 4 years ago

Hey @jai-dewani, this is expected behavior. Google Colab's environment doesn't include all of Spark's dependencies, including pyspark, hence the ModuleNotFoundError. You'll need to install these dependencies first.
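As a rough sketch of how to confirm this before importing sparkdl, the cell below checks whether a module can be found in the current environment. The helper name `missing_modules` is made up for illustration, and only pyspark (the module named in the traceback) is checked; other dependencies may also need to be added to the list.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pyspark is the module named in the traceback above; extend the list as needed.
missing = missing_modules(["pyspark"])
if missing:
    print("Install these before importing sparkdl:", ", ".join(missing))
else:
    print("pyspark is importable.")
```

If the list comes back non-empty, those modules have to be installed (or put on `sys.path`) before `import sparkdl` can succeed.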

This repo has an example of that, but it's a bit dated, so you might ask @asifahmed90 if you run into any issues. Good luck!

jai-dewani commented 4 years ago

Actually, I followed all the necessary steps from the start, yet I still end up with this problem.
Here is the link to my Colab notebook

When running the notebook, just run the first two subsections and you will end up with the same result. I have been looking hard for a minor mistake or a step I missed, but I can't find anything :/

Edit: A similar issue has been posted with the same problem:

#209 AttributeError: module 'sparkdl' has no attribute 'graph'

SaiNikhileshReddy commented 3 years ago

@jai-dewani, this setup worked for me.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

import findspark
findspark.init()  # puts the pyspark bundled under SPARK_HOME on sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

I arrived at this solution by looking at the latest Spark package distribution page. You can do the same: check that page for the latest versions of Spark and Hadoop. Example: spark-X.X.X/spark-X.X.X-bin-hadoopX.X.tgz.

Change these filenames in the above code as required.
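To avoid editing the version numbers in several places, the names can be derived from two version strings. This is only a sketch: the helper `spark_artifact_names` and its `install_root` parameter are hypothetical, and it assumes the naming pattern quoted above plus the `/content` directory used in the snippet. It does not know the download host, so the URL prefix still has to be supplied by hand.

```python
def spark_artifact_names(spark_version, hadoop_version, install_root="/content"):
    """Build the tarball name, its relative path, and SPARK_HOME from versions.

    Follows the pattern spark-X.X.X/spark-X.X.X-bin-hadoopX.X.tgz.
    """
    dist = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {
        "relative_path": f"spark-{spark_version}/{dist}.tgz",  # path on the distribution page
        "tarball": f"{dist}.tgz",                              # file passed to tar
        "spark_home": f"{install_root}/{dist}",                # value for SPARK_HOME
    }


names = spark_artifact_names("3.1.1", "3.2")
print(names["tarball"])     # spark-3.1.1-bin-hadoop3.2.tgz
print(names["spark_home"])  # /content/spark-3.1.1-bin-hadoop3.2
```

The `spark_home` value is what would go into `os.environ["SPARK_HOME"]`, and `tarball` is the file name given to `tar xf`.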