databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

Error while importing sparkdl in Google Colab #226

Open jai-dewani opened 4 years ago

jai-dewani commented 4 years ago

Here is the error traceback while importing sparkdl:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-4a9be7b8a3d0> in <module>()
----> 1 import sparkdl

1 frames
/usr/local/lib/python3.6/dist-packages/sparkdl/image/imageIO.py in <module>()
     23 
     24 # pyspark
---> 25 from pyspark import Row
     26 from pyspark import SparkContext
     27 from pyspark.sql.types import (BinaryType, IntegerType, StringType, StructField, StructType)

ModuleNotFoundError: No module named 'pyspark'

sparkdl version -> 0.2.2

scook12 commented 4 years ago

Hey @jai-dewani, this is expected behavior. Google Colab's default environment doesn't include Spark or its Python bindings (pyspark), hence the ModuleNotFoundError. You'll need to install those dependencies first.

This repo (https://github.com/asifahmed90/pyspark-ML-in-Colab) has an example of that, but it's a bit dated, so you might ask @asifahmed90 if you run into any issues. Good luck!
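
For a quick check before doing a full Spark install, a minimal sketch (assuming a pip-installable pyspark release compatible with your sparkdl version) is to pull pyspark from PyPI, which ships a local Spark runtime:

# Hedged example for a Colab cell: the unpinned version is an assumption;
# older sparkdl releases target Spark 2.x, so you may need e.g. pyspark==2.4.x.
!pip install -q pyspark
import pyspark                     # the import that previously failed
print(pyspark.__version__)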

jai-dewani commented 4 years ago

Actually I did all the necessary steps from the start, yet I still end up with this problem.
Here is the link to my Colab notebook: https://colab.research.google.com/drive/1nYq-rv6MT78UaiQPcSaFT-PHpsgVBe7R?usp=sharing

While running the notebook, just run the first two subsections and you will end up with the same result. I have been looking hard for any minor mistake I could be making or something I missed, but I can't seem to find anything :/

Edit: A similar issue has been posted with the same problem

#209 AttributeError: module 'sparkdl' has no attribute 'graph'

SaiNikhileshReddy commented 3 years ago

@jai-dewani, this setup worked for me:

# Install Java 8 and download a prebuilt Spark distribution
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

# Point the environment at the Java and Spark installations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

# Make pyspark importable and start a local Spark session
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
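
A quick sanity check (a sketch, assuming the cells above ran cleanly): once findspark has put the SPARK_HOME Python bindings on the path, the imports from the original traceback should resolve.

import pyspark                     # no longer raises ModuleNotFoundError
from pyspark.sql.types import BinaryType, IntegerType, StringType, StructField, StructType
print(spark.version)               # e.g. 3.1.1 for the download above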

I arrived at that solution by looking at the latest Spark package distribution page. You can do the same by checking https://downloads.apache.org/spark/ and looking for the latest versions of Spark and Hadoop, e.g. spark-X.X.X/spark-X.X.X-bin-hadoopX.X.tgz.

Change these filenames in the above code as required.
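
For convenience, here is a parameterized variant of the setup above (a sketch, not an official recipe): the two version strings are the only assumptions, so when a newer build appears on https://downloads.apache.org/spark/ you only change those variables.

import os

SPARK_VERSION = "3.1.1"    # assumption: substitute the latest version listed
HADOOP_VERSION = "3.2"     # assumption: must match the -bin-hadoopX.X suffix
pkg = f"spark-{SPARK_VERSION}-bin-hadoop{HADOOP_VERSION}"

# Java 8, the Spark tarball, and findspark, exactly as in the cells above
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-{SPARK_VERSION}/{pkg}.tgz
!tar xf {pkg}.tgz
!pip install -q findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{pkg}"

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)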