awslabs / python-deequ

Python API for Deequ
Apache License 2.0

PyDeequ fails to function on Google Colab #26

Closed MikeFreiberger closed 3 years ago

MikeFreiberger commented 3 years ago

When running the PyDeequ tutorial from the readme.md in a Google Colab environment, the cell fails with this output:

+------+--------+----+-----+
|entity|instance|name|value|
+------+--------+----+-----+
+------+--------+----+-----+

To Reproduce
Code run from the readme.md:

from pyspark.sql import SparkSession, Row
import pydeequ

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None)]).toDF()

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()

Expected behavior
I expect the tutorial in the readme.md to succeed, with this output:

+-------+--------+------------+-----+
| entity|instance|        name|value|
+-------+--------+------------+-----+
|Dataset|       *|        Size|  3.0|
| Column|       b|Completeness|  1.0|
+-------+--------+------------+-----+

Colab environment accessed via Chrome browser:
spark.version = 2.4.7
Python 3.7.10
!java -version =
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)

If PyDeequ is not supported on Colab because it is a Google platform, please just close this issue.

MikeFreiberger commented 3 years ago

I got it to work.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://www-us.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pydeequ

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark.sql import SparkSession, Row
import pydeequ

spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()

df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None)]).toDF()

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()

Starignus commented 3 years ago

I followed your solution above, but I still get TypeError: 'JavaPackage' object is not callable when running the analysisResult cell.

MikeFreiberger commented 3 years ago

Sorry, I just double-checked my Colab NB and found this section that I hadn't posted. Given your error, this bit of code might do the trick:

!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version

MikeFreiberger commented 3 years ago

I just checked my Colab NB, and I had to change these lines too to get it to work, since Spark 2.4.0 is the last version I could find that supports Scala 2.11:

!wget -q https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"

I also had to add:

os.environ["SPARK_VERSION"] = "2.4.0"
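Side note for anyone debugging this: a quick way to see which Deequ artifact pydeequ will request once SPARK_VERSION is set is a sketch like the one below. This assumes a pydeequ release that derives its Maven coordinate from that variable; the exact coordinates printed depend on your installed pydeequ.

import os
os.environ["SPARK_VERSION"] = "2.4.0"  # set before importing pydeequ

import pydeequ
# Both attributes are used elsewhere in this thread: the first is the Deequ
# jar pulled via spark.jars.packages, the second the f2j artifact to exclude.
print(pydeequ.deequ_maven_coord)
print(pydeequ.f2j_maven_coord)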

Starignus commented 3 years ago

I still got the same error when running with the changes you suggested. You can see the notebook here https://colab.research.google.com/drive/1L_ASFpu28KVqjrAt-uH2oc-we3D_2SR8?usp=sharing

MikeFreiberger commented 3 years ago

OK, I got it: you need to update the cell that sets up the Spark session in your shared NB. SORRY, another undocumented thing I did:

spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config('spark.ui.port', '4050') \
    .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
    .getOrCreate()
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '8g'), ('spark.driver.memory', '8g')])
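For context: the spark.jars.packages / spark.jars.excludes lines are what actually pull the Deequ jar onto the JVM classpath, which is why they resolve the 'JavaPackage' object is not callable error. A minimal smoke test, assuming the session above was created successfully (smoke_df is just an illustrative throwaway frame):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size

# Tiny throwaway DataFrame: if the Deequ jar loaded, Size() computes fine.
smoke_df = spark.createDataFrame([(1,), (2,)], ["x"])
smoke_result = AnalysisRunner(spark).onData(smoke_df).addAnalyzer(Size()).run()
AnalyzerContext.successMetricsAsDataFrame(spark, smoke_result).show()
# Expected: a single row with entity=Dataset, name=Size, value=2.0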

Starignus commented 3 years ago

Now it is working, thanks for the help! I added the examples from the repo's README to my Colab notebook. Here is the full code needed to get PyDeequ working in a Colab notebook.

# Updating system
!apt-get update

# Installing Java
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version

# Downloading Spark 2.4.0 (pre-built for Hadoop 2.7)
!wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark

# Install Pydeequ
!pip install pydeequ

# Setting env variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"
os.environ["SPARK_VERSION"] = "2.4.0"

import findspark
findspark.init()

# Importing Spark and PyDeequ
from pyspark.sql import SparkSession, Row
import pydeequ

# Creating Spark session
spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config('spark.ui.port', '4050') \
    .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
    .getOrCreate()
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '8g'), ('spark.driver.memory', '8g')])
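For completeness, the next cell can run the analyzer example from the README (the same code quoted earlier in this issue) to confirm the setup works:

# Running the README analyzer example to verify the setup
df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None)]).toDF()

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()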