Closed NastasiaSaby closed 3 years ago
@NastasiaSaby The problem that jumps out at me is that you are using Spark 3.0, which we don't support yet. The latest Spark version we support is 2.4.7.
We can't support Spark 3.0 until Deequ compiles against Scala 2.12 (which Spark 3.0 requires). This is an open issue on Deequ as well. Notably, the person who opened that issue hit the same error as you.
Thank you for your answer @cghyzel. I downgraded my cluster to Spark 2.4.5, Scala 2.11. Unfortunately, I'm still getting the exact same error.
I'm using the Databricks platform. Is there a known limitation?
I get the same issue
This is not linked to pyspark=3.0.1. I've got the issue with Spark 2.4.5 Scala 2.11.
We have not tested with Databricks yet, but here is how you'd get started with an Amazon EMR cluster -- I presume there is some overlap here! Copied and pasted below:
Your EMR cluster must be running Spark v2.4.6 in order to work with PyDeequ. Once you have a running cluster that has those components and a SageMaker notebook with the necessary permissions, you can configure a SparkSession object from the below template to connect to your cluster. If you need a refresher on how to connect a SageMaker Notebook to EMR, check out this AWS blogpost on using Sparkmagic.
Once you’re in the SageMaker Notebook, run the following JSON in a cell before you start your SparkSession to configure your EMR cluster.
```
%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
    "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
  }
}
```
Start your SparkSession object in a cell after the above configuration by running `spark`, then use the SparkContext (named `sc` by default) to install PyDeequ onto your cluster:

```python
sc.install_pypi_package('pydeequ')
```
I'm trying to run the example from the tutorials folder from my local machine. I'm getting the same error as the OP.
- Spark version: 2.4.7
- Scala 2.11.12
- pydeequ==0.1.5
- WSL Ubuntu 20.18
Please look at the image attached below
same issue on spark version 2.4.3. I'm using 2.4.3 hoping to load pydeequ to glue etl. Do you know if deequ is compatible with glue v2?
> same issue on spark version 2.4.3. I'm using 2.4.3 hoping to load pydeequ to glue etl. Do you know if deequ is compatible with glue v2?
Using `pyspark --jars {PATH_TO_DEEQ_JAR}` resolves this error for me. I think this should be added to the installation steps.
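If you'd rather set this up from Python than pass `--jars` on the command line, here is a minimal sketch of the equivalent SparkSession options. The helper name and jar path are illustrative, not part of PyDeequ:

```python
# Sketch: the same effect as `pyspark --jars {PATH_TO_DEEQ_JAR}`, expressed
# as SparkSession config options. The helper and the jar path are
# placeholders, not a PyDeequ API.
def deequ_local_jar_options(jar_path):
    """Options that put a locally downloaded Deequ jar on the classpath."""
    return {
        # ship the local jar itself instead of resolving Maven coordinates
        "spark.jars": jar_path,
        # keep excluding the outdated Fortran dependency, as elsewhere in
        # this thread
        "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all",
    }

# Usage (requires pyspark and a downloaded Deequ jar):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in deequ_local_jar_options("/path/to/deequ-1.0.3.jar").items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```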
@MOHACGCG I'm trying to find the path to the Deequ jar so I can try your solution. I'm looking under the virtual env path: /lib/Python3.7/site-packages/pyspark/jars. However, I don't see any jar corresponding to Deequ in there, so I'm considering downloading and copying the Deequ jar into this path and then running your command pointing at that location. Do you think there is a better alternative?
@MOHACGCG The earlier attempt did not work. Also, I think the fourth statement in the code below should add the jar files by default, because `pydeequ.deequ_maven_coord` evaluates to `com.amazon.deequ:deequ:1.0.3`. Could you shed some light on this @gucciwang?
```python
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
```
> @MOHACGCG I'm trying to find the path to the Deequ jar so I can try your solution. I'm looking under the virtual env path: /lib/Python3.7/site-packages/pyspark/jars. However, I don't see any jar corresponding to Deequ in there, so I'm considering downloading and copying the Deequ jar into this path and then running your command pointing at that location. Do you think there is a better alternative?
I just downloaded the jar from here and passed it on.
@bballamudi, you are correct!
At the beginning of each of our tutorials, we always start with the following configuration when setting up our SparkSession. This leverages the Maven coordinates of the Deequ jar and subsequently excludes an outdated Fortran jar. With this configuration, the SparkSession will automatically fetch the jars from Maven.
```python
import pydeequ
import sagemaker_pyspark

from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars())  # aws-specific jars

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
```
And if you're wondering why we also reference `sagemaker_pyspark`: it just allows our SparkSession to read Parquet files from S3. You can see it in action in the PyDeequ Analyzers tutorial.
Thanks @gucciwang for the insight. However, it was not working automatically as intended. Perhaps it had to do with my setup. I therefore followed @MOHACGCG's instruction in the above comment and it works now. Kindly note this in the README file in the interest of a larger audience.
Experiencing the same issue. Solved by using `pyspark --jars /path-to-the-jar/deequ-1.0.5.jar`.
More info:
- python version: 3.7.9
- spark version: 2.4.7
- scala version: 2.13.4
@SerenaLin2020 We have not tested PyDeequ with `deequ-1.0.5.jar`, so some functionality may be impaired. Please try with `deequ-1.0.3.jar` and keep us updated! 😄
Hi, is Spark 3.1.1 supported? I'm currently on AWS emr-6.3.0. Running:

```python
import pydeequ
pydeequ.deequ_maven_coord
```

returns `'com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11'`.

I take this value (string) and add it to my .json config where all the custom Spark config is maintained. Running my application .py files, I get the same issue as the folks above. I assume it is due to the lack of support for this Spark version.

Any timeline for implementing this?

Cheers
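The comment above illustrates the core pitfall: the Deequ Maven coordinate must match the cluster's Spark (and Scala) version, but `pydeequ.deequ_maven_coord` may return a Spark 2.4 build. A purely illustrative sketch of that matching, using only the coordinates mentioned in this thread (check the Maven repository for the real, current list):

```python
# Illustrative only: map a Spark version to a compatible Deequ Maven
# coordinate. The table below contains just the coordinates mentioned in
# this thread; consult
# https://mvnrepository.com/artifact/com.amazon.deequ/deequ for the
# authoritative list.
DEEQU_COORDS = {
    "2.4": "com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11",
    "3.0": "com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12",
    "3.2": "com.amazon.deequ:deequ:2.0.1-spark-3.2",
}

def deequ_coord_for(spark_version):
    """Return a Deequ coordinate for a 'major.minor.patch' Spark version."""
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return DEEQU_COORDS[major_minor]
    except KeyError:
        raise ValueError(f"No known Deequ build for Spark {spark_version}")
```

Passing the resulting coordinate to `spark.jars.packages` (instead of `pydeequ.deequ_maven_coord`) is what several later comments in this thread do by hand.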
> @SerenaLin2020 We have not tested PyDeequ with `deequ-1.0.5.jar`, so some functionality may be impaired. Please try with `deequ-1.0.3.jar` and keep us updated! 😄
Tested PyDeequ with `deequ-1.0.3.jar`; it works pretty well with some basic metrics such as min, max, mean, compliance, etc. I only found that PySpark scripts run much slower than Scala scripts (same logic and same data volume). Not sure if this is PySpark's issue or PyDeequ's issue.
@SerenaLin2020 Did you try this on a Databricks cluster?
> @SerenaLin2020 Did you try this on a Databricks cluster?
No, I only ran this on an EMR cluster.
I got it working with Databricks by directly installing the jar to the cluster.
> I got it working with Databricks by directly installing the jar to the cluster.
Hi Vinura. Could you please let me know how you did it? I'd like to directly install the jar to the cluster.
@MOHACGCG @preshen-goobiah @bballamudi Can you explain how you downloaded the jar from Maven? I'm lost trying to find a way to add the Deequ jar to the jars folder.
I installed the following Maven package directly instead of pydeequ.deequ_maven_coord:

com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12

You need to check whether they have an exact match for your cluster and add it as a Maven package on the Databricks cluster. @anusha610 if you are running it locally (using dbconnect), use the spark object as follows:

```python
spark = (SparkSession
    .builder
    .config("spark.jars.packages", 'com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12')
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
```
> I installed the following Maven package directly instead of pydeequ.deequ_maven_coord:
>
> com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12
>
> You need to check whether they have an exact match for your cluster and add it as a Maven package on the Databricks cluster. @anusha610 if you are running it locally (using dbconnect), use the spark object as follows:
>
> ```python
> spark = (SparkSession
>     .builder
>     .config("spark.jars.packages", 'com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12')
>     .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
>     .getOrCreate())
> ```
Hi @vinura, I'm still getting the exact same error even after using your method. Please help. I use Spark 3.0.1 and Scala 2.12.
@MOHACGCG @vinura - Could you please suggest the script changes for this fix?
Issue: I'm facing a similar error using Databricks with the pydeequ version below.
Error: `TypeError: 'JavaPackage' object is not callable`

- python version: 3.7.9
- pyspark version: 2.4.0
- scala version: 2.13.4
- **pydeequ-1.0.1** version

Tried: I downloaded the suggested jars, uploaded them to the Databricks FileStore, and passed the path to the Spark session:

```python
import pydeequ
import sagemaker_pyspark

from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars())  # aws-specific jars

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", '/FileStore/jars/deequ_1_0_5.jar')
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
```
Attached is a screenshot of the error. Could you please suggest the appropriate version, steps, and scripts for Databricks implementations?
If anyone struggles with this, I found the following worked for me. I was able to combine some of the above solutions with others and get pydeequ working on Azure Databricks in a notebook. Here are the details, in case they're helpful for anyone.
From this link, I downloaded the appropriate JAR file to match my Spark version. In my case, that was `deequ_2_0_1_spark_3_2.jar`. I then installed this file using the JAR type under Libraries in my cluster configuration.
The following then worked, run in different cells:

```
%pip install pydeequ
```

```
%sh export SPARK_VERSION=3.2.1
```

```python
df = spark.read.load("abfss://container-name@account.dfs.core.windows.net/path/to/data")
```

```python
from pyspark.sql import SparkSession
import pydeequ

spark = (SparkSession
    .builder
    .getOrCreate())

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
```
> Thanks @gucciwang for the insight. However, it was not working automatically as intended. Perhaps it had to do with my setup. I therefore followed @MOHACGCG's instruction in the above comment and it works now. Kindly note this in the README file in the interest of a larger audience.
Hello, how did you solve it?
I am running spark in a hadoop cluster, with the following config:
Spark 2.4.4, Scala 2.11.12. And I am creating the Spark session like this:

```python
spark = (SparkSession
    .builder
    .appName("PyDeeQu")
    .config("spark.jars", "PATH_TO/deequ-1.1.0_spark-2.4-scala-2.11.jar")
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .config("spark.sql.execution.arrow.enabled", "true")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate())
```
I keep getting this error: `TypeError: 'JavaPackage' object is not callable`, and I don't know why...
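One way to see whether the Deequ jar actually reached the JVM: py4j resolves a class it cannot find to a `JavaPackage` object, which is not callable (hence the error in this thread), while a class that was found resolves to a callable `JavaClass`. A small diagnostic sketch along those lines; it relies on py4j's resolution behavior and the internal `_jvm` attribute, so treat it as a debugging aid rather than an API:

```python
def deequ_on_classpath(spark):
    """Best-effort check that Deequ is visible to this SparkSession's JVM.

    py4j resolves a missing class to a JavaPackage, which is not callable
    (exactly the "'JavaPackage' object is not callable" error); a class
    that was found resolves to a callable JavaClass. Note that `_jvm` is
    a Spark internal, so this is a debugging aid only.
    """
    candidate = spark.sparkContext._jvm.com.amazon.deequ.VerificationSuite
    return callable(candidate)
```

If this returns False, the jar or Maven coordinates never made it onto the cluster's classpath, and fixing the `spark.jars` / `spark.jars.packages` configuration (as in the comments above) is the place to start.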
It works fine with the following configuration. Use https://mvnrepository.com/artifact/com.amazon.deequ/deequ to pick the Deequ version matching your Spark version for `spark.jars.packages`:
```python
from pyspark.sql import SparkSession, DataFrame

import pydeequ

def create_spark():
    """Function to get Spark Configuration"""
    spark = (
        SparkSession.builder.config(
            "spark.jars.packages", "com.amazon.deequ:deequ:2.0.1-spark-3.2"
        )
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )
    return spark
```
Also, if you are using Databricks, make sure you install this package under the cluster's Libraries as a Maven package.
@vinura, I followed the above. However, I had to add two lines of code to set the Spark version in Azure Databricks:
```python
from pyspark.sql import SparkSession, DataFrame
import os
os.environ['SPARK_VERSION'] = '3.2'
import pydeequ

spark2 = SparkSession.builder.appName('xyz').getOrCreate()
```
@Bhavanabuddy I added the Spark version when using Databricks Connect (IDE/PyCharm) but not when using Databricks notebooks.
Either way, I hope this fixes all the problems.
**Describe the bug**
I've got an exception when I try to run pydeequ: `TypeError: 'JavaPackage' object is not callable`.

**To Reproduce**
Steps to reproduce the behavior:

**Expected behavior**
I was expecting the results of the analyzer.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**

**Additional context**
I'm running it on a Databricks cluster.
Thank you for your help.