awslabs / python-deequ

Python API for Deequ
Apache License 2.0

TypeError: 'JavaPackage' object is not callable when running pydeequ #1

Closed. NastasiaSaby closed this issue 3 years ago.

NastasiaSaby commented 3 years ago

Describe the bug I've got an exception when I try to run pydeequ: "TypeError: 'JavaPackage' object is not callable".

To Reproduce Steps to reproduce the behavior:

  1. pip install pydeequ==0.1.5
  2. Code:
from pyspark.sql import SparkSession, Row
import pydeequ

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
            Row(a="foo", b=1, c=5),
            Row(a="bar", b=2, c=6),
            Row(a="baz", b=3, c=None)]).toDF()

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("b")) \
                    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
  3. Execute the code above
  4. See error: TypeError: 'JavaPackage' object is not callable

Expected behavior I was expecting the results of the analyzer.

Screenshots A screenshot of the error is attached.


Additional context I'm running it on a Databricks cluster.

Thank you for your help.

cghyzel commented 3 years ago

@NastasiaSaby The problem that jumps out at me is that you are using Spark 3.0, which we don't support yet. The latest Spark version we support is 2.4.7.

We can't support Spark 3.0 until Deequ compiles against Scala 2.12 (which Spark 3.0 requires). This is an open issue on the Deequ side as well. Notably, the person who opened that issue hit the same error as you.
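
If you are unsure which versions your session is actually running, here is a quick hedged check. The Scala probe goes through py4j internals, so treat it as a diagnostic rather than a stable API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # Spark version, e.g. 2.4.7
# Scala version via the JVM gateway (py4j internal, diagnostic only)
print(spark.sparkContext._jvm.scala.util.Properties.versionString())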

NastasiaSaby commented 3 years ago

Thank you for your answer @cghyzel. I downgraded my cluster to Spark 2.4.5, Scala 2.11. Unfortunately, I'm still getting the exact same error.

NastasiaSaby commented 3 years ago

I'm using the Databricks platform. Is there a known limitation?

Mimetis commented 3 years ago

I get the same issue

NastasiaSaby commented 3 years ago

This is not linked to pyspark==3.0.1. I've got the issue with Spark 2.4.5 / Scala 2.11.

gucciwang commented 3 years ago

We have not tested with Databricks yet, but here is how you'd get started with an Amazon EMR cluster -- I presume there may be some overlap! Copied and pasted below:

Your EMR cluster must be running Spark v2.4.6 in order to work with PyDeequ. Once you have a running cluster with those components and a SageMaker notebook with the necessary permissions, you can configure a SparkSession object from the template below to connect to your cluster. If you need a refresher on how to connect a SageMaker notebook to EMR, check out this AWS blog post on using Sparkmagic.

Once you’re in the SageMaker Notebook, run the following JSON in a cell before you start your SparkSession to configure your EMR cluster.

%%configure -f
{ "conf":{
          "spark.pyspark.python": "python3",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
          "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
          "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
         }
}

Start your SparkSession object in a cell after the above configuration by running spark, then use the SparkContext (named sc by default) to install PyDeequ onto your cluster like so:

sc.install_pypi_package('pydeequ')
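
A hedged sanity check after the install, using the deequ_maven_coord attribute that appears throughout this thread:

import pydeequ
# should print a coordinate like com.amazon.deequ:deequ:1.0.3
print(pydeequ.deequ_maven_coord)
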
bballamudi commented 3 years ago

I'm trying to run the example from the tutorials folder on my local machine. I'm getting the same error as the OP.

Spark version: 2.4.7
Scala: 2.11.12
pydeequ==0.1.5
WSL Ubuntu 20.18

Please see the attached screenshot of the error.

MOHACGCG commented 3 years ago

Same issue on Spark 2.4.3. I'm using 2.4.3 hoping to load PyDeequ into Glue ETL. Do you know if Deequ is compatible with Glue v2?

MOHACGCG commented 3 years ago

Same issue on Spark 2.4.3. I'm using 2.4.3 hoping to load PyDeequ into Glue ETL. Do you know if Deequ is compatible with Glue v2?

Using pyspark --jars {PATH_TO_DEEQ_JAR} resolves this error for me; I think this should be added to the installation steps.
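
For anyone who can't pass that flag on the command line, a minimal sketch of the equivalent fix inside a script, assuming the jar has already been downloaded (the path below is a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    # spark.jars takes local jar paths; spark.jars.packages is for Maven coordinates
    .config("spark.jars", "/path/to/deequ-1.0.3.jar")
    .getOrCreate())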

bballamudi commented 3 years ago

@MOHACGCG I'm trying to find the path to the Deequ jar so I can try your solution. I'm looking under the virtual env path: /lib/Python3.7/site-packages/pyspark/jars. However, I don't see any jar corresponding to Deequ in there, so I'm considering downloading and copying the Deequ jar under this path, followed by your command referring to this location. Do you think there is a better alternative?

bballamudi commented 3 years ago

@MOHACGCG The earlier attempt did not work. Also, I think the fourth statement in the code below should do the trick by default in adding the jar files, because pydeequ.deequ_maven_coord evaluates to 'com.amazon.deequ:deequ:1.0.3'. Could you shed some light on this, @gucciwang?

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

MOHACGCG commented 3 years ago

@MOHACGCG I'm trying to find the path to the Deequ jar so I can try your solution. I'm looking under the virtual env path: /lib/Python3.7/site-packages/pyspark/jars. However, I don't see any jar corresponding to Deequ in there, so I'm considering downloading and copying the Deequ jar under this path, followed by your command referring to this location. Do you think there is a better alternative?

I just downloaded the jar from here and passed it on.
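
For reference, a hedged sketch of what that download can look like from Maven Central (1.0.3 is an example version; match it to your Spark and Scala versions):

wget https://repo1.maven.org/maven2/com/amazon/deequ/deequ/1.0.3/deequ-1.0.3.jar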

gucciwang commented 3 years ago

@bballamudi, you are correct!

At the beginning of each of our tutorials, we always start off with the following configuration when setting up our SparkSession. This leverages the Maven coordinates of the Deequ jar and excludes an outdated Fortran jar. With this configuration, the SparkSession will automatically fetch the jars from Maven.

import pydeequ

import sagemaker_pyspark
from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

And if you're wondering why we also reference sagemaker_pyspark: it's just to allow our SparkSession to read parquet files from S3. You can see it in action in the PyDeequ Analyzers tutorial.
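
If you want to confirm that the Deequ jar actually reached the session, here is one hedged diagnostic; listJars() is Spark's internal Scala API reached through py4j, so treat it as a debugging aid rather than a stable interface:

# should include a deequ jar URL if spark.jars.packages resolved correctly
print(spark.sparkContext._jsc.sc().listJars())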

bballamudi commented 3 years ago

Thanks @gucciwang for the insight. However, it was not working automatically as intended; perhaps that was due to my setup. I therefore followed @MOHACGCG's instruction in the comment above, and it works now. Kindly make a note of this in the README for the benefit of a larger audience.

SerenaLin2020 commented 3 years ago

Experiencing the same issue. Solved by using pyspark --jars /path-to-the-jar/deequ-1.0.5.jar

More info:
Python version: 3.7.9
Spark version: 2.4.7
Scala version: 2.13.4

gucciwang commented 3 years ago

@SerenaLin2020 We have not tested PyDeequ with deequ-1.0.5.jar, so some functionalities may be impaired. Please try with deequ-1.0.3.jar and keep us updated! 😄

preshen-goobiah commented 3 years ago

I had the same issue running Deequ in a SageMaker PySpark Processing Job. Solved with @MOHACGCG's suggestion.

  1. Download the jar from Maven
  2. Pass to the SageMaker PySpark Processor
spark_processor.run(
    submit_app="./Code/pydeequ_example.py",
    submit_jars=["deequ-1.0.5.jar"],
    logs=False,
    wait=False
)
byronmamamoney commented 3 years ago

Hi, is Spark 3.1.1 supported? I'm currently on AWS emr-6.3.0. Running the commands below:

import pydeequ
pydeequ.deequ_maven_coord

returns: 'com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11'

I take this value (a string) and add it to my .json config, where all the custom Spark config is maintained. Running my application .py files, I get the same issue as the folks above. I assume it is due to the unsupported Spark version.

Is there a timeline for implementing this?

Cheers

SerenaLin2020 commented 3 years ago

@SerenaLin2020 We have not tested PyDeequ with deequ-1.0.5.jar, so some functionalities may be impaired. Please try with deequ-1.0.3.jar and keep us updated! 😄

Tested PyDeequ with deequ-1.0.3.jar; it works well for basic metrics such as min, max, mean, compliance, etc. The only thing I found is that the PySpark scripts run much slower than the Scala scripts (same logic and same data volume). Not sure if this is PySpark's issue or PyDeequ's issue.

vinura commented 3 years ago

@SerenaLin2020 Did you try this on a Databricks cluster?

SerenaLin2020 commented 3 years ago

@SerenaLin2020 Did you try this on a Databricks cluster?

No, I only ran this on an EMR cluster.

vinura commented 3 years ago

I got it working with Databricks by installing the jar directly on the cluster.

anusha610 commented 3 years ago

I got it working with Databricks by installing the jar directly on the cluster.

Hi Vinura. Could you please let me know how you did it? I want to install the jar directly on the cluster.

jinyang08 commented 3 years ago

@MOHACGCG @preshen-goobiah @bballamudi Can you explain how you downloaded the jar from Maven? I found myself lost trying to find a way to add the Deequ jar to the jars folder.

vinura commented 3 years ago

I installed the following Maven package directly instead of pydeequ.deequ_maven_coord:

com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12

You need to check whether they have an exact match for your cluster and add it as a Maven package on the Databricks cluster. @anusha610 if you are running it locally (using dbconnect), use the Spark object as follows:

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12")
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

alokpotadar commented 2 years ago

I installed the following Maven package directly instead of pydeequ.deequ_maven_coord:

com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12

You need to check whether they have an exact match for your cluster and add it as a Maven package on the Databricks cluster. @anusha610 if you are running it locally (using dbconnect), use the Spark object as follows:

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12")
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

Hi @vinura, I'm still getting the exact same error even after using your method. Please help. I'm using Spark 3.0.1 and Scala 2.12.

gbalachandra-takeda commented 2 years ago

@MOHACGCG @vinura - Could you please suggest the script changes for this fix?

Issue: I am facing a similar error using Databricks with the PyDeequ version below. Error: TypeError: 'JavaPackage' object is not callable

Python version: 3.7.9
PySpark version: 2.4.0
Scala version: 2.13.4
pydeequ version: 1.0.1

Tried: downloaded the suggested jars, uploaded them to the Databricks FileStore, and passed the path to the Spark session:

import pydeequ
import sagemaker_pyspark
from pyspark.sql import SparkSession, Row
classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", '/FileStore/jars/deequ_1_0_5.jar')
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
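
(A hedged aside on the config above: spark.jars.packages expects Maven coordinates, whereas a local file such as /FileStore/jars/deequ_1_0_5.jar would normally be passed via spark.jars, along these lines:)

spark = (SparkSession
    .builder
    # local jar paths go under spark.jars; spark.jars.packages is for Maven coordinates
    .config("spark.jars", "/FileStore/jars/deequ_1_0_5.jar")
    .getOrCreate())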

A screenshot of the error is attached (deequ_error).

Could you please suggest the appropriate version, steps, and scripts for Databricks implementations?

admin-excior commented 1 year ago

If anyone struggles with this, I found the following worked for me:

  1. Download the Deequ jar corresponding to the correct Spark version.
  2. In your spark-defaults.conf file, add the path of the jar file you downloaded in step 1 to the spark.driver.extraClassPath and spark.executor.extraClassPath lines, as in the sketch below.
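
A minimal sketch of those spark-defaults.conf entries, assuming the jar was saved to /opt/jars (the path and version are placeholders):

# both the driver and the executors need the Deequ classes on their classpath
spark.driver.extraClassPath   /opt/jars/deequ-2.0.1-spark-3.2.jar
spark.executor.extraClassPath /opt/jars/deequ-2.0.1-spark-3.2.jar
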
ethanwicker commented 1 year ago

I was able to combine some of the solutions above with a few others and get PyDeequ working on Azure Databricks in a notebook. Here are the details, in case they're helpful for anyone.

From this link, I downloaded the appropriate JAR file to match my Spark version. In my case, that was deequ_2_0_1_spark_3_2.jar. I then installed this file using the JAR type under Libraries in my cluster configuration.

The following then worked, run in different cells.

%pip install pydeequ

%sh export SPARK_VERSION=3.2.1

df = spark.read.load("abfss://container-name@account.dfs.core.windows.net/path/to/data")

from pyspark.sql import SparkSession

import pydeequ

spark = (SparkSession
    .builder
    .getOrCreate())
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("b")) \
                    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
mikelevvra commented 1 year ago

Thanks @gucciwang for the insight. However, it was not working automatically as intended; perhaps that was due to my setup. I therefore followed @MOHACGCG's instruction in the comment above, and it works now. Kindly make a note of this in the README for the benefit of a larger audience.

Hello, how did you solve it?

I am running Spark on a Hadoop cluster with the following config:

Spark 2.4.4
Scala 2.11.12

And I am creating the Spark session like this:

spark = (SparkSession
    .builder
    .appName("PyDeeQu")
    .config("spark.jars", "PATH_TO/deequ-1.1.0_spark-2.4-scala-2.11.jar")
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .config("spark.sql.execution.arrow.enabled", "true")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate())

I keep getting this error: TypeError: 'JavaPackage' object is not callable, and I don't know why...

vinura commented 1 year ago

It works fine with the following configuration.

Use https://mvnrepository.com/artifact/com.amazon.deequ/deequ to pick the Deequ version and Spark version for spark.jars.packages:

from pyspark.sql import SparkSession, DataFrame
import pydeequ

def create_spark():
    """Function to get Spark Configuration"""
    spark = (
        SparkSession.builder.config(
            "spark.jars.packages", "com.amazon.deequ:deequ:2.0.1-spark-3.2"
        )
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )
    return spark

Also, if you are using Databricks, make sure you install this to the cluster libraries as a Maven package.


Bhavanabuddy commented 1 year ago

@vinura, I followed the above. However, I had to add two lines of code to set the Spark version in Azure Databricks:

from pyspark.sql import SparkSession, DataFrame
import os
os.environ['SPARK_VERSION'] = '3.2'
import pydeequ

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "com.amazon.deequ:deequ:2.0.1-spark-3.2")
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

The Spark session below also worked for 3.2:

from pyspark.sql import SparkSession, DataFrame
import os
os.environ['SPARK_VERSION'] = '3.2'
import pydeequ

spark2 = SparkSession.builder.appName('xyz').getOrCreate()

vinura commented 1 year ago

@Bhavanabuddy I added the Spark version when using Databricks Connect (IDE/PyCharm) but not when using Databricks notebooks.

Either way, I hope this fixes all the problems.