awslabs / python-deequ

Python API for Deequ
Apache License 2.0

"TypeError: 'JavaPackage' object is not callable" while running any Deequ components #190

Closed alinesacchetti closed 7 months ago

alinesacchetti commented 8 months ago

Describe the bug Our organization is trying to use the PyDeequ library on Databricks, which runs Apache Spark 3.3.2. When we call any PyDeequ component (AnalysisRunner, ColumnProfilerRunner, ConstraintSuggestionRunner, Check), we get the error "TypeError: 'JavaPackage' object is not callable".

To Reproduce Steps to reproduce the behavior:

  1. Databricks cluster: 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12) using env vars:

and spark conf:

spark.driver.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"
spark.databricks.optimizer.adaptive.enabled true
spark.databricks.delta.preview.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.sources.partitionOverwriteMode dynamic
spark.sql.adaptive.skewJoin.enabled true
spark.databricks.unity.catalog.enable false
spark.sql.execution.arrow.enabled true
spark.executor.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"

  2. Install pydeequ==1.2.0
  3. Set up a PySpark session (https://pydeequ.readthedocs.io/en/latest/README.html#set-up-a-pyspark-session)
  4. See the error when you try to run any component (for example: https://pydeequ.readthedocs.io/en/latest/README.html#analyzers)

Expected behavior We expect to be able to validate our data by checking the latest version of it in a PySpark DataFrame.


chenliu0831 commented 8 months ago

Could you try this workaround? https://github.com/awslabs/python-deequ/issues/138#issuecomment-1611575546

alinesacchetti commented 8 months ago

Hi! I tried these commands, but I still have the same issue.

The weird part is that it doesn't matter which component I try to use; the error is always the same: "TypeError: 'JavaPackage' object is not callable"
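For context, py4j hands back a `JavaPackage` placeholder whenever a dotted name does not resolve to a class on the JVM classpath, and calling that placeholder raises this exact `TypeError`. That would explain why every component fails the same way. A minimal diagnostic sketch, assuming an active session named `spark`; the helper names are hypothetical and `_jvm` is a private PySpark attribute, so treat this as a debugging aid rather than supported API:

```python
def is_unresolved_jvm_name(obj) -> bool:
    """True if py4j returned a package placeholder instead of a class.

    Compares by type name so this module does not need to import py4j.
    """
    return type(obj).__name__ == "JavaPackage"


def deequ_on_classpath(spark) -> bool:
    """Check whether a known Deequ class resolves on the driver JVM.

    `spark` is an active SparkSession; `spark._jvm` is its py4j view of the JVM.
    """
    candidate = spark._jvm.com.amazon.deequ.VerificationSuite
    return not is_unresolved_jvm_name(candidate)
```

If `deequ_on_classpath(spark)` returns `False`, the Deequ jar is most likely not visible to the driver, whichever PyDeequ component is called.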

chenliu0831 commented 8 months ago

On Spark clusters, you will probably have better luck putting the Deequ jar on the Spark runtime's jar library path / classpath. We don't have a Databricks environment to test with, but you could probably follow this post: https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/. You can download the Deequ jar from https://mvnrepository.com/artifact/com.amazon.deequ/deequ/2.0.4-spark-3.3
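As a rough sketch of the suggestion above, the Maven link corresponds to the coordinate `com.amazon.deequ:deequ:2.0.4-spark-3.3`, which can be passed to Spark through the standard `spark.jars.packages` setting when a new session is created. The helper names and the one-entry version table below are hypothetical; only the Spark 3.3 pairing comes from this thread. On a cluster where a session is already running (as in a Databricks notebook), `getOrCreate()` returns the existing session and this setting may not take effect, so there the jar would need to be attached to the cluster itself.

```python
# Deequ build per Spark major.minor. Only the 3.3 entry is taken from the
# Maven link in this thread; other pairs are deliberately not listed.
DEEQU_VERSION_BY_SPARK = {"3.3": "2.0.4-spark-3.3"}


def deequ_maven_coordinate(spark_version: str) -> str:
    """Return the group:artifact:version coordinate for a Spark version string."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in DEEQU_VERSION_BY_SPARK:
        raise ValueError(f"No Deequ build recorded for Spark {major_minor}")
    return f"com.amazon.deequ:deequ:{DEEQU_VERSION_BY_SPARK[major_minor]}"


def build_session(spark_version: str):
    """Create a session that asks Spark to fetch the Deequ jar at startup."""
    # Imported here so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.jars.packages", deequ_maven_coordinate(spark_version))
        .getOrCreate()
    )
```

For the cluster in this report, `deequ_maven_coordinate("3.3.2")` gives the coordinate matching the jar linked above.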

chenliu0831 commented 7 months ago

Closing - feel free to re-open if you need more help.