awslabs / python-deequ

Python API for Deequ
Apache License 2.0

'JavaPackage' object is not callable error when trying in AWS GLUE spark job. #17

Closed knadigatla closed 3 years ago

knadigatla commented 3 years ago

We are trying to use python-deequ in a Glue Spark job with --additional-python-modules pydeequ==0.1.5, and the code I'm trying to execute is below:

import sys 
from awsglue.transforms import * 
from awsglue.utils import getResolvedOptions 
from pyspark.context import SparkContext 
from awsglue.context import GlueContext 
from awsglue.job import Job 

## @params: [JOB_NAME] 
args = getResolvedOptions(sys.argv, ['JOB_NAME']) 
sc = SparkContext() 
glueContext = GlueContext(sc) 
session = glueContext.spark_session 
job = Job(glueContext) 
job.init(args['JOB_NAME'], args) 

df_electronics = session.read.parquet("s3://SOME_S3_BUCKET/media/amazon_customer_reviews_data/")

print(df_electronics.printSchema())

s3_write_path = "s3://SOME_S3_BUCKET/67e761e7-a3e1-4443-a80d-ea8e38e3cff5/temp/simple_metrics_tutorial.json"

import pydeequ
from pydeequ.repository import *

repository = FileSystemMetricsRepository(session, s3_write_path)

key_tags = {'tag': 'general_electronics'}
resultKey = ResultKey(session, ResultKey.current_milli_time(), key_tags)
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(session) \
                    .onData(df_electronics) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("review_id")) \
                    .addAnalyzer(ApproxCountDistinct("review_id")) \
                    .addAnalyzer(Mean("star_rating")) \
                    .addAnalyzer(Distinctness("customer_id")) \
                    .addAnalyzer(Correlation("helpful_votes","total_votes")) \
                    .addAnalyzer(ApproxQuantile("star_rating",.5)) \
                    .useRepository(repository) \
                    .saveOrAppendResult(resultKey) \
                    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(session, analysisResult)
analysisResult_df.show()
job.commit()

ERROR LOGS:

Traceback (most recent call last):
  File "/tmp/test_deequ_sparkjob", line 27, in <module>
    repository = FileSystemMetricsRepository(session, s3_write_path)
  File "/home/spark/.local/lib/python3.7/site-packages/pydeequ/repository.py", line 138, in __init__
    self.repository = self.deequFSmetRep(self._jspark_session, path)
TypeError: 'JavaPackage' object is not callable

We are using Glue Version 2.0.
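For context on what this error means: py4j (which PySpark uses to bridge Python and the JVM) represents any dotted JVM name it cannot resolve to a loaded class as a `JavaPackage` placeholder, and calling that placeholder raises exactly this `TypeError`. So the traceback indicates the Deequ JAR is not on the driver's classpath, not a bug in the Python code itself. Below is a toy stand-in (pure Python, not the real py4j class) that reproduces the failure mode:

```python
# Illustration only: this mimics py4j's behavior when a JVM class is
# missing from the classpath. It is NOT the real py4j JavaPackage.
class JavaPackage:
    """Stand-in for py4j.java_gateway.JavaPackage: a dotted name that
    py4j could not resolve to a JVM class (e.g. because the Deequ JAR
    was never added to the Spark driver classpath)."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, item):
        # Unresolved attribute access just yields a deeper "package",
        # never an actual class.
        return JavaPackage(f"{self._name}.{item}")

    # No __call__ is defined, so invoking the placeholder as a
    # constructor raises: TypeError: 'JavaPackage' object is not callable


pkg = JavaPackage("com.amazon.deequ.repository.fs.FileSystemMetricsRepository")
try:
    pkg("jspark_session", "path")  # what pydeequ's repository wrapper does
except TypeError as e:
    print(e)  # 'JavaPackage' object is not callable
```

The fix is therefore about getting the Deequ JAR onto the JVM classpath, as the replies below describe.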

MOHACGCG commented 3 years ago

You can either add the Deequ JAR to the Spark JAR path or use Maven to download it as part of your Glue code (see one of the examples in the tutorials folder):

from pyspark.sql import SparkSession

import pydeequ

spark = (SparkSession
    .builder
    # classpath should point at a local copy of the Deequ JAR
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

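Since Glue 2.0 jobs may not perform runtime Maven resolution via `spark.jars.packages`, an alternative is to upload the Deequ JAR to S3 and attach it with the job's `--extra-jars` parameter. A sketch of that setup, in which the bucket, role, script path, job name, and Deequ version are all placeholders you would replace with your own:

```shell
# Sketch only: bucket, role, paths, and versions below are placeholders.
# 1) Upload a Deequ JAR matching your Spark version to S3.
aws s3 cp deequ-1.0.3.jar s3://YOUR_BUCKET/jars/deequ-1.0.3.jar

# 2) Create the Glue job with the JAR on the classpath and pydeequ
#    installed as an additional Python module.
aws glue create-job \
  --name my-pydeequ-job \
  --role MyGlueServiceRole \
  --glue-version "2.0" \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://YOUR_BUCKET/scripts/test_deequ_sparkjob.py", "PythonVersion": "3"}' \
  --default-arguments '{"--extra-jars": "s3://YOUR_BUCKET/jars/deequ-1.0.3.jar", "--additional-python-modules": "pydeequ==0.1.5"}'
```

With the JAR delivered this way, the py4j lookup resolves to the real Deequ class and `FileSystemMetricsRepository` can be constructed.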
gucciwang commented 3 years ago

Apologies for the delay, but you can also follow this blog post we recently published on running PyDeequ on AWS Glue!

https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/