awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Lambda arguments are not recognized (Sagemaker) #54

Open ml6cz opened 3 years ago

ml6cz commented 3 years ago

Describe the bug
When a lambda function is passed as the assertion for hasSize, hasMin, or hasMax, the verification fails with a "Can't execute the assertion" error.

To Reproduce
Steps to reproduce the behavior: use any constraint that takes a lambda assertion.

I used the one listed in the GitHub tutorials:

from pyspark.sql import SparkSession, Row, DataFrame
import json
import pandas as pd
import sagemaker_pyspark
from pydeequ.checks import *
from pydeequ.verification import *
import pydeequ

classpath = ":".join(sagemaker_pyspark.classpath_jars())

# Build the session first so that `spark` exists before reading the data.
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.read.parquet("s3a://amazon-reviews-pds/parquet/product_category=Electronics/")
df.printSchema()

check = Check(spark, CheckLevel.Warning, "Amazon Electronic Products Reviews")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3000000) \
        .hasMin("star_rating", lambda x: x == 1.0) \
        .hasMax("star_rating", lambda x: x == 5.0)) \
    .run()

print(f"Verification Run Status: {checkResult.status}")
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult, pandas=True)
checkResult_df

Expected behavior
The resulting table should contain all "Success" values. [screenshot of expected results]

Screenshots
[screenshot of the actual verification results]

The major issue is this value in the results dataframe: "Can't execute the assertion: An exception was raised by the Python Proxy. Return Message: null!"
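For background on that error: PyDeequ hands a Python lambda to the JVM as a Py4J callback proxy implementing scala.Function1, so the Scala side calls back into the Python process each time the assertion is evaluated. Below is a minimal sketch of that Py4J pattern, loosely based on pydeequ's scala_utils; the names are illustrative, not pydeequ's exact source.

# Sketch of the Py4J callback pattern PyDeequ uses for lambda assertions.
# The instance is handed to the JVM as a scala.Function1; when Deequ
# evaluates the constraint it calls apply() back in the Python process.
class ScalaFunction1:
    def __init__(self, gateway, lambda_function):
        self.gateway = gateway                  # Py4J gateway for the session
        self.lambda_function = lambda_function  # the user's assertion

    def apply(self, arg):
        # Runs in the *local Python process*. If the JVM driver is remote
        # (e.g. behind Livy/SparkMagic), it has no route back to this
        # process, and the failed call surfaces as "An exception was
        # raised by the Python Proxy. Return Message: null".
        return self.lambda_function(arg)

    class Java:
        implements = ["scala.Function1"]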

jaoanan1126 commented 3 years ago

Hi @ml6cz could you tell me more about your development environment?

meganwlin commented 3 years ago

@jaoanan1126 I am using a PySpark (SparkMagic) kernel inside a SageMaker notebook.
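One way to separate a PyDeequ bug from a SparkMagic/Livy limitation is to run the same lambda assertions in a plain local PySpark session, where the driver and the Python kernel share one process and the Py4J callback can always connect. A minimal sketch, assuming pyspark and pydeequ are installed locally; the tiny DataFrame here is a made-up stand-in so the check runs without S3 access:

# Note: some pydeequ versions require the SPARK_VERSION environment
# variable to be set before pydeequ is imported.
import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Local session: driver and Python run in the same process, so the
# Py4J callback server is reachable.
spark = (SparkSession.builder
    .master("local[*]")
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

# Stand-in dataset with the same column the original check uses.
df = spark.createDataFrame([(1.0,), (3.0,), (5.0,)], ["star_rating"])

check = (Check(spark, CheckLevel.Warning, "local lambda smoke test")
         .hasSize(lambda x: x >= 3)
         .hasMin("star_rating", lambda x: x == 1.0)
         .hasMax("star_rating", lambda x: x == 5.0))

result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)

If this passes locally but fails under SparkMagic, the problem is likely the remote driver being unable to reach back into the notebook's Python process, rather than the assertions themselves.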