Open lzhong1 opened 3 years ago
I am also seeing this issue, using Spark 3.1, pydeequ version 1.0.1
Attempting to bump this as I'm seeing the same error when trying to use pydeequ with Spark 3.1.x.
Hi everyone. I have the same error, and it seems that something is wrong with the categorical suggestion rules. If I run
from pydeequ.suggestions import *

suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(CompleteIfCompleteRule())
    .addConstraintRule(NonNegativeNumbersRule())
    .addConstraintRule(RetainCompletenessRule())
    .addConstraintRule(RetainTypeRule())
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
    .run()
)
all is well. However, if I run
suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(CategoricalRangeRule())
    .run()
)
or
suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(FractionalCategoricalRangeRule())
    .run()
)
I get the same error described by @lzhong1, one for each rule that was run. Do you have any intention of working on this?
I am running the tests locally on an M1 Mac with PySpark 3.1.2 and Scala 2.12.10. My spark-submit looks like this:
spark-submit \
    --packages com.amazon.deequ:deequ:2.0.0-spark-3.1 \
    --exclude-packages net.sourceforge.f2j:arpack_combined_all \
    pydeequ_test.py
Many thanks!
+1
The problem comes from here: https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/suggestions/rules/CategoricalRangeRule.scala#L29
CategoricalRangeRule and FractionalCategoricalRangeRule are the only two rules whose constructors take arguments. And according to this:
https://github.com/awslabs/python-deequ/blob/master/pydeequ/suggestions.py#L169
the DEFAULT function creates the rule objects without passing any arguments. I think that if one constructs these two rules with their arguments supplied explicitly, it will work; a sketch of that idea follows below. However, I agree that this should be fixed properly.
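For anyone blocked on this, here is an untested sketch of a possible workaround. Scala compiles default constructor arguments into synthetic apply$default$N methods on the companion object, so the default categorySorter could be fetched through py4j and passed explicitly. The companion-object access pattern below is an assumption based on how py4j exposes Scala classes, not a confirmed pydeequ API.

# Untested sketch: build the Scala rule via py4j, supplying the default
# argument explicitly, since py4j cannot fill in Scala default parameters.
jvm = spark.sparkContext._jvm

# Companion object of the Scala case class (MODULE$ is the singleton instance).
companion = getattr(
    jvm.com.amazon.deequ.suggestions.rules, "CategoricalRangeRule$"
).MODULE$

# Scala encodes the default value of the first apply() parameter as the
# synthetic method apply$default$1 on the companion object.
default_sorter = getattr(companion, "apply$default$1")()

# Instantiate the rule with the sorter passed explicitly.
jvm_rule = companion.apply(default_sorter)

Feeding jvm_rule into ConstraintSuggestionRunner would still mean bypassing the Python wrapper's rule_jvm property (the part that currently fails), which I have not verified.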
I'm trying to run the ConstraintSuggestionRunner with the latest version of pyDeequ that supports Spark 3.1. I encountered the following error when running this code:
Additional information:
import sagemaker_pyspark
from pyspark.sql import SparkSession

import pydeequ

classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", "deequ-2.0.0-spark-3.1.jar")  # this is where I changed it
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)
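For what it's worth, spark.jars.packages expects a Maven coordinate (group:artifact:version), not a jar file name, so the config above may fail to resolve the package. A minimal sketch using the coordinate from the spark-submit example earlier in this thread:

# Same session setup, but with a Maven coordinate instead of a jar file name.
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", "com.amazon.deequ:deequ:2.0.0-spark-3.1")
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)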