awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Can't execute ConstraintSuggestionRunner: Constructor com.amazon.deequ.suggestions.rules.CategoricalRangeRule([]) does not exist #70

Open lzhong1 opened 3 years ago

lzhong1 commented 3 years ago

I'm trying to run the ConstraintSuggestionRunner with the latest version of pyDeequ that supports Spark 3.1. I encountered the following error when running this code:

from pydeequ.suggestions import *

suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()

print(json.dumps(suggestionResult, indent=2))
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<command-3959492557277844> in <module>
      1 from pydeequ.suggestions import *
      2 
----> 3 suggestionResult = ConstraintSuggestionRunner(spark) \
      4              .onData(df) \
      5              .addConstraintRule(DEFAULT()) \

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e8d3820-b76d-4b1b-ab52-2e507e080d3f/lib/python3.8/site-packages/pydeequ/suggestions.py in addConstraintRule(self, constraintRule)
     64             for rule in constraintRule_jvm:
     65                 rule._set_jvm(self._jvm)
---> 66                 rule_jvm = rule.rule_jvm
     67                 self._ConstraintSuggestionRunBuilder.addConstraintRule(rule_jvm)
     68 

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e8d3820-b76d-4b1b-ab52-2e507e080d3f/lib/python3.8/site-packages/pydeequ/suggestions.py in rule_jvm(self)
    184     @property
    185     def rule_jvm(self):
--> 186         return self._deequSuggestions.rules.CategoricalRangeRule()
    187 
    188 

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1566 
   1567         answer = self._gateway_client.send_command(command)
-> 1568         return_value = get_return_value(
   1569             answer, self._gateway_client, None, self._fqn)
   1570 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    108     def deco(*a, **kw):
    109         try:
--> 110             return f(*a, **kw)
    111         except py4j.protocol.Py4JJavaError as e:
    112             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    328                     format(target_id, ".", name), value)
    329             else:
--> 330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332                     format(target_id, ".", name, value))

Py4JError: An error occurred while calling None.com.amazon.deequ.suggestions.rules.CategoricalRangeRule. Trace:
py4j.Py4JException: Constructor com.amazon.deequ.suggestions.rules.CategoricalRangeRule([]) does not exist
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:219)
    at py4j.Gateway.invoke(Gateway.java:248)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

Additional information:

  1. I'm working with Spark 3.1. To match that version, I used the pyDeequ package as instructed here. Everything else is exactly as written in the tutorial:
    
    from pyspark.sql import SparkSession, Row, DataFrame
    import json
    import pandas as pd
    import sagemaker_pyspark

    import pydeequ

    classpath = ":".join(sagemaker_pyspark.classpath_jars())

    spark = (SparkSession
        .builder
        .config("spark.driver.extraClassPath", classpath)
        .config("spark.jars.packages", 'deequ-2.0.0-spark-3.1.jar')  # this is where I changed it
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate())


  2. I tried other functions such as VerificationSuite() and AnalysisRunner(); both work fine.

Is it because the version I'm running doesn't support this specific functionality on Spark 3.1 yet? I think I can avoid this by downgrading Spark, but I would rather not do that unless there is no other solution. Any insights would be really appreciated!

jpugliesi commented 2 years ago

I am also seeing this issue with Spark 3.1 and pydeequ version 1.0.1.

rghv404 commented 2 years ago

Attempting to bump this as I'm seeing the same error when trying to use pydeequ with Spark 3.1.x.

neylsoncrepalde commented 2 years ago

Hi everyone. I have the same error, and it seems that something is wrong with the categorical suggestion rules. If I run

from pydeequ.suggestions import *

suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(CompleteIfCompleteRule())
    .addConstraintRule(NonNegativeNumbersRule())
    .addConstraintRule(RetainCompletenessRule())
    .addConstraintRule(RetainTypeRule())
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
    .run()
)

all is well. However if I run

suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(CategoricalRangeRule())
    .run()
)

or

suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(FractionalCategoricalRangeRule())
    .run()
)

I get the same error described by @lzhong1, in each case naming the rule that was run. Are there any plans to work on this?
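Until the wrapper is fixed, the observation above suggests one possible stopgap: add the rules one by one and skip the two that fail. The following is only a minimal sketch. `add_rules_except` and `FAILING_RULE_NAMES` are hypothetical names, not part of pydeequ, and the helper assumes that `addConstraintRule` returns the builder, as the chained calls above imply.

```python
# Hypothetical helper, not part of pydeequ.
# The two rule names are the ones reported as failing in this thread.
FAILING_RULE_NAMES = frozenset(
    {"CategoricalRangeRule", "FractionalCategoricalRangeRule"}
)


def add_rules_except(builder, rules, skip_names=FAILING_RULE_NAMES):
    """Add each rule to the builder unless its class name is in skip_names.

    Returns the builder and the list of class names that were skipped.
    """
    skipped = []
    for rule in rules:
        name = type(rule).__name__
        if name in skip_names:
            skipped.append(name)
            continue
        # Assumption: addConstraintRule returns the builder (it is chained above).
        builder = builder.addConstraintRule(rule)
    return builder, skipped


# Intended use with pydeequ (not executed here, needs a Spark session and df):
#   builder, skipped = add_rules_except(
#       ConstraintSuggestionRunner(spark).onData(df),
#       [CompleteIfCompleteRule(), RetainTypeRule(), CategoricalRangeRule()],
#   )
#   suggestionResult = builder.run()
```

The skipped names are returned rather than silently dropped, so the caller can log which suggestions are missing from the result.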

I am running the tests locally on a Mac M1 with PySpark 3.1.2 and Scala 2.12.10. My spark-submit command looks like this:

spark-submit \
--packages com.amazon.deequ:deequ:2.0.0-spark-3.1 \
--exclude-packages net.sourceforge.f2j:arpack_combined_all \
pydeequ_test.py

Many thanks!

TiansuYu commented 2 years ago

+1

cryptexis commented 2 years ago

The problem comes from here: https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/suggestions/rules/CategoricalRangeRule.scala#L29

CategoricalRangeRule and FractionalCategoricalRangeRule are the only two rules whose constructors accept arguments.

And according to this: https://github.com/awslabs/python-deequ/blob/master/pydeequ/suggestions.py#L169 the DEFAULT function creates these objects without passing any arguments. I think it will work if one calls them separately and correctly. However, I agree that this should be fixed somehow.
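To make the mismatch concrete, here is a toy model in plain Python of a reflective constructor lookup that matches on the arguments supplied. It is not Py4J's actual implementation. The registry contents are assumptions for illustration (in particular, that CategoricalRangeRule declares a single one-parameter constructor), and `construct`, `DECLARED_ARITIES`, and `ConstructorNotFound` are hypothetical names.

```python
class ConstructorNotFound(Exception):
    """Stands in for py4j.Py4JException in this toy model."""


# Toy registry: class name -> parameter counts its constructors declare.
# Assumption: CategoricalRangeRule has only a one-parameter constructor.
DECLARED_ARITIES = {
    "RetainTypeRule": {0},
    "CategoricalRangeRule": {1},
}


def construct(class_name, *args):
    """Mimic a lookup that only succeeds if a constructor fits the given args."""
    if len(args) not in DECLARED_ARITIES[class_name]:
        # Same shape as the message in the traceback: "([])" is the empty
        # list of arguments that the caller supplied.
        raise ConstructorNotFound(
            f"Constructor {class_name}({list(args)}) does not exist"
        )
    return (class_name, args)


# A rule with a no-argument constructor resolves fine.
construct("RetainTypeRule")

# A zero-argument call against a one-parameter constructor fails.
try:
    construct("CategoricalRangeRule")
except ConstructorNotFound as err:
    print(err)
```

If the Scala parameter has a default value, that default would, as far as I understand, not appear to a reflective lookup as a separate no-argument constructor, which would explain why the zero-argument call from Python fails while the rules without constructor parameters work.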