awslabs / python-deequ

Python API for Deequ

Checks.hasMaxLength "method doesn't exist" error #166

Closed peixeirodata closed 2 months ago

peixeirodata commented 8 months ago

**Describe the bug**
The hasMaxLength method from the checks module seems to have an implementation issue.

**To Reproduce**

  1. A demo spark dataframe was generated as follows:
    data = [(1, 'Alice', 34), (2, 'Bob', 36), (3, 'Charlie', 30)]
    df = spark.createDataFrame(data, ['id', 'name', 'age'])
  2. Pydeequ modules imported:
    from pydeequ.checks import *
    from pydeequ.verification import *
  3. Check class instantiated and VerificationSuite run:
    check_length = Check(spark, CheckLevel.Warning, "Check if the column length")

    checkResult = (
        VerificationSuite(spark)
        .onData(df)
        .addCheck(check_length.hasMaxLength(column = 'name', assertion = lambda x: x >= 3))
        .run())
  4. Main Py4JException raised:
    py4j.Py4JException: Method hasMaxLength([class java.lang.String, class com.sun.proxy.$Proxy93, class scala.None$]) does not exist
  5. Full traceback:
    
    ---------------------------------------------------------------------------
    Py4JError                                 Traceback (most recent call last)
    <command-2160774200945294> in <cell line: 7>()
      7     VerificationSuite(spark)
      8     .onData(df)
    ----> 9     .addCheck(check_length.hasMaxLength(column = 'name', assertion = lambda x: x >= 3))
     10     .run())
     11 

    /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydeequ/checks.py in hasMaxLength(self, column, assertion, hint)
        424         assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        425         hint = self._jvm.scala.Option.apply(hint)
    --> 426         self._Check = self._Check.hasMaxLength(column, assertion_func, hint)
        427         return self
        428

    /databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
       1319
       1320         answer = self.gateway_client.send_command(command)
    -> 1321         return_value = get_return_value(
       1322             answer, self.gateway_client, self.target_id, self.name)
       1323

    /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
        194     def deco(*a: Any, **kw: Any) -> Any:
        195         try:
    --> 196             return f(*a, **kw)
        197         except Py4JJavaError as e:
        198             converted = convert_exception(e.java_exception)

    /databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        328                     format(target_id, ".", name), value)
        329             else:
    --> 330                 raise Py4JError(
        331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
        332                     format(target_id, ".", name, value))

    Py4JError: An error occurred while calling o829.hasMaxLength. Trace:
    py4j.Py4JException: Method hasMaxLength([class java.lang.String, class com.sun.proxy.$Proxy93, class scala.None$]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:341)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:349)
        at py4j.Gateway.invoke(Gateway.java:297)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
        at java.lang.Thread.run(Thread.java:750)



**Expected behavior**
The expected result was simply a VerificationResult object.

**Additional context**
* environment: Databricks (AWS)
* pydeequ version: 1.1.1
* deequ version: com.amazon.deequ:deequ:2.0.4-spark-3.3
* databricks runtime: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)
chenliu0831 commented 8 months ago

Indeed a bug. As a workaround, you can give it an actual hint, e.g.

https://github.com/awslabs/python-deequ/blob/e74e974e739f24cccb827a6378cd50e97697a0e8/tests/test_checks.py#L987
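A minimal sketch of that workaround, assuming a running SparkSession named spark, the df from the report above, and an illustrative hint string:

    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite

    check_length = Check(spark, CheckLevel.Warning, "Check if the column length")

    checkResult = (
        VerificationSuite(spark)
        .onData(df)
        .addCheck(
            check_length.hasMaxLength(
                column='name',
                assertion=lambda x: x >= 3,
                # a non-None hint means the Scala side receives Some(...)
                # instead of scala.None$
                hint='name length check',
            )
        )
        .run()
    )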

Ashokgoa commented 8 months ago

Hi, giving a hint also gives me the same error.

Example:

    checkResult = VerificationSuite(spark).onData(df).addCheck(check.hasMaxLength('JobID', lambda x: x == 18, "Column JobID has 18 characters max")).run()

@peixeirodata is it working for you?

chenliu0831 commented 8 months ago

I'm running locally with the same environment and it worked fine - this may have something to do with the Databricks runtime. See the code below:

    data = [(1, 'Alice', 34), (2, 'Bob', 36), (3, 'Charlie', 30)]
    df = spark.createDataFrame(data, ['id', 'name', 'age'])

    check_length = Check(spark, CheckLevel.Warning, "Check if the column length")

    checkResult = (
        VerificationSuite(spark).onData(df).addCheck(check_length.hasMaxLength(column='name', assertion=lambda x: x >= 3)).run())

    check_result_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
    check_result_df.show()

Output:

+--------------------+-----------+------------+--------------------+-----------------+------------------+
|               check|check_level|check_status|          constraint|constraint_status|constraint_message|
+--------------------+-----------+------------+--------------------+-----------------+------------------+
|Check if the colu...|    Warning|     Success|MaxLengthConstrai...|          Success|                  |
+--------------------+-----------+------------+--------------------+-----------------+------------------+
peixeirodata commented 8 months ago

Hey guys, I just tested this again, first in the same Databricks environment and then locally with JupyterLab. The same error was raised in both. 😔

chenliu0831 commented 8 months ago

@peixeirodata could you share details of how the JupyterLab environment is set up? Perhaps provide some reproducible, Conda-like steps.

My local environment is set up like https://github.com/awslabs/python-deequ/blob/master/.github/workflows/base.yml#L35-L39

Ashokgoa commented 8 months ago

In my case:

1. When I run the code in the pyspark shell on an EMR cluster with Spark 3.1, hasMaxLength works fine: pyspark --jars deequ-2.0.0-spark-3.1.jar
2. But when I use PySpark 3.3 on a different cluster with Spark 3.3 (pyspark --jars deequ-2.0.4-spark-3.3.jar), I get the "method does not exist" error.

In both cases the only differences are the jar and the Spark version (3.1 vs. 3.3); the Python version (3.7.16) and the pydeequ version (1.1.1) are the same.

chenliu0831 commented 8 months ago

This probably has something to do with API changes on the Deequ side across versions. In general, Scala default parameters can be hard to deal with from Python; see https://github.com/awslabs/python-deequ/pull/168#issuecomment-1780329147. I will discuss internally to see the best path forward.

Note that the error message is unclear due to the way the Python-JVM bridge works; it basically means the interfaces do not entirely match.
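To make that concrete, here is an illustrative sketch of what happens under the hood. java_check is a hypothetical handle to the underlying JVM Check object; the values mirror what pydeequ's checks.py builds in the traceback above, and the 2.0.4 signature change is my reading of the linked discussion:

    from pydeequ.scala_utils import ScalaFunction1

    # pydeequ wraps the Python lambda and the hint before crossing the bridge
    assertion_func = ScalaFunction1(spark.sparkContext._gateway, lambda x: x >= 3)
    hint = spark._jvm.scala.Option.apply(None)  # arrives on the JVM as scala.None$

    # py4j resolves JVM methods by reflecting on the argument classes and arity.
    # Deequ 2.0.3: hasMaxLength(String, Function1, Option) -> a 3-arg call matches.
    # Deequ 2.0.4: an extra optional parameter was added on the Scala side, so no
    # 3-argument method exists and py4j raises
    # "Method hasMaxLength([...]) does not exist".
    java_check.hasMaxLength('name', assertion_func, hint)  # hypothetical call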

chenliu0831 commented 8 months ago

What's the SPARK_VERSION variable you are setting in the failure cases? @Ashokgoa @peixeirodata

peixeirodata commented 8 months ago

Sorry for my delay in replying, guys. @chenliu0831 I'll check if I can provide you with a Conda file. As for the variable:

SPARK_VERSION=3.3.2
chenliu0831 commented 8 months ago

Ok - I think I figured out why I cannot reproduce this: com.amazon.deequ:deequ:2.0.4-spark-3.3 is not a supported version yet, due to a breaking Scala API change (an optional parameter introduced on the Scala side). This has caused multiple issues, and we will have to discuss internally how to fix it in Scala land.

If possible, use Deequ 2.0.3, which will work: https://github.com/awslabs/python-deequ/blob/master/pydeequ/configs.py#L8
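For reference, a sketch of pinning it from Python, following the SparkSession pattern in the pydeequ README (assuming the 3.3 entry in the linked configs.py resolves to the 2.0.3 artifact):

    import os
    os.environ['SPARK_VERSION'] = '3.3'  # configs.py maps this to the matching Deequ coordinate

    from pyspark.sql import SparkSession
    import pydeequ  # import after setting SPARK_VERSION

    spark = (
        SparkSession.builder
        .config('spark.jars.packages', pydeequ.deequ_maven_coord)
        .config('spark.jars.excludes', pydeequ.f2j_maven_coord)
        .getOrCreate()
    )

On EMR or Databricks, the equivalent is attaching the deequ-2.0.3-spark-3.3 jar instead of the 2.0.4 one.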

Ashokgoa commented 8 months ago

Sorry, I am off for a few days. Thanks @chenliu0831

chenliu0831 commented 2 months ago

This will be resolved in the next release, which will include https://github.com/awslabs/python-deequ/issues/169.