Closed peixeirodata closed 2 months ago
Indeed a bug; as a workaround you can pass an actual hint, e.g.
Hi, giving a hint also gives me the same error.
Example:
checkResult = VerificationSuite(spark).onData(df).addCheck(check.hasMaxLength('JobID', lambda x: x == 18, "Column JobID has 18 characters max")).run()
@peixeirodata is it working for you?
I'm running locally with the same environment and it worked fine - this may have something to do with the Databricks runtime. See the code below:
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

data = [(1, 'Alice', 34), (2, 'Bob', 36), (3, 'Charlie', 30)]
df = spark.createDataFrame(data, ['id', 'name', 'age'])

check_length = Check(spark, CheckLevel.Warning, "Check if the column length")
checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(check_length.hasMaxLength(column='name', assertion=lambda x: x >= 3))
    .run())
check_result_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
check_result_df.show()
Output:
+--------------------+-----------+------------+--------------------+-----------------+------------------+
| check|check_level|check_status| constraint|constraint_status|constraint_message|
+--------------------+-----------+------------+--------------------+-----------------+------------------+
|Check if the colu...| Warning| Success|MaxLengthConstrai...| Success| |
+--------------------+-----------+------------+--------------------+-----------------+------------------+
Hey guys, I just tested here again, first in the same Databricks environment and then locally with JupyterLab. The same error was raised on both 😔
@peixeirodata could you share details on how the JupyterLab environment is set up? Perhaps provide some reproducible Conda-like steps.
My local environment is set up like https://github.com/awslabs/python-deequ/blob/master/.github/workflows/base.yml#L35-L39
In my case:
1) I run the code in the pyspark shell on an EMR cluster with Spark 3.1 (pyspark --jars deequ-2.0.0-spark-3.1.jar) and hasMaxLength works fine.
2) But when I use PySpark 3.3 on a different cluster with Spark 3.3 (pyspark --jars deequ-2.0.4-spark-3.3.jar), I get the "method does not exist" error.
In both cases the only differences are the jar and the Spark version (3.1 vs. 3.3); the Python version (3.7.16) and pydeequ version (1.1.1) are the same.
This probably has something to do with an API change on the Deequ side between versions. Generally, Scala default parameters can be hard to deal with from Python; see https://github.com/awslabs/python-deequ/pull/168#issuecomment-1780329147. I will discuss internally to find the best path forward.
Note the error message is not clear, thanks to the way the Python-JVM bridge works - it basically means the interfaces do not entirely match.
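To make that concrete: py4j resolves a JVM method by name and the exact parameter types sent from Python, so an extra optional parameter on the Scala side makes the old three-argument signature disappear. Here is a rough pure-Python analogy - this is not py4j's actual code, and the hypothetical 2.0.4 signature shown is an assumption for illustration:

```python
def find_method(methods, name, arg_types):
    # Mimic a JVM-reflection lookup: a method must exist whose name and
    # parameter list match the call exactly, or the lookup fails.
    for m_name, m_types in methods:
        if m_name == name and m_types == arg_types:
            return m_name, m_types
    raise RuntimeError(f"Method {name}({list(arg_types)}) does not exist")

# Deequ 2.0.3 exposes hasMaxLength(String, Function1, Option):
deequ_203 = [("hasMaxLength", ("String", "Function1", "Option"))]
# Suppose 2.0.4 adds a new optional parameter, changing the compiled signature:
deequ_204 = [("hasMaxLength", ("String", "Function1", "Option", "Option"))]

# PyDeequ 1.1.1 still sends three arguments (column, assertion, hint):
call = ("hasMaxLength", ("String", "Function1", "Option"))

print(find_method(deequ_203, *call))  # resolves fine
try:
    find_method(deequ_204, *call)     # fails like the Py4JError in the report
except RuntimeError as e:
    print(e)
```

The real failure is the same shape: the Python side asks for a three-parameter `hasMaxLength` that no longer exists in the newer jar.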
What's the SPARK_VERSION variable you are setting in the failure cases? @Ashokgoa @peixeirodata
Sorry for my delay in replying, guys. @chenliu0831 I'll check if I can provide you a Conda file. About the variable:
SPARK_VERSION=3.3.2
Ok - I think I figured out why I cannot reproduce this. com.amazon.deequ:deequ:2.0.4-spark-3.3 wasn't a supported version yet due to a breaking Scala API change (an optional parameter introduced on the Scala side). This has caused multiple issues, and we have to discuss internally how to fix it in Scala land.
If possible, use deequ 2.0.3, which will work: https://github.com/awslabs/python-deequ/blob/master/pydeequ/configs.py#L8
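For context, pydeequ derives the deequ Maven coordinate from the SPARK_VERSION environment variable (the configs.py line linked above). A minimal sketch of that lookup follows; the mapping only lists the pairings mentioned in this thread and is illustrative, so check configs.py for the authoritative table:

```python
import os

# Illustrative Spark major.minor -> deequ coordinate mapping; only the
# pairings mentioned in this thread are shown (see pydeequ/configs.py
# for the real, complete table).
SPARK_TO_DEEQU_COORD = {
    "3.3": "com.amazon.deequ:deequ:2.0.3-spark-3.3",
    "3.1": "com.amazon.deequ:deequ:2.0.0-spark-3.1",
}

def deequ_maven_coord(spark_version: str) -> str:
    # Match on major.minor, so SPARK_VERSION=3.3.2 still selects the 3.3 jar.
    key = ".".join(spark_version.split(".")[:2])
    try:
        return SPARK_TO_DEEQU_COORD[key]
    except KeyError:
        raise RuntimeError(f"Unsupported Spark version: {spark_version}")

os.environ["SPARK_VERSION"] = "3.3.2"
print(deequ_maven_coord(os.environ["SPARK_VERSION"]))
# -> com.amazon.deequ:deequ:2.0.3-spark-3.3
```

Setting SPARK_VERSION=3.3.2 as reported above would therefore pull the Spark 3.3 deequ jar, which is why pinning deequ 2.0.3 for that line avoids the broken 2.0.4 signature.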
Sorry, I am off for a few days. Thanks @chenliu0831
This will be resolved in the next release, which will include https://github.com/awslabs/python-deequ/issues/169.
Describe the bug
It seems that the hasMaxLength method from the checks module has an implementation issue.
To Reproduce
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pydeequ/checks.py in hasMaxLength(self, column, assertion, hint)
    424         assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
    425         hint = self._jvm.scala.Option.apply(hint)
--> 426         self._Check = self._Check.hasMaxLength(column, assertion_func, hint)
    427         return self
    428

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    194     def deco(*a: Any, **kw: Any) -> Any:
    195         try:
--> 196             return f(*a, **kw)
    197         except Py4JJavaError as e:
    198             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    328                     format(target_id, ".", name), value)
    329             else:
--> 330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332                     format(target_id, ".", name, value))

Py4JError: An error occurred while calling o829.hasMaxLength. Trace:
py4j.Py4JException: Method hasMaxLength([class java.lang.String, class com.sun.proxy.$Proxy93, class scala.None$]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:341)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:349)
	at py4j.Gateway.invoke(Gateway.java:297)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)