awslabs / python-deequ

Python API for Deequ
Apache License 2.0

chore: upgrade Spark to 3.4, and Deequ to 2.0.5 #168

Closed chenliu0831 closed 4 months ago

chenliu0831 commented 8 months ago

Issue #, if available: https://github.com/awslabs/python-deequ/issues/151

Description of changes:

Upgrade Spark to 3.4 and Deequ to 2.0.5

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

chenliu0831 commented 8 months ago

The tests fail because the new version of Deequ introduced some new optional parameters (e.g. analyzerOptions). Today the Python land cannot leverage the default parameter values defined in Scala land, so the call throws an error.

If the interface has to be an exact match, the code will bifurcate, since older versions of Deequ won't have this parameter. For example, the change below fixes the issue for Deequ 2.0.5 but breaks Deequ <2.0.5:

-        self._Check = self._Check.hasMaxLength(column, assertion_func, hint)
+        analyzer_options = self._jvm.scala.Option.apply(None)
+        self._Check = self._Check.hasMaxLength(column, assertion_func, hint, analyzer_options)

Test failures:

E  py4j.protocol.Py4JError: An error occurred while calling o86.hasMaxLength. Trace:
E py4j.Py4JException: Method hasMaxLength([class java.lang.String, class com.sun.proxy.$Proxy35, class scala.Some, class scala.None$]) does not exist
E at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:321)
E at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:329)
E at py4j.Gateway.invoke(Gateway.java:274)
E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E at py4j.commands.CallCommand.execute(CallCommand.java:79)
E at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E at java.base/java.lang.Thread.run(Thread.java:829)

Edit: it seems hard to call Scala defaults even from Java, so we might have to define additional methods in Scala land without those default arguments.
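
One way to picture the bifurcation described above is a version gate on the Python side that appends the extra argument only when the installed Deequ expects it. The following is a minimal sketch, not PyDeequ's actual code: the helper names `parse_deequ_version` and `has_max_length_args`, the `min_version` parameter, and the `2.0.5` cutoff are all assumptions made for illustration, and the exact Deequ release that added the parameter should be checked against the Deequ sources.

```python
from typing import Any, List, Tuple


def parse_deequ_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as '2.0.5-spark-3.4' into (2, 0, 5).

    Hypothetical helper: everything after the first '-' is ignored.
    """
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def has_max_length_args(
    column: str,
    assertion_func: Any,
    hint: Any,
    deequ_version: str,
    none_option: Any,
    min_version: Tuple[int, ...] = (2, 0, 5),  # assumed cutoff, not confirmed
) -> List[Any]:
    """Build the positional arguments for Check.hasMaxLength.

    Older Deequ versions take three arguments; newer ones take a fourth
    (the analyzer options), which py4j requires to be passed explicitly
    because Scala default values are not applied for reflective callers.
    """
    args = [column, assertion_func, hint]
    if parse_deequ_version(deequ_version) >= min_version:
        args.append(none_option)
    return args
```

Inside the wrapper, the call would then look like `self._Check.hasMaxLength(*has_max_length_args(column, assertion_func, hint, version, self._jvm.scala.Option.apply(None)))`, where `version` is whatever Deequ version string the session was started with. The cost is that every affected method needs the same gate, which is why a fix in Scala land may be preferable.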

chenliu0831 commented 8 months ago

We have been busy with re:invent - expect some progress in Dec.

repcaks commented 7 months ago

Hi, is there a potential date for supporting Spark 3.4? :) Is it more likely December/January, or even later?

machadoluiz commented 6 months ago

Hello, @chenliu0831! Is there an expected date for supporting this Spark version? Or maybe 3.5?

katiesandford commented 6 months ago

Hi, is there any update on this, please?

anqini commented 6 months ago

Hi all, I created a new pull request to accommodate Spark 3.4 and Deequ versions later than 2.0.3. Feel free to take a look: https://github.com/awslabs/python-deequ/pull/178

chenliu0831 commented 6 months ago

@anqini thanks so much for looking into this and submitting the PR. Unfortunately, we cannot drop support for older Spark/Deequ versions yet. I will take a closer look at #178.

All - we have discussed this with the Deequ team and we will be working on a longer-term solution, including a support plan for older Spark versions. There are no ETAs yet (some plan in Jan), but the good news is that we merged the maintainer groups from both repos. This weekend I will look into whether we can have a safe short-term solution in PyDeequ only.

dudumottavasconcelos commented 5 months ago

Hi, @chenliu0831! Any news on this upgrade?

MatheusXCH commented 5 months ago

Hello all! Any news on version upgrade?

katiesandford commented 4 months ago

Hi. Is there any update on this please?

chenliu0831 commented 4 months ago

Closing this for now; see my comments in https://github.com/awslabs/python-deequ/issues/192#issuecomment-1972385951. We can provide updates there.