awslabs / python-deequ

Python API for Deequ
Apache License 2.0

PyDeequ support to Apache Spark 3.4.0 (and ideally 3.5.0) #192

Open machadoluiz opened 4 months ago

machadoluiz commented 4 months ago

Is your feature request related to a problem? Please describe.

I'm currently facing issues with PyDeequ's lack of support for Apache Spark 3.4.0, which is impacting several projects in my organization that use PyDeequ as a data quality tool. The problem arises because our EMR clusters are required to support the latest releases, and since the release of emr-6.12.0, support for Apache Spark 3.3.x has been dropped.

Describe the solution you'd like

I would like PyDeequ to be updated to support Apache Spark 3.4.0 and, ideally, the most recent version, 3.5.0. I would also like to understand the requirements for this support: whether PyDeequ has any backward compatibility requirements, and whether all future PyDeequ versions must continue supporting every currently supported Spark and Deequ version, or whether there is scope for dropping support for some versions, as mentioned in #178.

Describe alternatives you've considered

As an alternative, we have considered migrating to Great Expectations because of its active maintenance and large community. However, we still prefer PyDeequ because it integrates seamlessly with our internal PySpark library, and moving to a new tool would require significant time and resources. Having PyDeequ support Apache Spark 3.4.0 and 3.5.0 would therefore be the most beneficial solution for us.

Additional context

It seems that Deequ already supports Apache Spark 3.4.0 (#505) and, most recently, 3.5.0 (#514).

chenliu0831 commented 4 months ago

We treat backward compatibility very seriously, as all AWS APIs and AWS-owned libraries do. Dropping support for EOL Spark versions can be an option, but it needs a bit more research.

I don't think it's very hard to fix https://github.com/awslabs/python-deequ/issues/169, but the change should be made in Deequ Scala land (adding overloaded functions with the old parameters). We currently do not have a date.

chenliu0831 commented 4 months ago

As a workaround, you can set the env var SPARK_VERSION=3.3, and to my knowledge most PyDeequ features should continue to work. Although unlikely, there might be runtime errors from any breaking changes between Spark versions 3.3 and 3.5.
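For anyone trying this workaround from Python, a minimal sketch follows. It assumes PyDeequ reads `SPARK_VERSION` when it is imported, so the variable has to be set before `import pydeequ`. The `try`/`except` is only there so the snippet also runs where PyDeequ is not installed.

```python
import os

# Assumption: PyDeequ resolves its Deequ artifact from SPARK_VERSION at import
# time, so the variable must be set before the import below.
os.environ["SPARK_VERSION"] = "3.3"

try:
    import pydeequ

    # deequ_maven_coord is the coordinate usually passed to spark.jars.packages.
    print("Deequ Maven coordinate:", pydeequ.deequ_maven_coord)
except ImportError:
    # PyDeequ (or PySpark) is not installed here; the env var is still set.
    print("pydeequ not importable; SPARK_VERSION =", os.environ["SPARK_VERSION"])
```

The same effect can be had by exporting the variable in the shell or cluster configuration before the Python process starts.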

LucasSchelkes-BA commented 3 months ago

I see. But is native support for higher Spark versions planned at all? If yes, when is it scheduled?

Joao-DEUS-DE commented 2 months ago

Is there a date for when this update might be expected? I am currently working on a project that uses PySpark 3.4.1 in Databricks and I would like to use PyDeequ.

carlacha commented 2 months ago

Hello! Just checking in to see if there's any news on when we might expect that new feature to drop. Any rough idea of a release date? I'm using this library in my project and need to upgrade to Spark 3.4 since we're on Databricks Runtime 13.3 LTS, and I would like to keep using it. Thanks!

hardiktalati commented 2 months ago

Hey guys, any plans to upgrade to Spark 3.4?

hardiktalati commented 2 months ago

@chenliu0831 do you have a release date for the Spark 3.5 upgrade?

chenliu0831 commented 2 months ago

I think we are getting very close: https://github.com/awslabs/python-deequ/pull/203 (only 2 test failures remain, down to a dependency issue).

hardiktalati commented 2 months ago

@chenliu0831 how is it looking, buddy? Can we expect a release this week? Also, are you doing it for both 3.4 and 3.5?

chenliu0831 commented 2 months ago

@hardiktalati the fix for the 2 failures would need a Deequ release, I think. Please be patient and I will post updates. I think it should solve both 3.4 & 3.5 and we may release them together.

carlacha commented 1 month ago

Hello! Any fresh news? I know it's complicated, and we have to be patient. I'm just checking if there is an approximate release date because my project is blocked and I would like to keep using this. Thanks 😊

datanikkthegreek commented 1 month ago

@chenliu0831 Also from my side, Spark 3.5 support is highly awaited 😃 I have been observing this thread for some time now.

No Spark 3.5 support would be a show stopper for using PyDeequ and rather an argument for Great Expectations :)

Looking forward to it and thanks for moving this topic forward

hardiktalati commented 1 month ago

@chenliu0831 bro, you mentioned it's nearly done. How far along is it?

hardiktalati commented 4 weeks ago

@chenliu0831 any updates? It has been more than a month now.

hardiktalati commented 3 weeks ago

@chenliu0831 Would appreciate a response; we are blocked due to the pending upgrade.

sqlkabouter commented 2 weeks ago

I'm evaluating PyDeequ vs. Great Expectations, and after reading all of this, PyDeequ seems very unreliable. How can it take over a year to add support for Spark 3.4?

hardiktalati commented 2 weeks ago

@chenliu0831 at least respond, so that we can make a decision.

D2Bull commented 1 week ago

We developed a DQ solution based on PyDeequ. After moving to Databricks, we lost the ability to continue working with the solution. We would appreciate an update regarding the implementation of Spark 3.5 and its official support (or support for Delta tables) as part of PyDeequ.

rdsharma26 commented 1 week ago

@hardiktalati @D2Bull @sqlkabouter We apologize for the inconvenience. We are actively working on the upgrade to Spark 3.4 and we aim to finish it as soon as possible. The upgrade to Spark 3.5 will follow right after.

hardiktalati commented 1 week ago

@rdsharma26 thanks for getting back. Is it possible to know tentative dates so that I can communicate them back to my colleagues?

rdsharma26 commented 4 days ago

@hardiktalati At the moment, we don't have a date to share. We are trying to find the root cause of two unit test failures. Upgrading PyDeequ to Spark 3.4 and using Deequ's 2.0.7 Spark 3.4 library results in the following error.

py4j.protocol.Py4JJavaError: An error occurred while calling o3327.run.
java.lang.NoSuchMethodError: 'breeze.generic.UFunc$UImpl2 breeze.linalg.DenseVector$.dv_dv_Op_Double_OpDiv()'

Once the RCA is done, if a new release of Deequ is required, then it can take a week until PyDeequ is fixed. If the fix is within PyDeequ itself, the new version with Spark 3.4 can be released within a few days.

Once the Spark 3.4 support is added, we will work on Spark 3.5 next.

rdsharma26 commented 3 days ago

We took a different approach from my previous message. It looks like we might need a new Deequ release to upgrade the Breeze dependency for Spark 3.4. In light of that, I created a PR that adds Spark 3.5 support: https://github.com/awslabs/python-deequ/pull/210

datanikkthegreek commented 3 days ago

@rdsharma26 Let us know once you have released a release candidate :)

Btw, is it worth supporting older Spark versions? I think maintenance is 18 months. I would probably cut releases for older versions at some point, especially if there are breaking changes :)

rdsharma26 commented 2 days ago

Spark 3.5 support has been added in https://pypi.org/project/pydeequ/1.4.0/ 🚀

@datanikkthegreek That's a great point. We did recently drop support for Spark 2.4. Spark 3.4 is still a relatively new version, so we will add support for it soon.

D2Bull commented 1 day ago

What am I missing? It seems that the latest announcement is about Spark 3.3.0. Where is Spark 3.5 mentioned?

🎉 Announcements 🎉 NEW!!! 1.1.0 release of Python Deequ has been published to PYPI https://pypi.org/project/pydeequ/. This release brings many recency upgrades including support up to Spark 3.3.0! Any feedbacks are welcome through github issues.

rdsharma26 commented 1 day ago

@D2Bull The README has been updated in the master branch. The project description in PyPI will not change until the next release.

rodrigofp-possiblefinance commented 13 hours ago

Hi folks.

I'm in the middle of migrating my data quality pipeline from Spark 3.1 to 3.5. Unfortunately, I have no means of changing my environment and I need to run my code on Spark 3.5.

Things are broken with pydeequ 1.4.0, mostly because, since Spark 3.4: "... Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported" (source)

Which causes things like this to break

[Screenshots: error, error_msg]

Any thoughts on it?

SemyonSinchenko commented 12 hours ago

Just choose the single user access mode in Databricks and it will work. The error you mentioned only occurs in a Spark Connect environment (see the Databricks shared access mode limitations).

@MrPowers FYI