Open machadoluiz opened 4 months ago
We take backward compatibility very seriously, as any AWS API or AWS-owned library does. Dropping support for EOL Spark versions could be an option, but it needs a bit more research.
I don't think it's very hard to fix https://github.com/awslabs/python-deequ/issues/169, but the change would have to be made on the Deequ Scala side (adding overloaded functions with the old parameters). We currently do not have a date.
As a workaround, you can set the environment variable `SPARK_VERSION=3.3`, and to my knowledge most PyDeequ features should continue to work. Although unlikely, there might be runtime errors from breaking changes between Spark versions 3.3 and 3.5.
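The workaround above can be sketched as follows. One assumption here: PyDeequ reads `SPARK_VERSION` at import time to pick the matching Deequ artifact, so the variable has to be set before `pydeequ` is imported (the import is left commented out in this sketch):

```python
import os

# Assumption: PyDeequ resolves the Deequ Maven coordinates from
# SPARK_VERSION when the module is first imported, so set it first.
os.environ["SPARK_VERSION"] = "3.3"

# import pydeequ  # safe to import only after the variable is set

print(os.environ["SPARK_VERSION"])
```

Alternatively, export the variable in the shell (`export SPARK_VERSION=3.3`) before launching the job, which avoids ordering concerns inside the script.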
I see. But is native support for higher Spark versions planned at all? If so, when is it scheduled?
Is there a date for when this update might be expected? I am currently working on a project that uses PySpark 3.4.1 in Databricks, and I would like to use PyDeequ.
Hello! Just checking in to see if there's any news on when we might expect this new feature to drop. Any rough idea of a release date? I'm using this library in my project and need to upgrade to Spark 3.4, since we're on Databricks Runtime 13.3 LTS and would like to keep using this. Thanks!
Hey guys, any plans on upgrading to Spark 3.4?
@chenliu0831 do you have a release date for the Spark 3.5 upgrades?
I think we are getting very close: https://github.com/awslabs/python-deequ/pull/203 (only 2 test failures left, down to a dependency issue).
@chenliu0831 how is it looking, buddy? Can we expect a release this week? Also, are you doing it for both 3.4 and 3.5?
@hardiktalati the fix for the 2 failures would need a Deequ release, I think; please be patient and I will post updates. I think it should solve both 3.4 and 3.5, and we may release them together.
Hello! Any fresh news? I know it's complicated, and we have to be patient. I'm just checking if there is an approximate release date, because my project is blocked and I would like to keep using this. Thanks!
@chenliu0831 Also from my side, Spark 3.5 support is highly awaited 🙂 I have been observing this thread for some time now.
No Spark 3.5 support would be a show stopper for using PyDeequ, and rather an argument for Great Expectations :)
Looking forward to it and thanks for moving this topic forward
@chenliu0831 bro, you mentioned it's nearly done; how far along is it?
@chenliu0831 any updates? It has been more than a month now.
@chenliu0831 Would appreciate the response, we are blocked due to the pending upgrade
I'm evaluating PyDeequ vs. Great Expectations, and after reading all this, PyDeequ seems very unreliable. How can it take over a year to add support for Spark 3.4?
@chenliu0831 at least respond back... so that we can make a decision.
We developed a DQ solution based on PyDeequ. After moving to Databricks, we lost the ability to continue working with the solution. We would appreciate an update regarding the implementation of Spark 3.5 support and official support (or support for Delta Tables) as part of PyDeequ.
@hardiktalati @D2Bull @sqlkabouter We apologize for the inconvenience. We are actively working on the upgrade to Spark 3.4 and we aim to finish it as soon as possible. The upgrade to Spark 3.5 will follow right after.
@rdsharma26 thanks for getting back. Is it possible to know tentative dates so that I can comms back to the colleagues
@hardiktalati At the moment, we don't have a date to share. We are trying to root cause the failure of two unit tests. Upgrading PyDeequ to Spark 3.4 and using Deequ's 2.0.7 Spark 3.4 library is resulting in the following error.
```
py4j.protocol.Py4JJavaError: An error occurred while calling o3327.run.
java.lang.NoSuchMethodError: 'breeze.generic.UFunc$UImpl2 breeze.linalg.DenseVector$.dv_dv_Op_Double_OpDiv()'
```
Once the RCA is done, if a new release of Deequ is required, then it can take a week until PyDeequ is fixed. If the fix is within PyDeequ itself, the new version with Spark 3.4 can be released within a few days.
Once the Spark 3.4 support is added, we will work on Spark 3.5 next.
We took a different approach from my previous message. It looks like we might need a new Deequ release to upgrade the Breeze dependency for Spark 3.4. In light of that, I created a PR that adds Spark 3.5 support: https://github.com/awslabs/python-deequ/pull/210
@rdsharma26 Let us know once you have released a release candidate :)
By the way, is it worth supporting older Spark versions? I think the maintenance window is 18 months. I would probably cut a release dropping older versions at some point, especially if there are breaking changes :)
Spark 3.5 support has been added in https://pypi.org/project/pydeequ/1.4.0/ 🎉
@datanikkthegreek That's a great point. We did recently drop support for Spark 2.4. Spark 3.4 is still a relatively newer version, so we will add support for it soon.
What am I missing? It seems that the latest announcement is regarding Spark 3.3.0. Where is Spark 3.5 mentioned?
📣 Announcements 📣 NEW!!! The 1.1.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release brings many upgrades, including support up to Spark 3.3.0! Any feedback is welcome through GitHub issues.
@D2Bull The README has been updated in the master branch. The project description on PyPI will not change until the next release.
Hi folks.
I'm in the middle of migrating my data quality pipeline from Spark 3.1 to 3.5. Unfortunately, I don't have the means to change my environment, and I need to run my code on Spark 3.5.
Things are broken in PyDeequ 1.4.0, mostly because, since Spark 3.4: "... Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported" (source).
This causes things like this to break.
Any thoughts on it?
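One way to deal with this is to fail fast with a clear message instead of the opaque Py4J error. Below is a sketch, under the assumption that PyDeequ drives the Deequ Scala library through the JVM-backed `SparkContext` (`spark.sparkContext._jvm`), which Spark Connect sessions do not expose:

```python
def supports_pydeequ(spark) -> bool:
    """Return True if the session exposes a JVM-backed SparkContext.

    Assumption: PyDeequ talks to the Deequ Scala library via py4j,
    so it needs spark.sparkContext._jvm. Under Spark Connect (e.g.
    Databricks shared access mode), accessing sparkContext raises,
    so we treat any failure as "not supported".
    """
    try:
        return spark.sparkContext._jvm is not None
    except Exception:
        return False
```

A pipeline can call this guard once at startup and skip the PyDeequ checks (or raise a descriptive error) when it returns `False`, rather than failing mid-run on the missing `SparkContext`.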
Just choose single user access mode isolation in Databricks and it will work. The error you mentioned is only related to the Spark Connect environment (see the Databricks shared access mode limitations).
@MrPowers FYI
Is your feature request related to a problem? Please describe. I'm currently facing issues with PyDeequ's support for Apache Spark version 3.4.0, since it is impacting several projects in my organization that use PyDeequ as a data quality tool. The problem arises because our EMR clusters are required to support the latest releases, but since the release of emr-6.12.0, support for Apache Spark 3.3.x has been dropped.
Describe the solution you'd like I would like PyDeequ to be updated to support Apache Spark 3.4.0 and, ideally, also the most recent version, 3.5.0. I would also like to understand the requirements for this support, such as whether there are any backwards-compatibility requirements for PyDeequ, and whether all future PyDeequ versions need to continue supporting all of the currently supported Spark and Deequ versions, or if there is scope for dropping support for some versions, as mentioned in #178.
Describe alternatives you've considered As an alternative, we have considered migrating to Great Expectations due to its active maintenance and large community. However, PyDeequ is still preferred due to its seamless integration with our internal PySpark library. The transition to a new tool would also require significant resources and time. Therefore, having PyDeequ support Apache Spark 3.4.0 and 3.5.0 would be the most beneficial solution for us.
Additional context It seems that Deequ is already supporting Apache Spark 3.4.0 (#505) and most recently 3.5.0 (#514).