awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Match deequ support for spark 3.2.1 #93

Closed epilif1017a closed 1 year ago

epilif1017a commented 2 years ago

Deequ now supports Spark 3.2.1. However, pydeequ has not yet caught up even to Spark 3.1.x.

Goal: update pydeequ to support the new Deequ version and Spark 3.2.1.

ghirardinicola commented 2 years ago

Besides not being up to date with the Deequ version in the packages (pydeequ.deequ_maven_coord), are there other problems?
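For context, the Deequ jar is built per Spark minor version, so the Maven coordinate handed to Spark has to match the cluster. Below is a minimal sketch of picking a coordinate by Spark version. The helper name `deequ_coord_for` and the `DEEQU_COORDS` table are hypothetical, and the coordinate strings are assumptions to verify against Maven Central; this is not how pydeequ itself resolves `pydeequ.deequ_maven_coord`.

```python
# Illustrative mapping only: these coordinate strings are assumptions,
# check Maven Central for the builds that actually exist.
DEEQU_COORDS = {
    "3.1": "com.amazon.deequ:deequ:2.0.0-spark-3.1",
    "3.2": "com.amazon.deequ:deequ:2.0.1-spark-3.2",
}


def deequ_coord_for(spark_version: str) -> str:
    """Return the Deequ Maven coordinate for a Spark version like '3.2.1'."""
    # Only major.minor matters when choosing the Deequ build.
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return DEEQU_COORDS[major_minor]
    except KeyError:
        raise RuntimeError(
            f"No Deequ build listed for Spark {spark_version}"
        ) from None
```

The returned string could then be passed as the `spark.jars.packages` setting when building the session, in place of `pydeequ.deequ_maven_coord`.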

epilif1017a commented 2 years ago

Hi @ghirardinicola thanks for reaching out :)

Honestly, we can’t tell, as we are still on Spark 3.1.2 in our framework. We are holding off while we decide whether to cut Deequ out of a light version of the framework, and we are avoiding rolling out the DQ part of the framework globally, because unfortunately we cannot fork the project internally at the moment to keep up with Deequ (or contribute to the open source project; maybe one day we will find the capacity to do it).

On 3.1.2 we had the issue that, in certain scenarios, the Spark app wouldn’t finish automatically if we had pydeequ actions, but we managed to sort that out by manually closing the Spark context gateway (I think you have this on your issue list too; at least I remember seeing an issue). On 3.2 we have not tested yet, but I believe there are issues in your issue list reporting that some analyzers do not work.
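A rough sketch of that manual shutdown, for anyone hitting the same hang. It assumes `spark` is an active SparkSession; the helper name `stop_spark_cleanly` is made up for illustration, and `_gateway` is a private PySpark attribute, so this may change between versions.

```python
def stop_spark_cleanly(spark):
    """Shut down the py4j callback server, then stop the Spark session."""
    try:
        # A py4j callback server left running can keep the driver
        # process alive after the job has finished.
        spark.sparkContext._gateway.shutdown_callback_server()
    finally:
        # Stop the session even if the gateway shutdown raised.
        spark.stop()
```

Calling `stop_spark_cleanly(spark)` at the very end of the job, after all pydeequ actions have run, is the intended usage.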

Therefore, it would give the pydeequ user base much more confidence in the project if there were a faster release cycle between Spark versions, Deequ and pydeequ. Don’t get me wrong: we all understand that, as an open source project, the dev team is already kind enough to spend extra time working on it. But because this project and Deequ are so cool, their roadmap is very important to potential heavy users like us.

That’s why we are so interested and always asking for a new version :) We fully understand that things take time, but it would be cool to know whether there are still plans to keep updating the project. That would help the community decide whether to fork the project, go the extra mile to find time to go through all the code and start contributing to the open source project, or make some other decision.

We appreciate all your help and your kindness in making this open to everyone!

mycaule commented 2 years ago

The ApproxCountDistinct analyzer doesn't work, and neither do ConstraintSuggestionRunner and ColumnProfilerRunner.

java.lang.NoSuchMethodError: 'org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression. ...'

hashanpasindu commented 1 year ago

Does pydeequ==1.0.1 support Spark 3.2?

mycaule commented 1 year ago

Yes, it also generally works with Spark 3.3 and 3.1, but some components don't.

hashanpasindu commented 1 year ago

Thanks for the reply. I'm trying to use the column profiler and I'm getting this error: java.lang.NoSuchMethodError: 'org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression. ...'. I use pyspark 3.2.1 on Databricks and pydeequ 1.0.1.

Is there any workaround?

mycaule commented 1 year ago

There isn't; it is the same problem I had above. You should also have a look at this issue, as the release is expected very soon: https://github.com/awslabs/python-deequ/issues/106

chenliu0831 commented 1 year ago

1.1.0 is released with Spark 3.2/3.3 support - I would close this for now