awslabs / python-deequ

Python API for Deequ
Apache License 2.0
730 stars 136 forks source link

Deequ/PyDeequ compatibility with Spark-Connect #221

Open SemyonSinchenko opened 4 months ago

SemyonSinchenko commented 4 months ago

Is your feature request related to a problem? Please describe. At the moment python-deequ relies on direct calls to the underlying Spark JVM via SparkSession._jvm. But in the PySpark Connect _jvm is not available at all. So, at the moment python-deequ works only PySpark Classic, not in PySpark Connect. There were already reported problems from some of Spark-Connect users: reported incompatibility of python-deequ with Spark-Connect.

Describe the solution you'd like

Describe alternatives you've considered At the moment I do not know any other alternative. Based on the documentation of Spark-Connect and discussions in Spark mailing list / Jira, there are no plans to support _jvm in Connect.

Additional context It is not so hard to do. I did it for educational purposes recently and it required:

I'm willing to make all the required work and support the protobuf code with a new Python API.

In the deequ-core it may be done as a submodule. In the python-deequ it may be done as an extras. For example, pip install pydeequ will work as today, but pip install pydeequ[connect] will install also some protobuf dependencies and generated from proto classes.

It seems to me that adopting of deequ/python-deequ to Spark-Connect may be done without breaking changes. And it opens not only a way to make pydeequ works on Spark-Connect, but also a potential for creating others deequ APIs for Spark-Connect Go, Spark-Connect Rust, etc. Also it opens a way to use deequ from Spark-Connect via Java/Scala.

P.S. Using protobuf allows also to decouple Scala from Python at all and avoid all the problems with default, Option, etc. And it may be used not only in Connect but in PySpark Classic too.

MrPowers commented 4 months ago

This would be an awesome addition to keep this lib working on Spark Connect clusters!

grundprinzip commented 4 months ago

That is an awesome prototype! I think it would be great to allow users to leverage pydeequ directly from Spark Connect without forcing them to have a local JVM running.

chenliu0831 commented 3 months ago

@SemyonSinchenko impressive prototype and thanks for opening this request. I really want to get rid of the current py4j bridge as well!

About 550 lines of Scala code that implements plugin and protobuf parser;

I'm stilling catching up on what's needed here for Deequ, but likely this would be the most interesting and complex change. We should think about how to simplify it if possible.

Tagging @rdsharma26 to hear his thoughts given most of the changes would happen in Deequ side.

SemyonSinchenko commented 3 months ago

I'm stilling catching up on what's needed here for Deequ, but likely this would be the most interesting and complex change. We should think about how to simplify it if possible.

Hello!

Since I have already done it, I can explain in detail what should be done in Deequ-core:

  1. Definitions of proto3 messages and the corresponding plugin for Maven/SBT. These proto files cover about 90% of all Deequ APIs;
  2. A utility to create Deequ objects from proto3 messages. This code, about ~550 LOC. I do not expect more code. A separate helper object allows to make all changes without even touching the existing Deequ codebase. An alternative implementation will be to extend existing Scala helper objects with a method like fromProto;
  3. Server plugin to handle Deequ protocol messages. Very simple, less than 100 LOC;

If you want, I can open a draft-PR in Deequ core.

stevenayers commented 1 month ago

Massive +1 for this

stevenayers commented 1 month ago

@chenliu0831, what are your thoughts priorities-wise on this?

In my mind, many of the teams that are using Deequ in the first place are also following software best practices (local development, unit testing, CI/CD, etc.), which often means using spark/databricks connect against a dev cluster from your IDE.

Our team really wants to use Deequ, but it's not worth compromising on our development workflow. @SemyonSinchenko, I would love it if you could raise a draft PR so the community can slowly contribute and refine it. 🙂

SemyonSinchenko commented 1 month ago

@stevenayers Currently, I'm developing my Spark Connect compatible wrapper as a separate project. Understand me correctly, I do not want to work on a PR for deequ and pydeequ without a prior agreement with the core maintainers of both projects. At the moment, I see no interest from the core developers in my project, and that is perfectly fine and understandable, since the maintenance of anything merged into deequ/pydeequ will fall on their shoulders.

At the moment my project already contains a partially covered by tests scala connect plugin and additional helpers as well a user-friendly PySpark API. I released recently both JAR and Python wheel as a GitHub release, so you can try it. Feel free also to raise any issue or request a feature. The link to my project: https://github.com/SemyonSinchenko/tsumugi-spark

stevenayers commented 1 month ago

@stevenayers Currently, I'm developing my Spark Connect compatible wrapper as a separate project. Understand me correctly, I do not want to work on a PR for deequ and pydeequ without a prior agreement with the core maintainers of both projects. At the moment, I see no interest from the core developers in my project, and that is perfectly fine and understandable, since the maintenance of anything merged into deequ/pydeequ will fall on their shoulders.

At the moment my project already contains a partially covered by tests scala connect plugin and additional helpers as well a user-friendly PySpark API. I released recently both JAR and Python wheel as a GitHub release, so you can try it. Feel free also to raise any issue or request a feature. The link to my project: https://github.com/SemyonSinchenko/tsumugi-spark

That's understandable, and thanks for providing a link to the wrapper. I hope this gets added as core functionality 🙂