Open SemyonSinchenko opened 4 months ago
This would be an awesome addition to keep this lib working on Spark Connect clusters!
That is an awesome prototype! I think it would be great to allow users to leverage pydeequ directly from Spark Connect without forcing them to have a local JVM running.
@SemyonSinchenko impressive prototype and thanks for opening this request. I really want to get rid of the current py4j bridge as well!
About 550 lines of Scala code that implement the plugin and the protobuf parser;
I'm still catching up on what's needed here for Deequ, but this would likely be the most interesting and complex change. We should think about how to simplify it if possible.
Tagging @rdsharma26 to hear his thoughts, given most of the changes would happen on the Deequ side.
Hello!
Since I have already done it, I can explain in detail what should be done in Deequ-core (for example, a `fromProto` parser). If you want, I can open a draft PR in Deequ core.
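To make the Deequ-core side concrete, here is a rough sketch of the `fromProto` idea (written in Python for illustration, though the real parser would live in Scala; all message and field names here are invented, not deequ's actual API):

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for protobuf-generated message classes.
# Real code would use classes generated by protoc from a .proto schema.
@dataclass
class ConstraintMsg:   # invented message type, for illustration only
    kind: str          # e.g. "isComplete", "isUnique"
    column: str

@dataclass
class CheckMsg:        # invented message type, for illustration only
    level: str         # "Error" or "Warning"
    description: str
    constraints: list

@dataclass
class Check:           # stand-in for a server-side deequ Check object
    level: str
    description: str
    constraints: list

def from_proto(msg: CheckMsg) -> Check:
    """Rebuild a server-side Check from its serialized message form."""
    return Check(
        level=msg.level,
        description=msg.description,
        constraints=[(c.kind, c.column) for c in msg.constraints],
    )

msg = CheckMsg("Error", "basic checks", [ConstraintMsg("isComplete", "id")])
check = from_proto(msg)
print(check.constraints)  # [('isComplete', 'id')]
```

The point is only the shape of the translation: the plugin receives a message, walks it, and materializes the corresponding deequ objects on the server.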
Massive +1 for this
@chenliu0831, what are your thoughts priorities-wise on this?
In my mind, many of the teams that are using Deequ in the first place are also following software best practices (local development, unit testing, CI/CD, etc.), which often means using spark/databricks connect against a dev cluster from your IDE.
Our team really wants to use Deequ, but it's not worth compromising on our development workflow. @SemyonSinchenko, I would love it if you could raise a draft PR so the community can slowly contribute and refine it. 🙂
@stevenayers Currently, I'm developing my Spark Connect compatible wrapper as a separate project. Please understand me correctly: I do not want to work on a PR for `deequ` and `pydeequ` without prior agreement from the core maintainers of both projects. At the moment, I see no interest from the core developers in my project, and that is perfectly fine and understandable, since the maintenance of anything merged into deequ/pydeequ will fall on their shoulders.
At the moment my project already contains a Scala connect plugin (partially covered by tests), additional helpers, and a user-friendly PySpark API. I recently released both a JAR and a Python wheel as a GitHub release, so you can try it. Feel free to raise an issue or request a feature. The link to my project: https://github.com/SemyonSinchenko/tsumugi-spark
That's understandable, and thanks for providing a link to the wrapper. I hope this gets added as core functionality 🙂
**Is your feature request related to a problem? Please describe.**
At the moment `python-deequ` relies on direct calls to the underlying Spark JVM via `SparkSession._jvm`. But in PySpark Connect, `_jvm` is not available at all. So, at the moment, `python-deequ` works only in PySpark Classic, not in PySpark Connect. Problems have already been reported by Spark Connect users: reported incompatibility of python-deequ with Spark Connect.

**Describe the solution you'd like**
Add to `python-deequ` an alternative implementation of the API that is built on top of Python classes generated from protobuf.

**Describe alternatives you've considered**
At the moment I do not know of any alternative. Based on the documentation of Spark Connect and discussions on the Spark mailing list / Jira, there are no plans to support `_jvm` in Connect.

**Additional context**
It is not so hard to do. I did it recently for educational purposes, and it required, among other things, a new Python API for `python-deequ`. I'm willing to do all the required work and support the protobuf code with a new Python API.
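As an illustration of what "an API built on top of classes generated from protobuf" could look like on the client, here is a minimal, hypothetical sketch; a plain dataclass and JSON stand in for a protoc-generated message and its wire format, and every name below is invented:

```python
import json
from dataclasses import dataclass, asdict

# Stand-in for a protobuf-generated message; field names are invented.
@dataclass
class VerificationSuiteMsg:
    dataset_id: str
    checks: list  # list of {"kind": ..., "column": ...} dicts

def serialize(msg: VerificationSuiteMsg) -> bytes:
    # A real protobuf message would be serialized with msg.SerializeToString();
    # JSON stands in for the binary wire format in this sketch.
    return json.dumps(asdict(msg)).encode("utf-8")

suite = VerificationSuiteMsg("sales", [{"kind": "isComplete", "column": "id"}])
payload = serialize(suite)
# The Spark Connect server plugin would deserialize this payload and
# translate it into real deequ Check / VerificationSuite objects,
# so the Python client never needs to touch SparkSession._jvm.
```

The design point: the client only builds and ships messages, which is exactly the interaction model Spark Connect's plugin mechanism supports.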
In deequ-core it may be done as a submodule. In python-deequ it may be done as an extra. For example, `pip install pydeequ` will work as it does today, but `pip install pydeequ[connect]` will also install some protobuf dependencies and the classes generated from the proto files.

It seems to me that adapting deequ/python-deequ to Spark Connect can be done without breaking changes. And it opens not only a way to make pydeequ work on Spark Connect, but also the potential for creating other deequ APIs for Spark Connect Go, Spark Connect Rust, etc. It also opens a way to use deequ from Spark Connect via Java/Scala.
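The extras idea could be wired up in packaging metadata roughly like this (a hedged sketch: the dependency names and version pins are illustrative, not pydeequ's actual requirements):

```python
# Illustrative extras_require mapping for a hypothetical pydeequ release.
# `pip install pydeequ` installs only the base requirements;
# `pip install pydeequ[connect]` additionally pulls in the runtime
# dependencies needed by the classes generated from the .proto files.
extras_require = {
    "connect": [
        "protobuf>=4.21",  # runtime for protoc-generated classes (pin is illustrative)
        "grpcio>=1.48",    # Spark Connect transport (pin is illustrative)
    ],
}

# In setup.py this dict would be passed as:
#   setup(..., extras_require=extras_require)
# or, in pyproject.toml, under [project.optional-dependencies].
```

This keeps the default install unchanged, which matches the "no breaking changes" goal above.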
P.S. Using protobuf also makes it possible to decouple Scala from Python entirely and avoid all the problems with `default`, `Option`, etc. And it can be used not only in Connect but in PySpark Classic too.
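To illustrate the `Option`/default point: a message schema can distinguish "field never set" from "field set to its default value", which plain py4j argument passing cannot express cleanly. A stdlib-only sketch of that distinction (protobuf `optional` fields provide the same behavior via `HasField`):

```python
from dataclasses import dataclass

# Sentinel marking "this field was never provided", analogous to an
# unset protobuf optional field. None itself may be a legal user value,
# so it cannot double as the "unset" marker.
_UNSET = object()

@dataclass
class ThresholdMsg:  # invented message type, for illustration only
    max_ratio: object = _UNSET

def effective_ratio(msg: ThresholdMsg, default: float = 1.0) -> float:
    # The Scala side can map "unset" to Option.empty and "set" to Some(value)
    # without guessing whether a default leaked in from the Python layer.
    return default if msg.max_ratio is _UNSET else msg.max_ratio

print(effective_ratio(ThresholdMsg()))     # 1.0  (unset -> server default)
print(effective_ratio(ThresholdMsg(0.2)))  # 0.2  (explicitly provided)
```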