awslabs / python-deequ

Python API for Deequ
Apache License 2.0

[Feature Request] DuckDB as another analytic engine for Deequ #128

Open chenliu0831 opened 1 year ago

chenliu0831 commented 1 year ago

Is your feature request related to a problem? Please describe.

Today, PyDeequ is a PySpark binding for Deequ, which is written in Scala and runs only on Spark. While that is a good fit for data engineers, Spark is not a great fit for many data science use cases, where the datasets fit in memory and users do not want to set up Spark.

See initial ideas here https://youtu.be/fvKFOfaLwBA?t=1393 from @sscdotopen.

Describe the solution you'd like

As discussed in the video above, it would be good to create the proper abstractions to support another analytic engine. DuckDB, which has gained popularity recently, could be that engine. The design needs more thought and discussion.

tdoehmen commented 1 year ago

Hi, I am a PhD student who has worked with @sscdotopen on implementing this idea. The main challenge was indeed separating the business logic from the execution engine. We published a workshop paper about that (https://ssc.io/pdf/duckdq.pdf) and released the source code here: https://github.com/tdoehmen/duckdq

chenliu0831 commented 1 year ago

@tdoehmen happy to hear from you! We will take a look at the paper and the repo. Would you be open to evolving or migrating some of the code to PyDeequ, if that turns out to be the right thing to do?

In addition to bringing in another analytic engine, we want more generally to improve the ergonomics of Deequ for Python users.

chenliu0831 commented 1 year ago

For the separation of business logic from the execution engine, I think ibis could be a useful candidate. We might add a detailed design write-up here.
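To make the separation concrete, here is a minimal sketch of the idea. It is not code from Deequ, PyDeequ, duckdq, or ibis. Every name in it (`Completeness`, `Engine`, `InMemoryEngine`, `SqlEngine`, `is_complete`) is hypothetical, and the standard-library `sqlite3` module stands in for a SQL engine such as DuckDB only so that the example runs without extra dependencies. The metric definition and the check are written once, and each engine decides how to compute them.

```python
import sqlite3
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Completeness:
    """Engine-agnostic metric definition: fraction of non-null values in a column."""
    column: str


class Engine(Protocol):
    """Hypothetical interface that every execution engine would implement."""
    def completeness(self, metric: Completeness) -> float: ...


class InMemoryEngine:
    """Evaluates metrics over a sequence of dict rows in plain Python."""

    def __init__(self, rows: Sequence[dict]):
        self.rows = rows

    def completeness(self, metric: Completeness) -> float:
        if not self.rows:
            return 1.0  # assumption of this sketch: empty data counts as complete
        non_null = sum(1 for row in self.rows if row.get(metric.column) is not None)
        return non_null / len(self.rows)


class SqlEngine:
    """Evaluates the same metrics by pushing an aggregate query down to a SQL database."""

    def __init__(self, conn: sqlite3.Connection, table: str):
        self.conn = conn
        self.table = table

    def completeness(self, metric: Completeness) -> float:
        # COUNT(col) skips NULLs while COUNT(*) counts every row.
        # Identifiers are interpolated directly: fine for a sketch, not for untrusted input.
        query = f'SELECT COUNT("{metric.column}"), COUNT(*) FROM "{self.table}"'
        non_null, total = self.conn.execute(query).fetchone()
        return non_null / total if total else 1.0


def is_complete(engine: Engine, column: str, threshold: float = 1.0) -> bool:
    """Business logic: the check only talks to the Engine interface."""
    return engine.completeness(Completeness(column)) >= threshold


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": "c"},
    {"id": 4, "name": "d"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO items VALUES (:id, :name)", rows)

# The same checks run unchanged against both engines.
engines = [InMemoryEngine(rows), SqlEngine(conn, "items")]
results = [(is_complete(e, "id"), is_complete(e, "name", threshold=0.9)) for e in engines]
```

Both engines agree that `id` is fully populated and that `name` falls short of the 0.9 threshold. In a real design the SQL side would presumably be generated by a layer such as ibis rather than hand-written strings, but that part is left open here.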

tdoehmen commented 1 year ago

I am a bit limited on time, but happy to look over a more detailed design. If it makes sense to migrate parts of duckdq, I'm also happy to help out. Just a few more random thoughts on the challenges involved in extending (py)deequ with another execution engine:

duckdq