Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Feature request : add support for SQL #2468

Closed djouallah closed 5 days ago

djouallah commented 4 weeks ago

Daft seems like a cool project, and it has enough functionality to be genuinely useful. For many new users, though, learning a new API is annoying, so it would be really nice to expose a SQL interface.

jaychia commented 4 weeks ago

Thanks @djouallah! We're actively considering this and there are a lot of people recently asking for it...

What do you think of Daft supporting only SQL queries (DML), but none of the other DDL or persistence features?

I'm imagining an API like this to start with:

df = daft.read_deltalake(...)

df = daft.sql(f"""
SELECT x + 1 FROM {df} WHERE x == 1
""")

df.to_pandas()

Where .sql supports SELECT and nothing else.
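As a rough illustration of what a SELECT-only entry point might enforce, here is a minimal stdlib-only sketch. `check_select_only` and its keyword list are hypothetical stand-ins, not Daft's actual API or validation logic:

```python
import re

# Statement types a SELECT-only interface would reject (illustrative list).
_FORBIDDEN = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|TRUNCATE)\b", re.IGNORECASE
)

def check_select_only(query: str) -> str:
    """Raise unless the query is a single SELECT (or WITH ... SELECT) statement."""
    stripped = query.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("only a single statement is supported")
    if _FORBIDDEN.match(stripped):
        raise ValueError("only SELECT queries are supported")
    if not re.match(r"^\s*(SELECT|WITH)\b", stripped, re.IGNORECASE):
        raise ValueError("query must start with SELECT (or WITH)")
    return stripped

check_select_only("SELECT x + 1 FROM tbl WHERE x = 1")  # accepted
```

A real implementation would of course inspect the parsed AST rather than keywords, but the guard above captures the intended contract: one SELECT statement, no DDL/persistence.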

djouallah commented 4 weeks ago

Ideally something like this; I already have 4 engines supported this way:

https://colab.research.google.com/drive/1OONpJzwYMQMRMDQmZmW0OD-wl3n4k-JS#scrollTo=2JDOw59r1Qbf

df = daft.read_deltalake(...)
daft.register("table1",df)

daft.sql("""
SELECT x + 1 FROM table1 WHERE x == 1
""").show()

RCHowell commented 1 week ago

Daft could take one of several approaches to supporting SQL — each of which presents future opportunities.

Approach A — DataFusion (recommended)

The datafusion-sql crate is likely the best first step toward Daft SQL support. Its logical algebra should be easier to translate to Daft builders than the raw sqlparser AST, because the library effectively does the algebra translation for us.

This crate provides a general purpose SQL query planner that can parse SQL and translate queries into logical plans. Although this crate is used by the DataFusion query engine, it was designed to be easily usable from any project that requires a SQL query planner and does not make any assumptions about how the resulting logical plan will be translated to a physical plan. For example, there is no concept of row-based versus columnar execution in the logical plan.

There is existing support for handling multiple input dialects with sqlparser, and we could evolve a custom Daft SQL Analyzer (semantic analysis / logical planning only) without changing the parser (sqlparser) library.

Approach B — sqlparser + Daft Analyzer

This approach is Approach A sans datafusion-sql; that is, we effectively fork datafusion-sql to take full control over the sqlparser-AST-to-logical-algebra translation/analyzer. I understand that this gives full control over semantic analysis, but I think it may be too ambitious for a first pass.

My reservations with Approach B are:

  1. I am not aware of datafusion-sql limitations that warrant the extra effort.
  2. If limitations do exist, we'll find them when they matter, i.e. when we are actually constrained by them.
  3. Any limitations present an opportunity to work with the DataFusion community to improve that project.
  4. Approach A is an incremental step toward Approach B.
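For a sense of the analyzer work Approach B takes on, here is a toy Python sketch of translating a SELECT-shaped AST into logical-plan nodes. The dict-based AST stands in for the sqlparser AST, and every name below is hypothetical, not Daft or DataFusion code:

```python
from dataclasses import dataclass

# Toy logical-plan nodes (Scan -> Filter -> Project), mirroring the shape
# a SELECT ... FROM ... WHERE ... query produces.
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    child: object

@dataclass
class Project:
    exprs: list
    child: object

def analyze(ast: dict):
    """Translate a SELECT-shaped AST into a logical plan tree."""
    plan = Scan(ast["from"])
    if "where" in ast:
        plan = Filter(ast["where"], plan)
    return Project(ast["select"], plan)

plan = analyze({"select": ["x + 1"], "from": "table1", "where": "x = 1"})
# plan: Project(["x + 1"], Filter("x = 1", Scan("table1")))
```

datafusion-sql performs essentially this translation (plus name resolution, type checking, and much more) out of the box, which is why Approach A is the smaller lift.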

Approach C — SQLGlot + Daft Analyzer (python)

This is much like Approach B but coming at it from the Python side. That is, we use SQLGlot as the SQL parser to get its various (21) dialects, but then add the missing analyzer/planner piece.

However, SQLGlot does not aim to be a SQL validator, so it may fail to detect certain syntax errors.

This analyzer would be written in Python and leverage catalog metadata libraries such as PyIceberg to perform the necessary semantic analysis. Like Approach B this would be a bit more ambitious, so I still recommend Approach A. I also believe that a SQLGlot layer could be decorated on top of Approach A — that is, suppose there's a stabilized "Daft SQL" dialect and we want to support multiple inputs .. we can implement a SQLGlot custom dialect that we redirect the input to.

Something like,

df = daft.sql("SELECT ....")  # approach A

# later releases..

df1 = daft.sql("SELECT ...") # approach A still exists
df2 = daft.sql("SELECT ...", dialect="duckdb") # invoke SQLGlot with input="duckdb-sql" and output="daft-sql"
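That layering could be sketched as follows. The `transpile` function is a stub standing in for SQLGlot's `sqlglot.transpile(query, read=dialect, write=...)`, and `plan` stands in for the base (Approach A) planner; nothing here is real Daft or SQLGlot code:

```python
def plan(query: str):
    """Stub for the base planner that accepts the 'Daft SQL' dialect."""
    return {"plan_for": query}

def transpile(query: str, read: str, write: str) -> str:
    """Stub for a SQLGlot-style dialect rewrite; a real implementation
    would translate dialect-specific syntax into the target dialect."""
    return query

def sql(query: str, dialect: str = "daft"):
    """Dispatch: non-default dialects are transpiled before planning."""
    if dialect != "daft":
        query = transpile(query, read=dialect, write="daft")
    return plan(query)

df1 = sql("SELECT x + 1 FROM table1")                    # base path
df2 = sql("SELECT x + 1 FROM table1", dialect="duckdb")  # transpile first
```

The key design point is that the dialect layer is purely a front-end rewrite: the planner only ever sees the stabilized "Daft SQL" dialect.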

universalmind303 commented 1 week ago

@RCHowell after some internal discussion, I think we are leaning towards something similar to "Approach B". Approach A requires a dependency on DataFusion, which would greatly increase our compile time and build size. "Approach C" requires keeping the logic in Python, and we've been actively trying to move more of our core logic into Rust.

RCHowell commented 1 week ago

The first one requires a dependency on datafusion, which would greatly increase our compile time and build size

I believe we would only need the datafusion-sql crate rather than the full datafusion crate. I'm not sure if that changes your assessment.

universalmind303 commented 6 days ago

I believe we would only need the datafusion-sql crate rather than the full datafusion crate.

Yes; unfortunately, since we use a custom Arrow implementation (a fork of arrow2), we can't pull in any DataFusion crates without pulling in multiple heavy dependencies such as arrow-rs and parquet.

RCHowell commented 6 days ago

Understood, thank you for the clarification!