Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.34k stars 164 forks source link

[FEAT] Filter predicates in SQL join #3371

Closed kevinzwang closed 21 hours ago

kevinzwang commented 1 day ago

Adds support for things like:

SELECT * FROM a JOIN b ON a.x = b.x AND a.y > 0

Enables TPC-H q13

codspeed-hq[bot] commented 1 day ago

CodSpeed Performance Report

Merging #3371 will not alter performance

Comparing kevin/sql-join-on-filter (c018163) with main (5fee192)

Summary

✅ 17 untouched benchmarks

universalmind303 commented 23 hours ago

@kevinzwang wouldn't this be better to handle directly in the logical plan? That would allow us to support non equi joins in the dataframe api as well.

ex

df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0))
codecov[bot] commented 23 hours ago

Codecov Report

Attention: Patch coverage is 87.80488% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.45%. Comparing base (b6695eb) to head (c018163). Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-sql/src/planner.rs 87.70% 15 Missing :warning:
Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3371/graphs/tree.svg?width=650&height=150&src=pr&token=J430QVFE89&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3371?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) ```diff @@ Coverage Diff @@ ## main #3371 +/- ## ========================================== + Coverage 77.39% 77.45% +0.05% ========================================== Files 678 678 Lines 83300 83278 -22 ========================================== + Hits 64469 64501 +32 + Misses 18831 18777 -54 ``` | [Files with missing lines](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3371?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) | Coverage Δ | | |---|---|---| | [src/daft-sql/src/lib.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3371?src=pr&el=tree&filepath=src%2Fdaft-sql%2Fsrc%2Flib.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc3FsL3NyYy9saWIucnM=) | `100.00% <100.00%> (ø)` | | | [src/daft-sql/src/planner.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3371?src=pr&el=tree&filepath=src%2Fdaft-sql%2Fsrc%2Fplanner.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2RhZnQtc3FsL3NyYy9wbGFubmVyLnJz) | `71.81% <87.70%> (+1.17%)` | :arrow_up: | ... and [3 files with indirect coverage changes](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3371/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)

🚨 Try these New Features:

graphite-app[bot] commented 23 hours ago

Graphite Automations

"Request reviewers once CI passes" took an action on this PR • (11/21/24)

1 reviewer was added to this PR based on Andrew Gazelka's automation.

kevinzwang commented 23 hours ago

@kevinzwang wouldn't this be better to handle directly in the logical plan? That would allow us to support non equi joins in the dataframe api as well.

ex

df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0))

I agree. However, we would potentially need to have a concept of table-associated columns in our logical plan as well as make changes to our join op struct.

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR. Additionally, I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

Let me know what you think!

universalmind303 commented 22 hours ago

@kevinzwang wouldn't this be better to handle directly in the logical plan? That would allow us to support non equi joins in the dataframe api as well. ex

df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0))

I agree. However, we would potentially need to have a concept of table-associated columns in our logical plan as well as make changes to our join op struct.

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR. Additionally, I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

Let me know what you think!

I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

that sounds good for now. I think an interesting dataframe case is df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0)). How would the dataframe side identify df1.a vs df2.a? Wouldn't we need similar logic to handle this case?

Additionally, how would you express this using the dsl col syntax instead of bracket notation. col("df1.a")? or df1.col("a")

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR.

I'll open up an issue for dataframe non-equi joins just so we don't lose this!

kevinzwang commented 18 hours ago

that sounds good for now. I think an interesting dataframe case is df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0)). How would the dataframe side identify df1.a vs df2.a? Wouldn't we need similar logic to handle this case?

DataFusion has a concept of a table reference in a column expression and I think we could something similar if you use df["a"] notation