Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.34k stars 164 forks source link

Dataframe non-equi joins #3380

Open universalmind303 opened 22 hours ago

universalmind303 commented 22 hours ago
          > > @kevinzwang wouldn't this be better to handle directly in the logical plan? That would allow us to support non equi joins in the dataframe api as well.

ex

df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0))

I agree. However, we would potentially need to have a concept of table-associated columns in our logical plan as well as make changes to our join op struct.

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR. Additionally, I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

Let me know what you think!

I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

that sounds good for now. I think an interesting dataframe case is df1.join(df2, on=(df1["a"] == df2["a"] & df2["b"] > 0)). How would the dataframe side identify df1.a vs df2.a? Wouldn't we need similar logic to handle this case?

Additionally, how would you express this using the dsl col syntax instead of bracket notation. col("df1.a")? or df1.col("a")

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR.

I'll open up an issue for dataframe non-equi joins just so we don't lose this!

Originally posted by @universalmind303 in https://github.com/Eventual-Inc/Daft/issues/3371#issuecomment-2489903359