Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.34k stars 164 forks source link

Joins on empty tables give exceptions #3071

Open ukclivecox opened 1 month ago

ukclivecox commented 1 month ago

Describe the bug

SQL Joins, either inner or left fail when right hand side is an empty table.

To Reproduce

import daft

df1 = daft.from_pydict({"idx":[1,2],"val":[10,20]})

df2 = daft.from_pydict({"idx":[],"score":[]})

df_sql = daft.sql("select * from df1 join df2 on (df1.idx=df2.idx)").show()

This fails with

InvalidSQLException: Daft error: DaftError::External Unable to create logical plan node.
Due to: DaftError::ValueError Can't join on null type expressions: col(idx)

This also happens with left joins where its more usual to be joining with tables that may have missing rows. It works if the table has at least 1 row but fails with the same error if the table has no rows.

Expected behavior

An inner join on an empty table with valid column schema should return an empty dataframe. A left join on a table with empty rows should return NULLs for all columns of right table.

Component(s)

SQL

Additional context

Testing with nightly build 0.3.8+dev0019.e4c6f3fa that has additions for joins from https://github.com/Eventual-Inc/Daft/pull/3066

universalmind303 commented 1 month ago

This appears to be an issue with the core engine, not just the sql implementation.

This produces the same error.

df1 = daft.from_pydict({"idx":[1,2],"val":[10,20]})
df2 = daft.from_pydict({"idx":[],"score":[]})

df1.join(df2, on="idx").show()
jaychia commented 2 weeks ago

This is an issue with typing rather than the joins being problematic:

In [4]: df1 = daft.from_pydict({"idx":[1,2],"val":[10,20]})
   ...: df2 = daft.from_pydict({"idx":daft.Series.from_pylist([]).cast(daft.DataType.int64()),"score":[]})
   ...: 
   ...: df1.join(df2, on="idx").show()
╭───────┬───────┬───────╮
│ idx   ┆ val   ┆ score │                                                                                                                                                                                                                                                       
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Null  │
╞═══════╪═══════╪═══════╡
╰───────┴───────┴───────╯

(No data to display: Materialized dataframe has no rows)

I think we could fix our join logic to allow joins against null columns though. @kevinzwang do you think that might be a fast fix?