apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

Inconsistent `IS NOT NULL` filters for joins #6432

Open sarahyurick opened 1 year ago

sarahyurick commented 1 year ago

Describe the bug

In the Dask-SQL project, we have relied on DataFusion to create IS NOT NULL filters at the TableScan level whenever a column is involved in a join. However, it looks like recent changes may have removed this feature?

To Reproduce

The query

SELECT d_col
       FROM c_table
       JOIN d_table ON d_col=c_col

has the LogicalPlan

Projection: d_table.d_col
  Inner Join:  Filter: d_table.d_col = c_table.c_col
    TableScan: c_table projection=[c_col]
    TableScan: d_table projection=[d_col]

Expected behavior

It still works when we write the query with a WHERE clause.

SELECT d_col
       FROM c_table, d_table WHERE d_col=c_col

produces

Projection: d_table.d_col
  Inner Join: c_table.c_col= d_table.d_col
    TableScan: c_table projection=[c_col], full_filters=[c_table.c_col IS NOT NULL]
    TableScan: d_table projection=[d_col], full_filters=[d_table.d_col IS NOT NULL]

Additional context

I'm not quite sure when this change was introduced and if so, why? Is this something that DataFusion would be willing to fix, or would it be preferred that Dask-SQL re-adds the optimizer rule on our side?

cc @ayushdg @jdye64

mingmwang commented 1 year ago

I will take a look. There is a rule FilterNullJoinKeys which will filter out the null keys.

mingmwang commented 1 year ago

Looks like by default this optimization is disabled. Could you please help to confirm which version of DataFusion you run in the past ?

pub filter_null_join_keys: bool, default = false

sarahyurick commented 1 year ago

Thanks @mingmwang ! Knowing which rule/variable to look at is helpful. I'm still not able to see the IS NOT NULL filters though, even when I verbosely initialize with let config = OptimizerContext::new().filter_null_keys(true);.

Is it possible that IS NOT NULL isn't being added because it's a Inner Join: Filter: d_table.d_col = c_table.c_col instead of a Inner Join: c_table.c_col= d_table.d_col?

mingmwang commented 1 year ago

I will take a closer look.