apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.9k stars 1.11k forks source link

Configurable null_equals_null flag #11883

Open berkaysynnada opened 1 month ago

berkaysynnada commented 1 month ago

Is your feature request related to a problem or challenge?

Currently the null_equals_null flag in DF joins is hard-coded to false across multiple places in the code. It is not configurable either from a query or session configuration. We need to make this flag configurable to allow more flexibility in how joins handle null values.

Describe the solution you'd like

I am not sure which alternative is more useful and expected, and I couldn't find any examples. If we select it as a session config, users will need to change the setting for each preference change. On the other hand, if it is provided within the query, users must explicitly specify it for each non-default case.

Describe alternatives you've considered

Perhaps both solutions can coexist.

Additional context

No response

korowa commented 1 month ago

Just some thoughts: maybe it's better not to expose this flag, but set it while planning in case of queries like

     with data as (
        select 1 as f
        union all
        select 2 as f
        union all
        select null as f
    )
    select * from data t1
        join data t2
            on t1.f is not distinct from t2.f; -- it (probably) may also be cool to support <=> operator here

right now it's planned as cross (technically) join with filter, but in fact this query may be a hash join with null equals null. And specifying join conditions like this (via is not distinct or spaceship operator) potentially more convenient to control join behaviour (one query statement instead of set datafusion.xxx = true + query statement)

UPD:

MySQL example ```sql select version(); with data as ( select 1 as f union all select 2 as f union all select null as f ) select * from data t1 join data t2 on t1.f <=> t2.f; +-----------+ | version() | +-----------+ | 8.0.27 | +-----------+ +------+------+ | f | f | +------+------+ | 1 | 1 | | 2 | 2 | | NULL | NULL | +------+------+ ```
korowa commented 1 month ago

And in the end this field, probably, is not needed (in its current form at least) as it won't help to handle queries like

JOIN ... ON left.x = right.x AND left.y <=> right.y
berkaysynnada commented 1 month ago

That makes sense. If such usages exist, there's no need to expose a new flag. It would be great if we could handle these two versions and link them to our internal join flags.