apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

Null-Aware Anti Join Support #4211

Open mingmwang opened 1 year ago

mingmwang commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] (This section helps Arrow developers understand the context and why for this feature, in addition to the what)

For Anti joins, null key values in both sides need to be carefully treated. http://structureddata.org/2008/05/22/null-aware-anti-join/

When we have an Anti Join(Left Anti or Right Anti), like A Left Anti Join B on (A.id = B.id), the real join conditions should be A Left Anti Join B on ((Or(A.id = B.id), IsNull(A.id = B.id))), this can not be executed as HashJoins directly.

For left Anti Join, if the right side contains Null values, should return Empty Relation For right Anti Join, if the left side contains Null values, should return Empty Relation. For HashJoin, it is relatively easy to implement such logic when the partition mode is CollectLeft, there is challenge when the partition mode is Partitioned.

Section 6: Null-Aware Anti Join (NAAJ) http://www.vldb.org/pvldb/vol2/vldb09-423.pdf

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

mingmwang commented 1 year ago

@alamb @jackwener @liukun4515

Please share your thoughts.

alamb commented 1 year ago

I will respond tomorrow -- also cc @Dandandan who may be interested

mingmwang commented 1 year ago

For the Anti Joins converted from EXCEPT clause, our implement is correct.
For the Anti Joins converted from NOT EXISTS, our implement should be correct, need to confirm. For the Anti Joins converted from NOT IN, we have to handle Null cases as NAAJ.

alamb commented 1 year ago

It would not surprise me at all if there are bugs handling ANTI joins in DataFusion. As I recall there is a bunch of subtlety regarding nulls when there are multi-column equijoins as well and there are nulls in some but not all columns 🤔

My recommendation to make improvements in this area is start by writing tests (perhaps borrow some / all of the duckdb tests): https://github.com/duckdb/duckdb/tree/master/test/sql/join to figure what, if anything, we are missing

It would be nice to have a proper data driven test runner -- I will write up a ticket and see if I can get anyone excited about this 🤔

alamb commented 1 year ago

Wrote up my dream testing tool: https://github.com/apache/arrow-datafusion/issues/4248

liukun4515 commented 1 year ago

For the Anti Joins converted from EXCEPT clause, our implement is correct. For the Anti Joins converted from NOT EXISTS, our implement should be correct, need to confirm. For the Anti Joins converted from NOT IN, we have to handle Null cases as NAAJ.

I find the Join in the datafusion with a attribute

    /// If null_equals_null is true, null == null else null != null
    pub null_equals_null: bool,

, can this resolve the Null-Aware in the HASH-JOIN?

mingmwang commented 1 year ago

For the Anti Joins converted from EXCEPT clause, our implement is correct. For the Anti Joins converted from NOT EXISTS, our implement should be correct, need to confirm. For the Anti Joins converted from NOT IN, we have to handle Null cases as NAAJ.

I find the Join in the datafusion with a attribute

    /// If null_equals_null is true, null == null else null != null
    pub null_equals_null: bool,

, can this resolve the Null-Aware in the HASH-JOIN?

No.