Open vbarua opened 4 days ago
@vbarua Thanks. This bug report is well written.
DataFusion CLI v42.0.0
> EXPLAIN SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
INTERSECT ALL
SELECT * FROM VALUES ('b'), ('b'), ('b'), ('c'), ('c');
+---------------+-----------------------------------------------------------------------------------+
| plan_type     | plan                                                                              |
+---------------+-----------------------------------------------------------------------------------+
| logical_plan  | LeftSemi Join: column1 = column1                                                  |
|               |   Values: (Utf8("a")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))...      |
|               |   Values: (Utf8("b")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))         |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                       |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(column1@0, column1@0)] |
|               |     ValuesExec                                                                    |
|               |     ValuesExec                                                                    |
|               |                                                                                   |
+---------------+-----------------------------------------------------------------------------------+
The query is planned as a left semi-join, which returns every LHS row that has at least one match on the RHS, regardless of how many matches there are. So whenever the RHS holds fewer copies of a value than the LHS, this query returns too many copies of it.
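To make the mismatch concrete, here is a minimal sketch in plain Rust. It is a toy model over string slices, not DataFusion code: `left_semi_join` and `intersect_all` are hypothetical helpers written only for this comment, and the second one simply encodes the "minimum number of copies" rule for INTERSECT ALL.

```rust
use std::collections::HashMap;

/// Toy model of the current plan: a LeftSemi join keeps every left row that
/// has at least one match on the right, however many matches exist.
fn left_semi_join(left: &[&str], right: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for l in left {
        if right.contains(l) {
            out.push(l.to_string());
        }
    }
    out
}

/// INTERSECT ALL semantics: a value that appears m times on the left and
/// n times on the right appears min(m, n) times in the result.
fn intersect_all(left: &[&str], right: &[&str]) -> Vec<String> {
    // Number of copies of each value still available on the right side.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for r in right {
        *remaining.entry(*r).or_insert(0) += 1;
    }

    let mut out = Vec::new();
    for l in left {
        if let Some(n) = remaining.get_mut(*l) {
            if *n > 0 {
                // Each emitted left row consumes one matching right row.
                *n -= 1;
                out.push(l.to_string());
            }
        }
    }
    out
}
```

With the two inputs from the query above, `left_semi_join` yields b, b, c, c, c, while `intersect_all` yields b, b, c, c.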
DataFusion CLI v42.0.0
> EXPLAIN SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
EXCEPT ALL
SELECT * FROM VALUES ('b'), ('b'), ('b'), ('c'), ('c');
+---------------+-----------------------------------------------------------------------------------+
| plan_type     | plan                                                                              |
+---------------+-----------------------------------------------------------------------------------+
| logical_plan  | LeftAnti Join: column1 = column1                                                  |
|               |   Values: (Utf8("a")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))...      |
|               |   Values: (Utf8("b")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))         |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                       |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(column1@0, column1@0)] |
|               |     ValuesExec                                                                    |
|               |     ValuesExec                                                                    |
|               |                                                                                   |
+---------------+-----------------------------------------------------------------------------------+
Here the query is planned as a left anti-join, so it drops every LHS row that has any match on the RHS, regardless of how many copies each side holds.
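The same kind of toy model shows the gap for EXCEPT ALL, where a value that appears m times on the left and n times on the right should appear max(m - n, 0) times in the result. Again, this is plain Rust over string slices, not DataFusion code, and `left_anti_join` and `except_all` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Toy model of the current plan: a LeftAnti join keeps only the left rows
/// that have no match at all on the right.
fn left_anti_join(left: &[&str], right: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for l in left {
        if !right.contains(l) {
            out.push(l.to_string());
        }
    }
    out
}

/// EXCEPT ALL semantics: each right row cancels at most one equal left row,
/// so max(m - n, 0) copies of a value survive.
fn except_all(left: &[&str], right: &[&str]) -> Vec<String> {
    // Number of copies of each value the right side can still cancel.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for r in right {
        *remaining.entry(*r).or_insert(0) += 1;
    }

    let mut out = Vec::new();
    for l in left {
        match remaining.get_mut(*l) {
            // A right copy is still available: cancel this left row.
            Some(n) if *n > 0 => *n -= 1,
            // Nothing left to cancel it: the row survives.
            _ => out.push(l.to_string()),
        }
    }
    out
}
```

With the two inputs from the query above, `left_anti_join` yields only a, while `except_all` yields a, c, because the left side has one more copy of c than the right.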
I think the big question here is whether this means that intersect (and except) need to have their own logical plan nodes, after all. The alternative iiuc is to introduce something like a row number expression in the logical plan which will be used in the join and dropped afterwards.
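For what it's worth, the row number idea can be sketched like this. The assumption is that each input gets an extra column equivalent to ROW_NUMBER() partitioned by all of its columns, the existing semi or anti join then runs on (value, row number), and the extra column is projected away afterwards. This is a std-only toy model, not a proposal for actual plan code, and every function name in it is hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Tag each row with its 1-based occurrence number among equal values,
/// i.e. what ROW_NUMBER() OVER (PARTITION BY <all columns>) would compute.
fn with_row_number<'a>(rows: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut seen: HashMap<&'a str, usize> = HashMap::new();
    rows.iter()
        .map(|r| {
            let n = seen.entry(*r).or_insert(0);
            *n += 1;
            (*r, *n)
        })
        .collect()
}

/// INTERSECT ALL as a LeftSemi join on (value, row number).
fn intersect_all_via_row_number<'a>(left: &[&'a str], right: &[&'a str]) -> Vec<String> {
    let right_keys: HashSet<(&'a str, usize)> = with_row_number(right).into_iter().collect();
    with_row_number(left)
        .into_iter()
        .filter(|key| right_keys.contains(key)) // semi join on both columns
        .map(|(value, _rn)| value.to_string()) // drop the helper column
        .collect()
}

/// EXCEPT ALL as a LeftAnti join on (value, row number).
fn except_all_via_row_number<'a>(left: &[&'a str], right: &[&'a str]) -> Vec<String> {
    let right_keys: HashSet<(&'a str, usize)> = with_row_number(right).into_iter().collect();
    with_row_number(left)
        .into_iter()
        .filter(|key| !right_keys.contains(key)) // anti join on both columns
        .map(|(value, _rn)| value.to_string()) // drop the helper column
        .collect()
}
```

The k-th copy of a value on the left only matches the k-th copy on the right, so the semi join keeps min(m, n) copies and the anti join keeps max(m - n, 0). For the inputs in the queries above that gives b, b, c, c and a, c respectively.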
> I think the big question here is whether this means that intersect (and except) need to have their own logical plan nodes, after all.
Maybe not 🤔. From the doc comments it seems to me like we can reuse LogicalPlan::Union for all set operators.
https://github.com/apache/datafusion/blob/636f43321acfd295096ad3ec45ef00595203f3f7/datafusion/expr/src/logical_plan/plan.rs#L230-L233
I am not sure I understand how Union could be used. To me it makes sense to represent them in the logical plan, as that would make it easier to control how they are optimized/translated into more primitive operations. There are probably multiple ways to translate them, and the preferable one might not be the same for a single-node vs. a distributed query engine.
Describe the bug
According to the SQL spec, when returning duplicate records from INTERSECT ALL the minimum number of copies from either input should be returned. Specifically:
DataFusion currently returns ALL copies of duplicated records from the RHS.
To Reproduce
The following query
returns 3 copies of the record ('c'), which does not match the expected behaviour based on the spec. Note that only 2 copies of ('b') are returned, so this only appears to affect the RHS.
Expected behavior
The above query should return 2 copies of the record ('c').
Additional context
See DB Fiddle for Postgres which showcases the expected behaviour: https://www.db-fiddle.com/f/ja4BG5CfyEvak5ScoBwCZr/0