Open vbarua opened 1 month ago
DataFusion CLI v42.0.0
> EXPLAIN SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
EXCEPT ALL
SELECT * FROM VALUES ('b'), ('b'), ('b'), ('c'), ('c');
+---------------+-----------------------------------------------------------------------------------+
| plan_type     | plan                                                                              |
+---------------+-----------------------------------------------------------------------------------+
| logical_plan  | LeftAnti Join: column1 = column1                                                  |
|               |   Values: (Utf8("a")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))...      |
|               |   Values: (Utf8("b")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))         |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                       |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(column1@0, column1@0)] |
|               |     ValuesExec                                                                    |
|               |     ValuesExec                                                                    |
|               |                                                                                   |
+---------------+-----------------------------------------------------------------------------------+
Here the query is planned as a left anti join, so every LHS row that has a match in the RHS is excluded, no matter how many copies of that row each side contains.
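To make the effect of that plan concrete, here is a minimal Python sketch of left anti join semantics applied to the two VALUES lists from the EXPLAIN query. The name `left_anti_join` is illustrative and not part of DataFusion; the sketch models only the semantics, not the hash join implementation.

```python
def left_anti_join(lhs, rhs):
    """Keep each LHS row whose key has no match at all in the RHS."""
    rhs_keys = set(rhs)  # duplicate counts on the RHS are discarded here
    return [row for row in lhs if row not in rhs_keys]


# The two VALUES lists from the EXPLAIN query.
lhs = ["a", "b", "b", "c", "c", "c"]
rhs = ["b", "b", "b", "c", "c"]

anti_result = left_anti_join(lhs, rhs)
print(anti_result)  # ['a']
```

Only 'a' survives: the third 'c' on the LHS is dropped even though the RHS holds just two copies of it.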
Describe the bug
According to the SQL spec, when handling EXCEPT ALL the number of copies of a given record in the result is the maximum of 0 and the number of copies in the LHS minus the number of copies in the RHS.
Specifically:
DataFusion currently removes all copies of a record if it is present in the RHS.
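The counting rule can be sketched with Python's `collections.Counter`, whose `-` operator subtracts per-key counts and drops any key whose count falls to zero or below. This is a model of the semantics only, not of DataFusion's planner; `except_all` is an illustrative name, and the inputs are the VALUES lists from the EXPLAIN query earlier in this issue.

```python
from collections import Counter


def except_all(lhs, rhs):
    """Multiset difference: each row appears max(0, lhs_count - rhs_count) times."""
    remaining = Counter(lhs) - Counter(rhs)  # non-positive counts are dropped
    return sorted(remaining.elements())


# The two VALUES lists from the EXPLAIN query.
lhs = ["a", "b", "b", "c", "c", "c"]
rhs = ["b", "b", "b", "c", "c"]

spec_result = except_all(lhs, rhs)
print(spec_result)  # ['a', 'c']
```

For these inputs the counting rule keeps one 'a' and one 'c' (3 copies minus 2), while 'b' disappears because 2 minus 3 is clamped to 0. A plan that removes every matched row would return only 'a'.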
To Reproduce
The following query
returns 0 copies of ('b') and ('c'), which does not match the behaviour from the spec.
Expected behavior
According to the SQL spec there should be 1 copy of ('b') and 2 copies of ('c').
Additional context
See DB Fiddle for Postgres, which showcases the expected results https://www.db-fiddle.com/f/ja4BG5CfyEvak5ScoBwCZr/1