apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.3k stars 1.19k forks source link

EXCEPT ALL returns the wrong number of records #12956

Open vbarua opened 1 month ago

vbarua commented 1 month ago

Describe the bug

According to the SQL spec, when handling EXCEPT ALL the number of copies returned of a given record is the maximum of 0 OR the number of copies in the LHS minus the RHS.

Specifically:

            b) If a set operator is specified, then the result of applying
              the set operator is a table containing the following rows:

              i) Let R be a row that is a duplicate of some row in T1 or of
                 some row in T2 or both. Let m be the number of duplicates
                 of R in T1 and let n be the number of duplicates of R in
                 T2, where m � 0 and n � 0.

            ...            

            iii) If ALL is specified, then

                 Case:

                 1) If UNION is specified, then the number of duplicates of
                   R that T contains is (m + n).

                 2) If EXCEPT is specified, then the number of duplicates of
                   R that T contains is the maximum of (m - n) and 0.

                 3) If INTERSECT is specified, then the number of duplicates
                   of R that T contains is the minimum of m and n.

DataFusion currently removes all copies of a record if it is present in the RHS.

To Reproduce

The following query

➜  ~ datafusion-cli
DataFusion CLI v42.0.0
> SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
EXCEPT ALL
SELECT * FROM VALUES ('b'), ('c');
+---------+
| column1 |
+---------+
| a       |
+---------+

returns 0 copies of ('b') and ('c') which does not match the behaviour from the spec.

Expected behavior

According to the SQL spec there should be 1 copy of ('b') and 2 copies of ('c')

Additional context

See DB Fiddle for Postgres, which showcases the expected results https://www.db-fiddle.com/f/ja4BG5CfyEvak5ScoBwCZr/1

jcsherin commented 1 month ago
DataFusion CLI v42.0.0
> EXPLAIN SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
EXCEPT ALL
SELECT * FROM VALUES ('b'), ('b'), ('b'), ('c'), ('c');
+---------------+-----------------------------------------------------------------------------------+
| plan_type     | plan                                                                              |
+---------------+-----------------------------------------------------------------------------------+
| logical_plan  | LeftAnti Join: column1 = column1                                                  |
|               |   Values: (Utf8("a")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))...      |
|               |   Values: (Utf8("b")), (Utf8("b")), (Utf8("b")), (Utf8("c")), (Utf8("c"))         |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                       |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(column1@0, column1@0)] |
|               |     ValuesExec                                                                    |
|               |     ValuesExec                                                                    |
|               |                                                                                   |
+---------------+-----------------------------------------------------------------------------------+

Here the query generates a left anti-join. So it will always exclude rows which match in RHS.