apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.91k stars 1.12k forks source link

feat: Support `unnest` cross join with table in `FROM` clause #9394

Closed jayzhan211 closed 2 months ago

jayzhan211 commented 6 months ago

The syntax is like select * from unnest(), unnest(), table;

postgres=# select * from unnest(ARRAY[1,2,3]), unnest(ARRAY[4,5]) as b2, t1;
 unnest | b2 |    a    
--------+----+---------
      1 |  4 | {1,2,3}
      2 |  4 | {1,2,3}
      3 |  4 | {1,2,3}
      1 |  5 | {1,2,3}
      2 |  5 | {1,2,3}
      3 |  5 | {1,2,3}
      1 |  4 | {4,5}
      2 |  4 | {4,5}
      3 |  4 | {4,5}
      1 |  5 | {4,5}
      2 |  5 | {4,5}
      3 |  5 | {4,5}
(12 rows)

t1 is table with 2 rows ([1,2,3], [4,5])

Similar for Duckdb, a is [1,2,3] and [10, 20]

D select * from unnest([1,2,3]), unnest([4,5]) as b2, t1;
┌──────────────────────────┬───────────────────────┬───────────┐
│ main.list_value(1, 2, 3) │ main.list_value(4, 5) │     a     │
│          int32           │         int32         │  int32[]  │
├──────────────────────────┼───────────────────────┼───────────┤
│                        1 │                     4 │ [1, 2, 3] │
│                        2 │                     4 │ [1, 2, 3] │
│                        3 │                     4 │ [1, 2, 3] │
│                        1 │                     5 │ [1, 2, 3] │
│                        2 │                     5 │ [1, 2, 3] │
│                        3 │                     5 │ [1, 2, 3] │
│                        1 │                     4 │ [10, 20]  │
│                        2 │                     4 │ [10, 20]  │
│                        3 │                     4 │ [10, 20]  │
│                        1 │                     5 │ [10, 20]  │
│                        2 │                     5 │ [10, 20]  │
│                        3 │                     5 │ [10, 20]  │
├──────────────────────────┴───────────────────────┴───────────┤
│ 12 rows                                            3 columns │
└──────────────────────────────────────────────────────────────┘

t1 is table with 2 rows ([1,2,3], [10, 20])

In datafusion, we have the similar result but the ordering is different. I think we should understand the reason of the different ordering in cross join.

DataFusion CLI v36.0.0
❯ create table t1 as values([1,2,3]), ([10, 20]);
0 rows in set. Query took 0.010 seconds.

❯ select * from unnest([1,2,3]), unnest([4,5]) as b2, t1;
+----------------------------------------+-------------------------------+-----------+
| make_array(Int64(1),Int64(2),Int64(3)) | make_array(Int64(4),Int64(5)) | column1   |
+----------------------------------------+-------------------------------+-----------+
| 1                                      | 4                             | [1, 2, 3] |
| 1                                      | 4                             | [10, 20]  |
| 1                                      | 5                             | [1, 2, 3] |
| 1                                      | 5                             | [10, 20]  |
| 2                                      | 4                             | [1, 2, 3] |
| 2                                      | 4                             | [10, 20]  |
| 2                                      | 5                             | [1, 2, 3] |
| 2                                      | 5                             | [10, 20]  |
| 3                                      | 4                             | [1, 2, 3] |
| 3                                      | 4                             | [10, 20]  |
| 3                                      | 5                             | [1, 2, 3] |
| 3                                      | 5                             | [10, 20]  |
+----------------------------------------+-------------------------------+-----------+
12 rows in set. Query took 0.014 seconds.

t1 is table with 2 rows ([1,2,3], [10,20])

We can see that the ordering is different only for UnnestExec, it implies that crossjoin is probably doing the different things for UnnestExec and MemoryExec

statement ok
create table t1 as values
  ([1,2,3], 1),
  ([4,5,6], 2)
;

statement ok
create table t2 as values
  (10),
  (20)
;

query ?II
select * from t1 cross join t2;
----
[1, 2, 3] 1 10
[4, 5, 6] 2 10
[1, 2, 3] 1 20
[4, 5, 6] 2 20

query TT
explain select * from t1 cross join t2;
----
logical_plan
CrossJoin:
--TableScan: t1 projection=[column1, column2]
--TableScan: t2 projection=[column1]
physical_plan
ProjectionExec: expr=[column1@1 as column1, column2@2 as column2, column1@0 as column1]
--CrossJoinExec
----MemoryExec: partitions=1, partition_sizes=[1]
----MemoryExec: partitions=1, partition_sizes=[1]

query II
select * from unnest([1,2,3]) as b2, t2;
----
1 10
1 20
2 10
2 20
3 10
3 20

query TT
explain select * from unnest([1,2,3]) as b2, t2;
----
logical_plan
CrossJoin:
--SubqueryAlias: b2
----Unnest: make_array(Int64(1),Int64(2),Int64(3))
------Projection: List([1, 2, 3]) AS make_array(Int64(1),Int64(2),Int64(3))
--------EmptyRelation
--TableScan: t2 projection=[column1]
physical_plan
CrossJoinExec
--UnnestExec
----ProjectionExec: expr=[[1, 2, 3] as make_array(Int64(1),Int64(2),Int64(3))]
------PlaceholderRowExec
--MemoryExec: partitions=1, partition_sizes=[1]

_Originally posted by @jayzhan211 in https://github.com/apache/arrow-datafusion/pull/9355#discussion_r1505255606_

jonahgao commented 6 months ago

I think this might be caused by join reorder, execute the following two queries in DataFusion

DataFusion CLI v36.0.0
❯ create table a1 as values
  ([1,2,3], 1),
  ([4,5,6], 2);

❯ create table a2 as values 
  (1),  
  (2);

❯ create table b as values (10), (20), (30);

❯ select * from a1,b;
+-----------+---------+---------+
| column1   | column2 | column1 |
+-----------+---------+---------+
| [1, 2, 3] | 1       | 10      |
| [4, 5, 6] | 2       | 10      |
| [1, 2, 3] | 1       | 20      |
| [4, 5, 6] | 2       | 20      |
| [1, 2, 3] | 1       | 30      |
| [4, 5, 6] | 2       | 30      |
+-----------+---------+---------+
6 rows in set. Query took 0.006 seconds.

❯ select * from a2,b;
+---------+---------+
| column1 | column1 |
+---------+---------+
| 1       | 10      |
| 1       | 20      |
| 1       | 30      |
| 2       | 10      |
| 2       | 20      |
| 2       | 30      |
+---------+---------+
6 rows in set. Query took 0.006 seconds.

The result of query select * from a2,b is the same to DuckDB, but query select * from a1,b is different. This is because query select * from a1,b has been reordered, and we can confirm it with explain verbose. Differentiate the left and right tables by the number of rows in statistics.

This reordering behavior is triggered by the bytes size of tables. https://github.com/apache/arrow-datafusion/blob/ca37ce37933f7874d404364cb8c23438baceb46d/datafusion/core/src/physical_optimizer/join_selection.rs#L73-L77

Because we need to load the left side into memory, this behavior should be reasonable.

Additionally, I've also noticed that there is a TODO here, which might be helpful.

// TODO: In PrestoSQL, the optimizer flips join sides only if one side is much smaller than the other by more than SIZE_DIFFERENCE_THRESHOLD times, by default is is 8 times.

duongcongtoai commented 2 months ago

take

duongcongtoai commented 2 months ago

I took a look at Postgres cross join behavior, it looks like the order of the result is not deterministic either (meaning it feel free to decide which table is small and should be materialized inside memory, and the order of the output will be based on this decision)

Example: create table_a with 3 rows table_b and table_b_small with the same schema, table_b with 10k rows and table_b_small with 2 rows

CREATE TABLE table_a (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50)
);
CREATE TABLE table_b (
    id SERIAL PRIMARY KEY,
    description VARCHAR(50)
);
CREATE TABLE table_b_small (
    id SERIAL PRIMARY KEY,
    description VARCHAR(50)
);
INSERT INTO table_a (name) VALUES
('Alice'),
('Bob'),
('Charlie');

INSERT INTO table_b_small (description) VALUES
('Description 1'),
('Description 2');

DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        INSERT INTO table_b (description)
        VALUES ('Description ' || i);
    END LOOP;
END $$;
postgres=# select a.name, b.description from table_a as a, table_b as b;
  name   |    description    
---------+-------------------
 Alice   | Description 1
 Bob     | Description 1
 Charlie | Description 1
 Alice   | Description 2
 Bob     | Description 2
 Charlie | Description 2

postgres=# select a.name, b.description from table_a as a, table_b_small as b;
  name   |  description  
---------+---------------
 Alice   | Description 1
 Alice   | Description 2
 Bob     | Description 1
 Bob     | Description 2
 Charlie | Description 1
 Charlie | Description 2

Query plans

postgres=# explain select a.name, b.description from table_a as a, table_b as b;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Nested Loop  (cost=0.00..67694.27 rows=5401080 width=134)
   ->  Seq Scan on table_b b  (cost=0.00..164.02 rows=10002 width=16)
   ->  Materialize  (cost=0.00..18.10 rows=540 width=118)
         ->  Seq Scan on table_a a  (cost=0.00..15.40 rows=540 width=118)
(4 rows)

postgres=# explain select a.name, b.description from table_a as a, table_b_small as b;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3677.15 rows=291600 width=236)
   ->  Seq Scan on table_a a  (cost=0.00..15.40 rows=540 width=118)
   ->  Materialize  (cost=0.00..18.10 rows=540 width=118)
         ->  Seq Scan on table_b_small b  (cost=0.00..15.40 rows=540 width=118)
(4 rows)
jonahgao commented 2 months ago

I took a look at Postgres cross join behavior, it looks like the order of the result is not deterministic either (meaning it feel free to decide which table is small and should be materialized inside memory, and the order of the output will be based on this decision)

@duongcongtoai Yes. Order itself is not an issue. A relation is an unordered set. "Two relational algebra expressions are said to be equivalent if on every legal database instance the two expressions generate the same set of tuples."

and as stated in the PostgreSQL documentation:

If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on.

jonahgao commented 2 months ago

Let's close this issue for now; we can reopen it if there are any other problems.