databendlabs / databend

𝗗𝗮𝘁𝗮, 𝗔𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 & 𝗔𝗜. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other
7.81k stars 743 forks source link

bug: could't run Query 3.x and 4.x in ssb test #7168

Open FANNG1 opened 2 years ago

FANNG1 commented 2 years ago

Search before asking

Version

a6b27a679111b16367a05a3ff6d7ce269f8ea408

What's Wrong?

in ssb test(https://clickhouse.com/docs/en/getting-started/example-datasets/star-schema/), some sql could't finish in server hours.

task Q4.1 for example:

select
    d_year, c_nation,
    sum(lo_revenue - lo_supplycost) as profit
from
    dates, customer, supplier, part, lineorder
where
    lo_custkey = c_custkey
    and lo_suppkey = s_suppkey
    and lo_partkey = p_partkey
    and lo_orderdate = d_datekey
    and c_region = 'AMERICA'
    and s_region = 'AMERICA'
    and (p_mfgr = 'MFGR#1')
group by
    d_year, c_nation
order by
    d_year, c_nation;

explain result:

----------------------------------------------------------------------------------------------------------------------------------------------+
| Project: [d_year (#4),c_nation (#21),profit (#58)]                                                                                                                                                                                                                                                            |
| └── EvalScalar: [sum(lo_revenue - lo_supplycost) (#61)]                                                                                                                                                                                                                                                       |
|     └── Sort: [d_year (#4) ASC, c_nation (#21) ASC]                                                                                                                                                                                                                                                           |
|         └── Aggregate(Final): group items: [dates.d_year (#4), customer.c_nation (#21)], aggregate functions: [sum(lo_revenue - lo_supplycost)]                                                                                                                                                               |
|             └── Aggregate(Partial): group items: [dates.d_year (#4), customer.c_nation (#21)], aggregate functions: [sum(lo_revenue - lo_supplycost)]                                                                                                                                                         |
|                 └── EvalScalar: [-(lineorder.lo_revenue (#53), lineorder.lo_supplycost (#54))]                                                                                                                                                                                                                |
|                     └── HashJoin: INNER, build keys: [lineorder.lo_custkey (#43), lineorder.lo_suppkey (#45), lineorder.lo_partkey (#44), lineorder.lo_orderdate (#46)], probe keys: [customer.c_custkey (#17), supplier.s_suppkey (#25), part.p_partkey (#32), dates.d_datekey (#0)], join filters: []       |
|                         ├── CrossJoin                                                                                                                                                                                                                                                                         |
|                         │   ├── CrossJoin                                                                                                                                                                                                                                                                     |
|                         │   │   ├── CrossJoin                                                                                                                                                                                                                                                                 |
|                         │   │   │   ├── Scan: hive.ssb_100f.dates, filters: []                                                                                                                                                                                                                                |
|                         │   │   │   └── Filter: [customer.c_region (#22) = AMERICA]                                                                                                                                                                                                                           |
|                         │   │   │       └── Scan: hive.ssb_100f.customer, filters: [customer.c_region (#22) = AMERICA]                                                                                                                                                                                        |
|                         │   │   └── Filter: [supplier.s_region (#30) = AMERICA]                                                                                                                                                                                                                               |
|                         │   │       └── Scan: hive.ssb_100f.supplier, filters: [supplier.s_region (#30) = AMERICA]                                                                                                                                                                                            |
|                         │   └── Filter: [part.p_mfgr (#34) = MFGR#1]                                                                                                                                                                                                                                          |
|                         │       └── Scan: hive.ssb_100f.part, filters: [part.p_mfgr (#34) = MFGR#1]                                                                                                                                                                                                           |
|                         └── Scan: hive.ssb_100f.lineorder, filters: []                                                                                                                                                                                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

explain pipeline output:

+-------------------------------------------------------------+
| explain                                                     |
+-------------------------------------------------------------+
| Project × 1 processor                                       |                                                                                           [0/1811]
|   Expression × 1 processor                                  |
|     SortMergeTransform × 1 processor                        |
|       SortMergeTransform × 1 processor                      |
|         SortPartialTransform × 1 processor                  |
|           GroupByFinalTransform × 1 processor               |
|             GroupByPartialTransform × 1 processor           |
|               Expression × 1 processor                      |
|                 HashJoin × 1 processor                      |
|                   HashJoin × 1 processor                    |
|                     HashJoin × 1 processor                  |
|                       HashJoin × 1 processor                |
|                         Rename × 1 processor                |
|                           Project × 1 processor             |
|                             HiveEngineSource × 1 processor  |
|                                                             |
| BuildHashTable × 12 processors                              |
|   Rename × 12 processors                                    |
|     Project × 12 processors                                 |
|       HiveEngineSource × 12 processors                      |
|                                                             |
| BuildHashTable × 12 processors                              |
|   Filter × 12 processors                                    |
|     Rename × 12 processors                                  |
|       Project × 12 processors                               |
|         HiveEngineSource × 12 processors                    |
|                                                             |
| BuildHashTable × 9 processors                               |
|   Filter × 9 processors                                     |
|     Rename × 9 processors                                   |
|       Project × 9 processors                                |
|         HiveEngineSource × 9 processors                     |
|                                                             |
| BuildHashTable × 12 processors                              |
|   Filter × 12 processors                                    |
|     Rename × 12 processors                                  |
|       Project × 12 processors                               |
|         HiveEngineSource × 12 processors                    |
+-------------------------------------------------------------+

How to Reproduce?

1, generate ssb dataset with factor 100 2, run ssb queries, 3.x or 4.x

Are you willing to submit PR?

FANNG1 commented 2 years ago

without CBO, we could't do join reorder and choose join type (shuffle or broadcast), besides this, there seems other problems:

  1. use inter join not cross join
  2. use more than one processor to process hash join even if there're just one HiveEngineSource processor ?
BohuTANG commented 2 years ago

I think @leiysky and @zhang2014 have some suggestions before we have CBO.

leiysky commented 2 years ago

You can try to put the largest table to the left.

For this case I think the largest table is lineorder, you can rewrite the query as:

select
    d_year, c_nation,
    sum(lo_revenue - lo_supplycost) as profit
from
     lineorder, dates, customer, supplier, part
where
    lo_custkey = c_custkey
    and lo_suppkey = s_suppkey
    and lo_partkey = p_partkey
    and lo_orderdate = d_datekey
    and c_region = 'AMERICA'
    and s_region = 'AMERICA'
    and (p_mfgr = 'MFGR#1')
group by
    d_year, c_nation
order by
    d_year, c_nation;
FANNG1 commented 2 years ago

yes , we may run pass the query by changing join order manually, but is this behavior expected?

  1. using cross join
  2. the concurrent keeps one after a join with concurrent one.
leiysky commented 2 years ago

yes , we may run pass the query by changing join order manually, but is this behavior expected?

I have no idea.

Would you please provide the new explain result?

FANNG1 commented 2 years ago

let's talk about it tomorrow :)

leiysky commented 2 years ago

cc @xudong963