StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.54k stars 1.73k forks source link

Add PushDownAggregationBelowUnionRule #46060

Open satanson opened 3 months ago

satanson commented 3 months ago

For tpcds-query76, AggregateOperator can be push down below UnionAll operator.

-- query 76
select  channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt FROM (
        SELECT 'store' as channel, 'ss_store_sk' col_name, d_year, d_qoy, i_category, ss_ext_sales_price ext_sales_price
         FROM store_sales, item, date_dim
         WHERE ss_store_sk IS NULL
           AND ss_sold_date_sk=d_date_sk
           AND ss_item_sk=i_item_sk
        UNION ALL
        SELECT 'web' as channel, 'ws_ship_customer_sk' col_name, d_year, d_qoy, i_category, ws_ext_sales_price ext_sales_price
         FROM web_sales, item, date_dim
         WHERE ws_ship_customer_sk IS NULL
           AND ws_sold_date_sk=d_date_sk
           AND ws_item_sk=i_item_sk
        UNION ALL
        SELECT 'catalog' as channel, 'cs_ship_addr_sk' col_name, d_year, d_qoy, i_category, cs_ext_sales_price ext_sales_price
         FROM catalog_sales, item, date_dim
         WHERE cs_ship_addr_sk IS NULL
           AND cs_sold_date_sk=d_date_sk
           AND cs_item_sk=i_item_sk) foo
GROUP BY channel, col_name, d_year, d_qoy, i_category
ORDER BY channel, col_name, d_year, d_qoy, i_category
limit 100;

After rewrite, the equivalent query would be

select  channel, col_name, d_year, d_qoy, i_category, sales_cnt,sales_amt FROM (
        SELECT 'store' as channel, 'ss_store_sk' col_name, d_year, d_qoy, i_category, count(*) as sales_cnt,  SUM(ss_ext_sales_price) as  sales_amt
         FROM store_sales, item, date_dim
         WHERE ss_store_sk IS NULL
           AND ss_sold_date_sk=d_date_sk
           AND ss_item_sk=i_item_sk
        GROUP BY d_year, d_qoy, i_category   
        UNION ALL
        SELECT 'web' as channel, 'ws_ship_customer_sk' col_name, d_year, d_qoy, i_category, count(*) as sales_cnt,  SUM(ws_ext_sales_price) as  sales_amt
         FROM web_sales, item, date_dim
         WHERE ws_ship_customer_sk IS NULL
           AND ws_sold_date_sk=d_date_sk
           AND ws_item_sk=i_item_sk
        GROUP BY d_year, d_qoy, i_category    
        UNION ALL
        SELECT 'catalog' as channel, 'cs_ship_addr_sk' col_name, d_year, d_qoy, i_category, count(*) as sales_cnt,  SUM(cs_ext_sales_price) as  sales_amt
         FROM catalog_sales, item, date_dim
         WHERE cs_ship_addr_sk IS NULL
           AND cs_sold_date_sk=d_date_sk
           AND cs_item_sk=i_item_sk
        GROUP BY d_year, d_qoy, i_category
        ) foo
ORDER BY channel, col_name, d_year, d_qoy, i_category
limit 100;

since each union all branch has been added extra columns(channel,col_name) with different values, and this extra columns are present in group-by clause. so aggregation can be pushed down.

This optimization should be a RBO rule in pre-enter-memo phase since it seems always beneficial.

Additionally, after Aggregation push down below union all, we can build a simple SPJG MV to accelerate this query.

Although SR optimizer already has a rule named PushDownAggregateRule can pushdown agg, however, PushDownAggregationBelowUnionRule has distinctions:

  1. Agg should be just placed directly below UnionAll operator, it should not be push down deeply(across JOIN operator);
  2. After rewriting, Agg above UnionAll should be eliminated;
  3. constant values in group-by clause should be eliminated.
FLAYhhh commented 1 month ago

Hi, I would like to take up this issue. Could you please assign it to me?

SoraNimi commented 2 weeks ago

@wangsimo0 @kateshaowanjou can i pick up this issue?