StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.54k stars 1.73k forks source link

sum aggregation functions in MV rollup rewritting support complex expr instead of only column-ref.(tpcds-query49, 66) #38815

Closed satanson closed 1 month ago

satanson commented 7 months ago

sum aggregation functions in MV rollup rewritting support complex expr instead of only column-ref(tpcds-query49, 66)

Enhancement

report-version: main-67fc7a1

dataset: tpcds1g

reproduce steps(tpcds-query49):

  1. create MV:
    
    CREATE MATERIALIZED VIEW __mv__ta0012 (_ca0008, _ca0009, _ca0010, _ca0011, d_year, d_moy, wr_return_amt, ws_quantity, ws_net_paid, ws_net_profit, ws_item_sk)
    DISTRIBUTED BY HASH (d_year, d_moy, wr_return_amt, ws_quantity, ws_net_paid, ws_net_profit, ws_item_sk)
    REFRESH ASYNC START("2023-12-01 10:00:00") EVERY(INTERVAL 1 DAY)
    PROPERTIES (
    "replicated_storage" = "true",
    "replication_num" = "1",
    "storage_medium" = "HDD"
    )
    AS
    SELECT
    (sum(_ta0001._ca0002)) AS _ca0008
    ,(sum(_ta0001._ca0003)) AS _ca0009
    ,(sum(_ta0001._ca0004)) AS _ca0010
    ,(sum(_ta0001._ca0005)) AS _ca0011
    ,_ta0001.d_year
    ,_ta0001.d_moy
    ,_ta0001.wr_return_amt
    ,_ta0001.ws_quantity
    ,_ta0001.ws_net_paid
    ,_ta0001.ws_net_profit
    ,_ta0001.ws_item_sk
    FROM
    (
    SELECT
      web_sales.ws_item_sk
      ,(coalesce(web_returns.wr_return_quantity, 0)) AS _ca0002
      ,(coalesce(web_sales.ws_quantity, 0)) AS _ca0003
      ,(coalesce(web_returns.wr_return_amt, 0.00)) AS _ca0004
      ,(coalesce(web_sales.ws_net_paid, 0.00)) AS _ca0005
      ,date_dim.d_year
      ,date_dim.d_moy
      ,web_returns.wr_return_amt
      ,web_sales.ws_quantity
      ,web_sales.ws_net_paid
      ,web_sales.ws_net_profit
    FROM
      web_sales
      INNER JOIN
      web_returns
      ON (web_sales.ws_item_sk = web_returns.wr_item_sk)
         AND (web_sales.ws_order_number = web_returns.wr_order_number)
      INNER JOIN
      date_dim
      ON (web_sales.ws_sold_date_sk = date_dim.d_date_sk)
    ) _ta0001
    GROUP BY
    _ta0001.d_year
    , _ta0001.d_moy
    , _ta0001.wr_return_amt
    , _ta0001.ws_quantity
    , _ta0001.ws_net_paid
    , _ta0001.ws_net_profit
    , _ta0001.ws_item_sk;

2. explain costs  tpcds-query49.sql

EXPLAIN costs -- query 49 select channel, item, return_ratio, return_rank, currency_rank from (select 'web' as channel ,web.item ,web.return_ratio ,web.return_rank ,web.currency_rank from ( select item ,return_ratio ,currency_ratio ,rank() over (order by return_ratio) as return_rank ,rank() over (order by currency_ratio) as currency_rank from ( select ws.ws_item_sk as item ,(cast(sum(coalesce(wr.wr_return_quantity,0)) as decimal(15,4))/ cast(sum(coalesce(ws.ws_quantity,0)) as decimal(15,4) )) as return_ratio ,(cast(sum(coalesce(wr.wr_return_amt,0)) as decimal(15,4))/ cast(sum(coalesce(ws.ws_net_paid,0)) as decimal(15,4) )) as currency_ratio from web_sales ws left outer join web_returns wr on (ws.ws_order_number = wr.wr_order_number and ws.ws_item_sk = wr.wr_item_sk) ,date_dim where wr.wr_return_amt > 10000 and ws.ws_net_profit > 1 and ws.ws_net_paid > 0 and ws.ws_quantity > 0 and ws_sold_date_sk = d_date_sk and d_year = 2001 and d_moy = 12 group by ws.ws_item_sk ) in_web ) web where ( web.return_rank <= 10 or web.currency_rank <= 10 ) union select 'catalog' as channel ,catalog.item ,catalog.return_ratio ,catalog.return_rank ,catalog.currency_rank from ( select item ,return_ratio ,currency_ratio ,rank() over (order by return_ratio) as return_rank ,rank() over (order by currency_ratio) as currency_rank from ( select cs.cs_item_sk as item ,(cast(sum(coalesce(cr.cr_return_quantity,0)) as decimal(15,4))/ cast(sum(coalesce(cs.cs_quantity,0)) as decimal(15,4) )) as return_ratio ,(cast(sum(coalesce(cr.cr_return_amount,0)) as decimal(15,4))/ cast(sum(coalesce(cs.cs_net_paid,0)) as decimal(15,4) )) as currency_ratio from catalog_sales cs left outer join catalog_returns cr on (cs.cs_order_number = cr.cr_order_number and cs.cs_item_sk = cr.cr_item_sk) ,date_dim where cr.cr_return_amount > 10000 and cs.cs_net_profit > 1 and cs.cs_net_paid > 0 and cs.cs_quantity > 0 and cs_sold_date_sk = d_date_sk and d_year = 2001 and d_moy = 12 group by cs.cs_item_sk ) in_cat ) catalog where ( catalog.return_rank <= 10 or catalog.currency_rank <=10 ) union select 'store' as channel ,store.item ,store.return_ratio ,store.return_rank ,store.currency_rank from ( select item ,return_ratio ,currency_ratio ,rank() over (order by return_ratio) as return_rank ,rank() over (order by currency_ratio) as currency_rank from ( select sts.ss_item_sk as item ,(cast(sum(coalesce(sr.sr_return_quantity,0)) as decimal(15,4))/cast(sum(coalesce(sts.ss_quantity,0)) as decimal(15,4) )) as return_ratio ,(cast(sum(coalesce(sr.sr_return_amt,0)) as decimal(15,4))/cast(sum(coalesce(sts.ss_net_paid,0)) as decimal(15,4) )) as currency_ratio from store_sales sts left outer join store_returns sr on (sts.ss_ticket_number = sr.sr_ticket_number and sts.ss_item_sk = sr.sr_item_sk) ,date_dim where sr.sr_return_amt > 10000 and sts.ss_net_profit > 1 and sts.ss_net_paid > 0 and sts.ss_quantity > 0 and ss_sold_date_sk = d_date_sk and d_year = 2001 and d_moy = 12 group by sts.ss_item_sk ) in_store ) store where ( store.return_rank <= 10 or store.currency_rank <= 10 ) ) t1 order by 1,4,5,2 limit 100;


3. use trace logs mv to find out the root cause of mv rewriting failure of query49's snippet.

trace logs mv select ws.ws_item_sk as item, sum(coalesce(wr.wr_return_quantity, 0)), sum(coalesce(ws.ws_quantity, 0)), sum(coalesce(wr.wr_return_amt, 0)), sum(coalesce(ws.ws_net_paid, 0)) from web_sales ws INNER JOIN web_returns wr ON (ws.ws_item_sk = wr.wr_item_sk) AND (ws.ws_order_number = wr.wr_order_number) INNER JOIN date_dim ON (ws.ws_sold_date_sk = date_dim.d_date_sk) group by ws.ws_item_sk;


error info: 

| 18ms| [MV TRACE] [REWRITE TF_MV_AGGREGATE_JOIN_RULE mvta0012] Construct 1 relation id mappings from query to mv | | 18ms| [MV TRACE] [REWRITE TF_MV_AGGREGATE_JOIN_RULE mvta0012] Rewrite aggregate sum(coalesce(50: wr_return_amt, 0.00)) failed: only column-ref is supported after rewrite | | 18ms| [MV TRACE] [REWRITE TF_MV_AGGREGATE_JOIN_RULE mvta0012] Rewrite aggregate sum(coalesce(50: wr_return_amt, 0.00)) failed: cannot get rollup aggregate



### How to setup StarRocks cluster and prepare tpcds-1g data
1. setup a cluster: https://docs.starrocks.io/docs/cover_pages/developers/
2. prepare data(use scale-factor=1 instead): https://docs.starrocks.io/docs/benchmarking/TPC_DS_Benchmark/
github-actions[bot] commented 1 month ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!