StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.36k stars 1.69k forks source link

Case-when's cast folding causes that tpcds-query40 fails in mv rewriting. #38811

Open satanson opened 6 months ago

satanson commented 6 months ago

Enhancement

report-version: main-67fc7a1

dataset: tpcds1g

reproduce steps:

  1. create MV:

    CREATE MATERIALIZED VIEW __mv__ta0008 (_ca0006, _ca0007, d_date, i_current_price, w_state, i_item_id)
    DISTRIBUTED BY HASH (d_date, i_current_price, w_state, i_item_id)
    REFRESH ASYNC START("2023-12-01 10:00:00") EVERY(INTERVAL 1 DAY)
    PROPERTIES (
    "replicated_storage" = "true",
    "replication_num" = "1",
    "storage_medium" = "HDD"
    )
    AS
    SELECT
    (sum(_ta0001._ca0003)) AS _ca0006
    ,(sum(_ta0001._ca0002)) AS _ca0007
    ,_ta0001.d_date
    ,_ta0001.i_current_price
    ,_ta0001.w_state
    ,_ta0001.i_item_id
    FROM
    (
    SELECT
      warehouse.w_state
      ,item.i_item_id
      ,(if((date_dim.d_date < "2000-03-11"), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS _ca0002
      ,(if(("2000-03-11" <= date_dim.d_date), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS _ca0003
      ,date_dim.d_date
      ,item.i_current_price
    FROM
      catalog_sales
      LEFT OUTER JOIN
      catalog_returns
      ON (catalog_sales.cs_item_sk = catalog_returns.cr_item_sk)
         AND (catalog_sales.cs_order_number = catalog_returns.cr_order_number)
      INNER JOIN
      warehouse
      ON (catalog_sales.cs_warehouse_sk = warehouse.w_warehouse_sk)
      INNER JOIN
      item
      ON (item.i_item_sk = catalog_sales.cs_item_sk)
      INNER JOIN
      date_dim
      ON (catalog_sales.cs_sold_date_sk = date_dim.d_date_sk)
    ) _ta0001
    GROUP BY
    _ta0001.d_date
    , _ta0001.i_current_price
    , _ta0001.w_state
    , _ta0001.i_item_id;
  2. explain costs tpcds-query40.sql

explain costs select  
   w_state
  ,i_item_id
  -- ,sum(case when (cast(d_date as date) < cast ('2000-03-11' as date)) then cs_sales_price - coalesce(cr_refunded_cash,0) else 0 end) as sales_before
  -- ,sum(case when (cast(d_date as date) >= cast ('2000-03-11' as date)) then cs_sales_price - coalesce(cr_refunded_cash,0) else 0 end) as sales_after
  -- ,sum(if((date_dim.d_date < "2000-03-11"), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_before
  -- ,sum(if(("2000-03-11" <= date_dim.d_date), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_after
  --,sum(case when (d_date  <  '2000-03-11') then cs_sales_price - coalesce(cr_refunded_cash,0.00) else 0.00 end) as sales_before
  --,sum(case when (d_date  >= '2000-03-11') then cs_sales_price - coalesce(cr_refunded_cash,0.00) else 0.00 end) as sales_after
  ,sum(if((cast(d_date as date) < cast ('2000-03-11' as date)), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_before
  ,sum(if((cast(d_date as date) >= cast ('2000-03-11' as date)), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_after           
 from
   catalog_sales left outer join catalog_returns on
       (cs_order_number = cr_order_number 
        and cs_item_sk = cr_item_sk)
  ,warehouse 
  ,item
  ,date_dim
 where
     i_current_price between 0.99 and 1.49
 and i_item_sk          = cs_item_sk
 and cs_warehouse_sk    = w_warehouse_sk 
 and cs_sold_date_sk    = d_date_sk
 and d_date between date_add(cast ('2000-03-11' as date), -30)
                and date_add(cast ('2000-03-11' as date), 30)
 group by
    w_state,i_item_id
 order by w_state,i_item_id
limit 100;
  1. try to substitute case-when clause with semantic-equivallent clause to see the results:

test1 : MV rewrite fails, case-when clause with cast expr interpolation matches if-clause in MV

  ,sum(case when (cast(d_date as date) < cast ('2000-03-11' as date)) then cs_sales_price - coalesce(cr_refunded_cash,0) else 0 end) as sales_before
  ,sum(case when (cast(d_date as date) >= cast ('2000-03-11' as date)) then cs_sales_price - coalesce(cr_refunded_cash,0) else 0 end) as sales_after

test2: MV rewriting succeeds, expressions in both query and and mv schema are matches, expressions in query has no cast expr interpolation.

  ,sum(if((date_dim.d_date < "2000-03-11"), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_before
  ,sum(if(("2000-03-11" <= date_dim.d_date), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_after

test3: MV rewriting succeeds : case-when without cast expr interpolation matches if-clause in MV.

  ,sum(case when (d_date  <  '2000-03-11') then cs_sales_price - coalesce(cr_refunded_cash,0.00) else 0.00 end) as sales_before
  ,sum(case when (d_date  >= '2000-03-11') then cs_sales_price - coalesce(cr_refunded_cash,0.00) else 0.00 end) as sales_after

test4: MV rewriting succeeds : if-clause in query with cast expr interpolation matches if-clause in MV.

  ,sum(if((cast(d_date as date) < cast ('2000-03-11' as date)), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_before
  ,sum(if((cast(d_date as date) >= cast ('2000-03-11' as date)), (catalog_sales.cs_sales_price - coalesce(catalog_returns.cr_refunded_cash, 0.00)), 0.00)) AS sales_after 

Conclusion: case-when-clause with cast-expr fails to fold cast expr, which prevent MV rewriting,so we should fix it.

How to setup StarRocks cluster and prepare tpcds-1g data

  1. setup a cluster: https://docs.starrocks.io/docs/cover_pages/developers/
  2. prepare data(use scale-factor=1 instead): https://docs.starrocks.io/docs/benchmarking/TPC_DS_Benchmark/
github-actions[bot] commented 1 week ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!