StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9k stars 1.81k forks source link

Push down Aggregation in the plans containing in/exists subquery before entrance of Optimizer's Memo. #38152

Closed satanson closed 3 months ago

satanson commented 10 months ago

Enhancement

Motivation

For TPCDS-query69, the query contains exists-subquery predicates, so left anti/semi joins appear in the final plan, the upper aggregation fails to be materialized for re-use because left anti/semi joins contains highly selective predicates, in the real scenarios, these predicates may be changing, so materializing upper aggregation is a bad choice. however, we can split the upper aggregation into two, then push down the first aggregation across left anti/semi joins, the second aggregation just rollups the result of the first to get the final result, in this new plan, the first aggregation can be materialized.

Original query69

-- query 69
select  
  cd_gender,
  cd_marital_status,
  cd_education_status,
  count(*) cnt1,
  cd_purchase_estimate,
  count(*) cnt2,
  cd_credit_rating,
  count(*) cnt3
 from
  customer c,customer_address ca,customer_demographics
 where
  c.c_current_addr_sk = ca.ca_address_sk and
  ca_state in ('KY', 'GA', 'NM') and
  cd_demo_sk = c.c_current_cdemo_sk and 
  exists (select *
          from store_sales,date_dim
          where c.c_customer_sk = ss_customer_sk and
                ss_sold_date_sk = d_date_sk and
                d_year = 2001 and
                d_moy between 4 and 4+2) and
   (not exists (select *
            from web_sales,date_dim
            where c.c_customer_sk = ws_bill_customer_sk and
                  ws_sold_date_sk = d_date_sk and
                  d_year = 2001 and
                  d_moy between 4 and 4+2) and
    not exists (select * 
            from catalog_sales,date_dim
            where c.c_customer_sk = cs_ship_customer_sk and
                  cs_sold_date_sk = d_date_sk and
                  d_year = 2001 and
                  d_moy between 4 and 4+2))
 group by cd_gender,
          cd_marital_status,
          cd_education_status,
          cd_purchase_estimate,
          cd_credit_rating
 order by cd_gender,
          cd_marital_status,
          cd_education_status,
          cd_purchase_estimate,
          cd_credit_rating
 limit 100;

The equivalent query that is rewritten from query69 manully to imitate aggregation pushdown.

-- query 69
with cte as (select
  c_customer_sk,  
  cd_gender,
  cd_marital_status,
  cd_education_status,
  count(*) cnt1,
  cd_purchase_estimate,
  count(*) cnt2,
  cd_credit_rating,
  count(*) cnt3
from
  customer c,customer_address ca,customer_demographics
where
  c.c_current_addr_sk = ca.ca_address_sk and
  ca_state in ('KY', 'GA', 'NM') and
  cd_demo_sk = c.c_current_cdemo_sk
group by c_customer_sk,
      cd_gender,
          cd_marital_status,
          cd_education_status,
          cd_purchase_estimate,
          cd_credit_rating
)
select           
  cd_gender,
  cd_marital_status,
  cd_education_status,
  sum(cnt1) cnt1,
  cd_purchase_estimate,
  sum(cnt2) cnt2,
  cd_credit_rating,
  sum(cnt3) cnt3
from cte
where   
   exists (select *
          from store_sales,date_dim
          where cte.c_customer_sk = ss_customer_sk and
                ss_sold_date_sk = d_date_sk and
                d_year = 2001 and
                d_moy between 4 and 4+2) and
   (not exists (select *
            from web_sales,date_dim
            where cte.c_customer_sk = ws_bill_customer_sk and
                  ws_sold_date_sk = d_date_sk and
                  d_year = 2001 and
                  d_moy between 4 and 4+2) and
    not exists (select * 
            from catalog_sales,date_dim
            where cte.c_customer_sk = cs_ship_customer_sk and
                  cs_sold_date_sk = d_date_sk and
                  d_year = 2001 and
                  d_moy between 4 and 4+2))
group by  cd_gender,
          cd_marital_status,
          cd_education_status,
          cd_purchase_estimate,
          cd_credit_rating                          
 order by cd_gender,
          cd_marital_status,
          cd_education_status,
          cd_purchase_estimate,
          cd_credit_rating
 limit 100;

Implementation contraint

Althrough SR support aggregation pushdown, however, it works in post-Memo time. materialized view adoption is determined in the in-memo/pre-memo time. so, we should support pushdown aggregation before entrance memo.

XinzhuangL commented 10 months ago

@satanson Hi, Could you assign it to me?

satanson commented 10 months ago

@XinzhuangL my email address is ranpanf@gmail.com, if you have some questions, reach me via email.

XinzhuangL commented 10 months ago

@XinzhuangL my email address is ranpanf@gmail.com, if you have some questions, reach me via email.

get

github-actions[bot] commented 4 months ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!