apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.41k stars 1.27k forks source link

[multistage][discuss] window agg with same partition but with/without order by #11627

Open walterddr opened 1 year ago

walterddr commented 1 year ago

queries with the same partition but a missing or non-missing order-by clause:

SELECT
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS row_num,
  COUNT(*) OVER (PARTITION BY a) AS cnt
from tbl

should be allowed, b/c both are partitioned by a

  1. can we do it within the same window?

  2. if not, can we create a plan similar to writing the SQL as the following automatically

    
    WITH tmp AS (
    SELECT 
    ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS row_num
    FROM tbl
    )

SELECT row_num, COUNT(*) OVER (PARTITION BY a) AS cnt FROM tmp


3. and when execute: can we optimize out the exchange
  LogicalWindow(window#0=[window(partition {3} aggs [COUNT()])])
    PinotLogicalExchange(distribution=[hash[3]]) <-- get rid of this exchange? 
        LogicalWindow(window#0=[window(partition {3} order by [10] rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])])
          PinotLogicalSortExchange(distribution=[hash[3]], collation=[[10]], isSortOnSender=[false], isSortOnReceiver=[true])
                LogicalTableScan(table=[[tbl]])
walterddr commented 1 year ago

CC @sonam @snleee

somandal commented 1 year ago

One of the problems here is that the two window functions belong to different window groups. The default for ROW_NUMBER is ROWS type frame and for COUNT it’s RANGE type. I need to check if calcite splits the window groups for these if ORDER BY isn’t present, but my recent PR adds ROWS support for window functions with ORDER BY allowing such a query to run if for COUNT the explicit ROWS UNBOUNDED PRECEDING is passed. We decided not to support multiple window groups in the same query due to the complications around how exchanges should be done and which part should handle which window function

I’ll get back to you later on my testing without ORDER BY, but just a caveat here that calcite doesn’t let you use ROWS without ORDER BY (it will compile but calcite will override frame type as RANGE)

also just to make sure I understand, what’s the intended frames behavior here with the two window functions?

supporting multiple window groups in a generic way will require some design and implementation to get the exchanges right. That’s why we decided not to support it in phase 1.

walterddr commented 1 year ago

yes I thought so as well so in addition to (1) i proposed the (2) and (3) optimization and see if that's possible --> simply being that it doesn't introduce additional exchange, which in most of the cases is the significant overhead in computation.

somandal commented 1 year ago

Yes breaking it into (2) + (3) and optimizing out the second exchange should be possible. The planner will need to be modified to a) rewrite the query and b) identify the exchange removal optimization (I.e. identify that the partition by clause is the same)

walterddr commented 1 year ago

for now I am ok with (3) only b/c that's the performance boost part. syntatic suger part (2) can be debated --> if we have (1) we dont really need (2)

somandal commented 1 year ago

Cool. Would it be fair to check the node’s child and if it’s a window function with same partitioning (and ordering if present in both) we skip adding the exchange? We still need exchanges if ORDER BY key is added/changed.