Open walterddr opened 1 year ago
CC @sonam @snleee
One of the problems here is that the two window functions belong to different window groups. The default for ROW_NUMBER is ROWS type frame and for COUNT it’s RANGE type. I need to check if calcite splits the window groups for these if ORDER BY isn’t present, but my recent PR adds ROWS support for window functions with ORDER BY allowing such a query to run if for COUNT the explicit ROWS UNBOUNDED PRECEDING is passed. We decided not to support multiple window groups in the same query due to the complications around how exchanges should be done and which part should handle which window function
I’ll get back to you later on my testing without ORDER BY, but just a caveat here that calcite doesn’t let you use ROWS without ORDER BY (it will compile but calcite will override frame type as RANGE)
also just to make sure I understand, what’s the intended frames behavior here with the two window functions?
supporting multiple window groups in a generic way will require some design and implementation to get the exchanges right. That’s why we decided not to support it in phase 1.
yes I thought so as well so in addition to (1) i proposed the (2) and (3) optimization and see if that's possible --> simply being that it doesn't introduce additional exchange, which in most of the cases is the significant overhead in computation.
Yes breaking it into (2) + (3) and optimizing out the second exchange should be possible. The planner will need to be modified to a) rewrite the query and b) identify the exchange removal optimization (I.e. identify that the partition by clause is the same)
for now I am ok with (3) only b/c that's the performance boost part. syntatic suger part (2) can be debated --> if we have (1) we dont really need (2)
Cool. Would it be fair to check the node’s child and if it’s a window function with same partitioning (and ordering if present in both) we skip adding the exchange? We still need exchanges if ORDER BY key is added/changed.
queries with the same partition but a missing or non-missing order-by clause:
should be allowed, b/c both are partitioned by a
can we do it within the same window?
if not, can we create a plan similar to writing the SQL as the following automatically
SELECT row_num, COUNT(*) OVER (PARTITION BY a) AS cnt FROM tmp