We'll have to normalize the MirScalarExpr, e.g., we want to support
k >= ROW_NUMBER()
ROW_NUMBER() <= k
> and <
maybe even +-1 on the ROW_NUMBER side of the comparison, to support it when the user adjusts to the <=/<
ROW_NUMBER() = 1 and variations
When normalizing stuff, we should pay attention to not break provably_not_in_part!
Note that we have to track the ROW_NUMBER column across operators, because the Filter itself will just refer to an earlier ROW_NUMBER by ColRef.
(The prefix sum trick won't solve most ROW_NUMBER use cases, because all row numbers change when a new record appears at the beginning of the sort order.)
Note that some users don't just want to filter out all but the first, but just want to mark the first. So there is an extra LEFT JOIN needed. Example 1, 2.
One more pattern: count(...) OVER (...) = 1 for marking the first one in each partition. Note that this filters nulls. (Although it turned out that the user's original count query was not doing what they wanted to actually do.)
When
ROW_NUMBER
is used for a TopK or Top1 (which seems to be quite common, even our internal VIEWs have it https://github.com/MaterializeInc/materialize/issues/16584), our optimizer could translate to our efficient, builtin TopK (or Top1), see https://materialize.com/docs/sql/patterns/top-k/We'll have to normalize the MirScalarExpr, e.g., we want to support
k >= ROW_NUMBER()
ROW_NUMBER() <= k
>
and<
ROW_NUMBER
side of the comparison, to support it when the user adjusts to the<=
/<
ROW_NUMBER() = 1
and variationsWhen normalizing stuff, we should pay attention to not break
provably_not_in_part
!Note that we have to track the
ROW_NUMBER
column across operators, because theFilter
itself will just refer to an earlierROW_NUMBER
byColRef
.(The prefix sum trick won't solve most
ROW_NUMBER
use cases, because all row numbers change when a new record appears at the beginning of the sort order.)And after we are done with
ROW_NUMBER
, we should similarly handle other similar functions, e.g.RANK
,DENSE_RANK
,PERCENT_RANK
,CUME_DIST
,NTILE
. I'll break this out into a separate issue later, but the gist is that we'll need to add some flags on TopK to make it able to handle ties in various ways. Btw. there is an example forRANK() = 1
in one of our internal queries. Also, there is a user who wantsDENSE_RANK() = 1
.Someone wanted to use
array_agg(...)[1]
.Note that some users don't just want to filter out all but the first, but just want to mark the first. So there is an extra LEFT JOIN needed. Example 1, 2.
One more pattern:
count(...) OVER (...) = 1
for marking the first one in each partition. Note that this filters nulls. (Although it turned out that the user's original count query was not doing what they wanted to actually do.)One more pattern:
row_number() IN (1,2,3)
.