MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.71k stars 466 forks source link

Compile `ROW_NUMBER <= k` patterns to `TopK` #16758

Open ggevay opened 1 year ago

ggevay commented 1 year ago

When ROW_NUMBER is used for a TopK or Top1 (which seems to be quite common, even our internal VIEWs have it https://github.com/MaterializeInc/materialize/issues/16584), our optimizer could translate to our efficient, builtin TopK (or Top1), see https://materialize.com/docs/sql/patterns/top-k/

We'll have to normalize the MirScalarExpr, e.g., we want to support

When normalizing stuff, we should pay attention to not break provably_not_in_part!

Note that we have to track the ROW_NUMBER column across operators, because the Filter itself will just refer to an earlier ROW_NUMBER by ColRef.

(The prefix sum trick won't solve most ROW_NUMBER use cases, because all row numbers change when a new record appears at the beginning of the sort order.)

And after we are done with ROW_NUMBER, we should similarly handle other similar functions, e.g. RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE. I'll break this out into a separate issue later, but the gist is that we'll need to add some flags on TopK to make it able to handle ties in various ways. Btw. there is an example for RANK() = 1 in one of our internal queries. Also, there is a user who wants DENSE_RANK() = 1.

Someone wanted to use array_agg(...)[1].

Note that some users don't just want to filter out all but the first, but just want to mark the first. So there is an extra LEFT JOIN needed. Example 1, 2.

One more pattern: count(...) OVER (...) = 1 for marking the first one in each partition. Note that this filters nulls. (Although it turned out that the user's original count query was not doing what they wanted to actually do.)

One more pattern: row_number() IN (1,2,3).

ggevay commented 1 year ago

When working on this, we should check that we are solving https://github.com/MaterializeInc/materialize/issues/16584 (if it's not solved until then by manually changing that to DISTINCT ON).

ggevay commented 6 months ago

Users keep running into this.