MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Defer `RowSetFinishing` extraction until view inlining #8678

Open frankmcsherry opened 2 years ago

frankmcsherry commented 2 years ago

Our SELECT expressions extract a RowSetFinishing, which introduces an order and projects away hidden columns that may be required for the order. This currently happens early in the process, before we perform view inlining and cross view optimization. As a result, we may miss opportunities for simplified queries that could be handled with the finishing actions.

For example,

materialize=> create table foo (a int);
CREATE TABLE
materialize=> create view bar as select * from foo limit 1;
CREATE VIEW
materialize=> explain select * from bar;
              Optimized Plan               
-------------------------------------------
 %0 =                                     +
 | Get materialize.public.foo (u1)        +
 | TopK group=() order=() limit=1 offset=0+

(1 row)

materialize=>

This view could be have been implemented as

materialize=> explain select * from foo limit 1;
                  Optimized Plan                  
--------------------------------------------------
 %0 =                                            +
 | Get materialize.public.foo (u1)               +
                                                 +
 Finish order_by=() limit=1 offset=0 project=(#0)+

(1 row)

materialize=> 

The former one has the flexibility to avoid the finishing if it likes, as it does not have an ordering constraint (even if one was added to the inner SELECT, the outer SELECT removes it). However, if we think that the finishing action is the most efficient way to do things, the second plan would probably be preferred.

frankmcsherry commented 2 years ago

It is challenging to fully defer the extraction until after optimization, because we may be required to return ordered results to the user (an outer ORDER BY clause) and the optimization that occurs up to this point is against an un-ordered multiset.

However, it seems plausible that we could only extract ORDER BY (and projecting away ordering columns) early, optimize the resulting expression as an unordered multiset, and then extract whatever remains post optimization. If the root node post optimization is a TopK with the same ordering, just with a LIMIT and OFFSET, great extract it. If for any reason that is not the case, no sweat and just build the right dataflow.

It seems possible that we will miss some opportunities that our current approach would have forced to be extracted. However, we are also clearly missing cases now that we would love to put through optimization and then extract. I don't have a great a priori sense for which is better / more important, so leaving these notes here instead!

asenac commented 2 years ago

AFAIU there is no way to compare a column-offset based ORDER BY properties before and after the optimization process, since things like column pruning may alter the meaning of those offsets. Also, ORDER BY #1, #2 could be simplified as ORDER BY #1 by the optimizer if #2 functionally depends on #1, making the comparison even harder, if even possible.

We could limit that solution to RowSetFinishing instances without explicit ordering. We could install those RowSetFinishings in the dataflow asTopK operators and then extract it from the optimized dataflow. However, cases like the following will not be covered by that solution.

select * from (select * from ... order by a, b limit 10) order by a, b limit 5;
wangandi commented 2 years ago

Relatively unimportant, but because RowSetFinishing extraction happens early, we can't fold a TopK into a constant select.

materialize=> explain select * from (values (1, 2, 3), (2, 3, 4)) limit 0;
                    Optimized Plan                     
-------------------------------------------------------
 %0 =                                                 +
 | Constant (1, 2, 3) (2, 3, 4)                       +
                                                      +
 Finish order_by=() limit=0 offset=0 project=(#0..=#2)+

(1 row)

In this plan, we render the full constant and then run RowSetFinishing on it.

Thanks to @philip-stoev for bringing this to my attention.