Optimization/dataflow: have indexes support implicit casts

wangandi commented 2 years ago

TL;DR

For the convenience of the user, indexes on a column should be usable also for contexts where the column has been cast implicitly to another type when the cast does not lose information (i.e., it is an injective function).

Description

When a user enters any SQL query:

select * from foo where foo.a = 'to_match'

The SQL layer may decide to implicitly cast foo.a so that it is the same type as 'to_match'. This implicit cast may happen even if you cast 'to_match' to become the same type as foo.a.

Example:

materialize=> explain raw plan for select selectcol from test where varcharcol = cast('string' as character varying);
                              Raw Plan                              
--------------------------------------------------------------------
 %0 =                                                              +
 | Get materialize.public.test (u19)                               +
 | Filter (varchartostr(#1) = varchartostr(strtovarchar("string")))+
 | Project (#0)

Currently, if you create an index on a column, the index cannot be used in contexts where it has been implicitly cast to something else. In the query above, you would need an index on varcharcol::string instead of an index on varcharcol to speed up the filter. But this is not readily apparent to the user.

Linked Tickets

Case of an implicit cast where the in-memory representation is identical: #10738.
Case of an implicit cast where the in-memory representation is different: #4171.

wangandi commented 2 years ago

The current workaround for users is that they should call explain <query> and see if there are implicit casts in the plan. If yes, create an index for column::<type that column has been cast into>.

wangandi commented 2 years ago

Task Breakdown

Will edit this comment as I have more details.

Outline

Indexes are used in 3 cases:

Filters of the form <expr> = constant.
Differential joins.
Delta joins.

For each of the 3 cases, we have to consider 2 subcases:

A: the cast does not change the underlying binary representation.
B: the cast does change the underlying binary representation.

Case 1A and 1B: Filters of the form `<expr> = constant`.

If the predicate involves a downcast, e.g. castint64toint32(<col>)=1::int32, the entire query will error if any row in <col> cannot be downcast. Thus, it is simply not possible to use an index on (<col>) to speed up this query because we will need to do a full scan of the index in order to see whether or not the query should error.

If the predicate involves an upcast, e.g. castint32toint64(<col>) = 1::int64, the way to allow an index on <col> to be used is to downcast both sides of the equality: <col> = 1::int32. If downcasting the constant results in an error, then instead of propagating the error, we should transform the entire sub-relation into a zero-row constant.

Optimization can theoretically handle the entirety of this case, but there is the issue that the code that selects indexes to use with filters of the form <expr>=constant currently lives in the MIR => LIR code. https://github.com/MaterializeInc/materialize/blob/47cc5960aa004de48fe09bbb5bb174a69f1d86aa/src/dataflow-types/src/plan/mod.rs#L442

I think that the selection of indexes to use with filters of the form <expr>=constant should be moved to be the last transform in the MIR => MIR physical optimization block. (Incidentally, by having all index selection happen in the transform crate, this will unblock https://github.com/MaterializeInc/materialize/issues/4887#issuecomment-845278946.) The MIR => LIR code still needs to keep the code that detects if a filter can be sped up using an index since filter evaluation using an index is rendered differently from normal filter evaluation.

Case 2B & Case 3B: Joins where columns with different underlying representations are compared

This is conceptually easy. Joins are already designed to run filters and projects between join stages. The following diagram shows how differential join currently works.

Thus, all we need to do is change everywhere that says "Filter + Project" and "Cast + Filter + Project".

Specifically,

If we need to upcast to use an existing arrangement, we just need to upcast the input collection.
If we need to downcast to use an existing arrangement, we need to filter away values that would error before running the downcast. This behavior is already built into dataflow rendering; MapFilterProject evaluates filters before maps if the filter does not depend on a value produced by the map, https://github.com/MaterializeInc/materialize/blob/9611cc69528eee6228fe17e931340b1bb91e5684/src/expr/src/linear.rs#L1317
We should make sure that rows coming out of the join are of the expected types.

However, this requires 1) the optimization team to change join_implementation.rs so it can select an index on <col> when the join constraint specifies cast(<col>). 2) the dataflow team to modify the join rendering code to be able to infer, based on the indexes selected by optimization, in between which stages casts + filters would go.

Both of these tasks are non-trivial. It is likely that if we take on this task, both teams will consider handling delta joins will be a separate task from handling differential joins. Before starting, optimization and dataflow should what optimization should communicate to dataflow. It is possible that MirRelationExpr will have to be modified to convey more detailed information about the join that it currently does; if so, we should also take the opportunity to think about how the modified MirRelationExpr could support #7476.

Case 2A & Case 3A: Joins where columns with the same underlying representations are compared

This is doable entirely on the optimization side: 1) Do the same change to join_implementation.rs as needed for the Case 2B + Case3B 2) Teach the MapFilterProject struct that upcasts that don't change the underlying representation are no-ops and downcasts that change the underlying representation are filters.

aalexandrov commented 2 years ago

This implicit cast may happen even if you cast 'to_match' to become the same type as foo.a.

I am trying to figure out why this happens, see the mzt repo.

aalexandrov commented 2 years ago

One way to communicate whether a type conversion is (a) injective (i.e an upcast), and (b) binary compatible is to use type annotation and a rule-based macro, similar to the way this is defined for unary functions.

frankmcsherry commented 1 year ago

indexes on a column should be usable also for contexts where the column has been cast implicitly to another type when the cast does not lose information (i.e., it is an injective function).

I think this will only be possible (short of some new invention) if the stronger property of "implementations of Hash and Ord are preserved" holds. The information content of the type is not sufficient, we also need to be able to align the location of the data (determined by Hash) and the layout of the data (determined by Ord).

ggevay commented 1 month ago

Case 1A and 1B: Filters of the form <expr> = constant.

This has been solved, see invert_casts_on_expr_eq_literal.

The join cases are being discussed here.

MaterializeInc / materialize