Open wangandi opened 2 years ago
The current workaround for users is that they should call explain <query>
and see if there are implicit casts in the plan. If yes, create an index for column::<type that column has been cast into>
.
Will edit this comment as I have more details.
Indexes are used in 3 cases:
<expr> = constant
.For each of the 3 cases, we have to consider 2 subcases:
<expr> = constant
.If the predicate involves a downcast, e.g. castint64toint32(<col>)=1::int32
, the entire query will error if any row in <col>
cannot be downcast. Thus, it is simply not possible to use an index on (<col>)
to speed up this query because we will need to do a full scan of the index in order to see whether or not the query should error.
If the predicate involves an upcast, e.g. castint32toint64(<col>) = 1::int64
, the way to allow an index on <col>
to be used is to downcast both sides of the equality: <col> = 1::int32
. If downcasting the constant results in an error, then instead of propagating the error, we should transform the entire sub-relation into a zero-row constant.
Optimization can theoretically handle the entirety of this case, but there is the issue that the code that selects indexes to use with filters of the form <expr>=constant
currently lives in the MIR => LIR code. https://github.com/MaterializeInc/materialize/blob/47cc5960aa004de48fe09bbb5bb174a69f1d86aa/src/dataflow-types/src/plan/mod.rs#L442
I think that the selection of indexes to use with filters of the form <expr>=constant
should be moved to be the last transform in the MIR => MIR physical optimization block. (Incidentally, by having all index selection happen in the transform crate, this will unblock https://github.com/MaterializeInc/materialize/issues/4887#issuecomment-845278946.) The MIR => LIR code still needs to keep the code that detects if a filter can be sped up using an index since filter evaluation using an index is rendered differently from normal filter evaluation.
This is conceptually easy. Joins are already designed to run filters and projects between join stages. The following diagram shows how differential join currently works.
Thus, all we need to do is change everywhere that says "Filter + Project" and "Cast + Filter + Project".
Specifically,
MapFilterProject
evaluates filters before maps if the filter does not depend on a value produced by the map, https://github.com/MaterializeInc/materialize/blob/9611cc69528eee6228fe17e931340b1bb91e5684/src/expr/src/linear.rs#L1317However, this requires
1) the optimization team to change join_implementation.rs so it can select an index on <col>
when the join constraint specifies cast(<col>)
.
2) the dataflow team to modify the join rendering code to be able to infer, based on the indexes selected by optimization, in between which stages casts + filters would go.
Both of these tasks are non-trivial. It is likely that if we take on this task, both teams will consider handling delta joins will be a separate task from handling differential joins. Before starting, optimization and dataflow should what optimization should communicate to dataflow. It is possible that MirRelationExpr
will have to be modified to convey more detailed information about the join that it currently does; if so, we should also take the opportunity to think about how the modified MirRelationExpr
could support #7476.
This is doable entirely on the optimization side:
1) Do the same change to join_implementation.rs as needed for the Case 2B + Case3B
2) Teach the MapFilterProject
struct that upcasts that don't change the underlying representation are no-ops and downcasts that change the underlying representation are filters.
This implicit cast may happen even if you cast 'to_match' to become the same type as
foo.a
.
I am trying to figure out why this happens, see the mzt
repo.
One way to communicate whether a type conversion is (a) injective (i.e an upcast), and (b) binary compatible is to use type annotation and a rule-based macro, similar to the way this is defined for unary functions.
indexes on a column should be usable also for contexts where the column has been cast implicitly to another type when the cast does not lose information (i.e., it is an injective function).
I think this will only be possible (short of some new invention) if the stronger property of "implementations of Hash
and Ord
are preserved" holds. The information content of the type is not sufficient, we also need to be able to align the location of the data (determined by Hash
) and the layout of the data (determined by Ord
).
TL;DR
For the convenience of the user, indexes on a column should be usable also for contexts where the column has been cast implicitly to another type when the cast does not lose information (i.e., it is an injective function).
Description
When a user enters any SQL query:
The SQL layer may decide to implicitly cast
foo.a
so that it is the same type as 'to_match'. This implicit cast may happen even if you cast 'to_match' to become the same type asfoo.a
.Example:
Currently, if you create an index on a column, the index cannot be used in contexts where it has been implicitly cast to something else. In the query above, you would need an index on
varcharcol::string
instead of an index onvarcharcol
to speed up the filter. But this is not readily apparent to the user.Linked Tickets