An investigation of JOB revealed that the reason optd output 1 for all queries was that the selectivities of redundant predicates are all evaluated. A set of predicate contains redundant predicates if some of them can be expressed by others. For example, consider query 1a.
SELECT mc.note AS production_note,
t.title AS movie_title,
t.production_year AS movie_year
FROM company_type AS ct,
info_type AS it,
movie_companies AS mc,
movie_info_idx AS mi_idx,
title AS t
WHERE ct.kind = 'production companies'
AND it.info = 'top 250 rank'
AND mc.note NOT LIKE '%(as Metro-Goldwyn-Mayer Pictures)%'
AND (mc.note LIKE '%(co-production)%'
OR mc.note LIKE '%(presents)%')
AND ct.id = mc.company_type_id
AND t.id = mc.movie_id
AND t.id = mi_idx.movie_id
AND mc.movie_id = mi_idx.movie_id
AND it.id = mi_idx.info_type_id;
In the join plan node with t as one of the input tables, the predicates would contain both t.id = mc.movie_id and t.id = mi_idx.movie_id. But since there's already mc.movie_id = mi_idx.movie_id, one of these three predicates is redundant.
Goal
Given a set of predicates P that define the equality of N predicates, we want to pick N - 1 most selective predicates P' and remove the rest.
Implementation
Identifying Equal Columns
We identify the set of predicates in GroupColumnRefs logical property using union find, because this way we can reuse the logic of base table column ref to identify which columns are equal.
❗ However, BaseTableColumnRef does not handle table alias, so if t1 and t2 are both aliases for t, t1.a = t2.b and t1.b = t2.a will be treated as the same predicate.
Computing Selecitivty for P'
The difficulty is that we don't get to see all the predicates all at once. Instead, we see those predicates gradually as we move up the plan tree, and the child might have already picked some predicate that is not among P'. E.g. in the above example, even if mc.movie_id = mi_idx.movie_id (denoted p3) is the most selective predicate and thus should have been eliminated, it will be picked by mc join mi_idx because it's the only predicate it sees.
Therefore, the two predicates involving t (denote t.id = mc.movie_id as p1 and t.id = mi_idx.movie_id as p2) should introduce some "selectivity adjustment factor". More specifically, if we see p1 first we can just pick it, since p1 and p3 are not redundant. When we see p2, we need to pick two of the three predicates, but remember p3 and p1 are already picked. So in this case instead of just multiplying the total selectivity with p2's, p2 provides some sort of selectivity adjustment factor which can be computed by MSP({p1, p2, p3}) / MSP({p1, p3}), where MSP is the multiplied selectivies of the Most Selective Predicates that define the equality of the columns. MSP is computed using Kruskal.
This idea can be generalized to more predicates.
Results
This features significantly improves q-error for the JOB benchmark. Before this PR, the estimated cardinalities for all columns are just 1.
Background
An investigation of JOB revealed that the reason optd output 1 for all queries was that the selectivities of redundant predicates are all evaluated. A set of predicate contains redundant predicates if some of them can be expressed by others. For example, consider query 1a.
In the join plan node with
t
as one of the input tables, the predicates would contain botht.id = mc.movie_id
andt.id = mi_idx.movie_id
. But since there's alreadymc.movie_id = mi_idx.movie_id
, one of these three predicates is redundant.Goal
Given a set of predicates
P
that define the equality ofN
predicates, we want to pickN - 1
most selective predicatesP'
and remove the rest.Implementation
Identifying Equal Columns
We identify the set of predicates in
GroupColumnRefs
logical property using union find, because this way we can reuse the logic of base table column ref to identify which columns are equal.❗ However,
BaseTableColumnRef
does not handle table alias, so ift1
andt2
are both aliases fort
,t1.a = t2.b
andt1.b = t2.a
will be treated as the same predicate.Computing Selecitivty for
P'
The difficulty is that we don't get to see all the predicates all at once. Instead, we see those predicates gradually as we move up the plan tree, and the child might have already picked some predicate that is not among
P'
. E.g. in the above example, even ifmc.movie_id = mi_idx.movie_id
(denotedp3
) is the most selective predicate and thus should have been eliminated, it will be picked bymc
joinmi_idx
because it's the only predicate it sees.Therefore, the two predicates involving
t
(denotet.id = mc.movie_id
asp1
andt.id = mi_idx.movie_id
asp2
) should introduce some "selectivity adjustment factor". More specifically, if we seep1
first we can just pick it, sincep1
andp3
are not redundant. When we seep2
, we need to pick two of the three predicates, but rememberp3
andp1
are already picked. So in this case instead of just multiplying the total selectivity withp2
's,p2
provides some sort of selectivity adjustment factor which can be computed byMSP({p1, p2, p3}) / MSP({p1, p3})
, whereMSP
is the multiplied selectivies of the Most Selective Predicates that define the equality of the columns.MSP
is computed using Kruskal.This idea can be generalized to more predicates.
Results
This features significantly improves q-error for the JOB benchmark. Before this PR, the estimated cardinalities for all columns are just
1
.