MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Correlated inner join may result in a three-way cross join #22463

Open aalexandrov opened 10 months ago

aalexandrov commented 10 months ago

Problem

In the HIR ⇒ MIR lowering phase, decorrelated equi-joins that refer to the outer context in their join conditions will produce decorrelated plans that will result in sub-optimal join order with unnecessary cross join stages.

For example, the OPTIMIZED PLAN for the following query:

SELECT * FROM outer JOIN LATERAL (
  SELECT *
  FROM
    facts
    JOIN dim01 ON(x * facts.dim01_k01 = y * dim01.dim01_k01)
) ON(true);

looks as follows

Explained Query:
  Join on=((x * dim01_k01) = (y * dim01_k01)) type=delta // { arity: 17 }
    implementation
      %0:outer » %1:facts[×]A » %2:dim01[×]A
      %1:facts » %0:outer[×]A » %2:dim01[×]A
      %2:dim01 » %0:outer[×]A » %1:facts[×]A
    ArrangeBy keys=[[]] // { arity: 2 }
      ReadStorage materialize.left_joins_raw.outer // { arity: 2 }
    ArrangeBy keys=[[]] // { arity: 9 }
      Filter (dim01_k01) IS NOT NULL // { arity: 9 }
        ReadStorage materialize.left_joins_raw.facts // { arity: 9 }
    ArrangeBy keys=[[]] // { arity: 6 }
      ReadStorage materialize.left_joins_raw.dim01 // { arity: 6 }

Source materialize.left_joins_raw.facts
  filter=((dim01_k01) IS NOT NULL)

but the decorrelated plan looks as follows:

Return // { arity: 17 }
  Filter true // { arity: 17 }
    Project (#0, #1, #4..=#18) // { arity: 17 }
      Join on=(x = x AND y = y) // { arity: 19 }
        Get l0 // { arity: 2 }
        Filter ((x * dim01_k01) = (y * dim01_k01)) // { arity: 17 }
          Project (#0..=#10, #13..=#18) // { arity: 17 }
            Join on=(x = x AND y = y) // { arity: 19 }
              CrossJoin // { arity: 11 }
                Get l1 // { arity: 2 }
                Get materialize.left_joins_raw.facts // { arity: 9 }
              CrossJoin // { arity: 8 }
                Get l1 // { arity: 2 }
                Get materialize.left_joins_raw.dim01 // { arity: 6 }
With
  cte l1 =
    Distinct group_by=[x, y] // { arity: 2 }
      Get l0 // { arity: 2 }
  cte l0 =
    CrossJoin // { arity: 2 }
      Constant // { arity: 0 }
        - ()
      Get materialize.left_joins_raw.outer // { arity: 2 }

If we don't de-dupe the Get materialize.left_joins_raw.outer part after decorrelation we can end up with a plan that has two cross joins (lhs = outer x facts and rhs = outer x dim01), but can then do an equi-join between lhs and rhs.

Solution

Unclear.

Example schema

CREATE TABLE outer(
    x int not null,
    y int not null
);

CREATE TABLE facts(
    facts_k01 int not null,
    dim01_k01 int,
    dim02_k01 int,
    dim03_k01 int,
    facts_d01 int,
    facts_d02 int,
    facts_d03 int,
    facts_d04 int,
    facts_d05 int
);

CREATE TABLE dim01(
    dim01_k01 int,
    dim01_d01 int,
    dim01_d02 int,
    dim01_d03 int,
    dim01_d04 int,
    dim01_d05 int
);

See also

aalexandrov commented 10 months ago

@mgree / @ggevay I think this falls under the "Improved join planning" category. Do we have an umbrella issue for that?