MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com
Other
5.65k stars 458 forks source link

Consider using DD's `antijoin` to render outer joins #10500

Open aalexandrov opened 2 years ago

aalexandrov commented 2 years ago

State of the Art

At the moment, the optimized plan of the for the following query:

SELECT
    L.x, L.y, R.y, R.z
FROM
    L LEFT OUTER JOIN R ON L.y = R.y

renders as follows:

optimized-plan

The general pattern is as follows:

  1. Compute the inner join X1 of the two input relations L and R on the join predicate (a_1, .., a_n) = (b_1, .., b_n).
  2. Compute the duplicate eliminating projection of a_1, .., a_n in X1.
  3. Join L with the above result on a_1, .., a_n.
  4. Compute the difference of the above with X1 to find out all elements of R that don't have a match in X1.
  5. Extend the above difference with nulls to match the schema of X1.
  6. Compute union of X1 and the above.

In pseudo-code, this corresponds to:

let p = |(l, r)| (l.a_1, .., l.a_n) == (r.a_1, .., r.a_n);
let X1 = inner_join(L, R).on(p);
let X2 = X1.map(|i| (i.a_1, .., i.a_n).();
let X3 = inner_join(L, X2).on(p);
let X4 = difference(L, X3);
let X5 = X4.map(|r| (r, null, .., null));
let X6 = union(X1, X5)
X6

In Differential Dataflow (DD), we need to arrange:

  1. L on a_1, .., a_n.
  2. R on b_1, .., b_n.
  3. X1 on a_1, .., a_n.

For a full outer join, steps (2)-(5) need to be duplicated with reverse roles, and the final result is the union of X1 with X5 and its symmetric clone. This is very close to the relational algebra definition of outer join given in the Wikipedia article[^wiki-loj].

Proposal

I propose to replace steps (2)-(4) with an antijoin operator[^wiki-aj] which is already present in DD^dd-aj.

In pseudo-code, this corresponds to:

let p = |(l, r)| (l.a_1, .., l.a_n) == (r.a_1, .., r.a_n);
let X1 = inner_join(L, R).on(p);
let X2 = L.antijoin(R).on(p);
let X3 = X2.map(|r| (r, null, .., null));
let X6 = union(X1, X3)
X6

This has the benefit that we don't need the arrangement on X1.

Limitations

The propsal works only for outer joins with equality predicate. Joins with non-equality predicates, such as

SELECT
    L.x, L.y, R.y, R.z
FROM
    L LEFT OUTER JOIN R ON L.y >= R.y

will need to use the old plan.

optimized-plan

Appendix: DB Schema

-- database
CREATE DATABASE outer_joins;
-- schema
CREATE TABLE L(x INT NOT NULL, y INT NOT NULL);
CREATE TABLE R(y INT NOT NULL, z INT NOT NULL);
-- data
INSERT INTO L VALUES (1, 2), (2, 3);
INSERT INTO R VALUES (1, 2), (2, 3);

[^wiki-loj]: Relational algebra: Right outer join (⟖) [^wiki-aj]: Relational algebra: Antijoin (▷)

frankmcsherry commented 2 years ago

Prior discussion of the current plan wrt other plans: https://github.com/MaterializeInc/materialize/issues/2732

I'm not sure if it concludes anything with respect to this proposal, just wanted to link it.