At the moment, the optimized plan of the for the following query:
SELECT
L.x, L.y, R.y, R.z
FROM
L LEFT OUTER JOIN R ON L.y = R.y
renders as follows:
The general pattern is as follows:
Compute the inner join X1 of the two input relations L and R on the join predicate (a_1, .., a_n) = (b_1, .., b_n).
Compute the duplicate eliminating projection of a_1, .., a_n in X1.
Join L with the above result on a_1, .., a_n.
Compute the difference of the above with X1 to find out all elements of R that don't have a match in X1.
Extend the above difference with nulls to match the schema of X1.
Compute union of X1 and the above.
In pseudo-code, this corresponds to:
let p = |(l, r)| (l.a_1, .., l.a_n) == (r.a_1, .., r.a_n);
let X1 = inner_join(L, R).on(p);
let X2 = X1.map(|i| (i.a_1, .., i.a_n).();
let X3 = inner_join(L, X2).on(p);
let X4 = difference(L, X3);
let X5 = X4.map(|r| (r, null, .., null));
let X6 = union(X1, X5)
X6
In Differential Dataflow (DD), we need to arrange:
L on a_1, .., a_n.
R on b_1, .., b_n.
X1 on a_1, .., a_n.
For a full outer join, steps (2)-(5) need to be duplicated with reverse roles, and the final result is the union of X1 with X5 and its symmetric clone. This is very close to the relational algebra definition of outer join given in the Wikipedia article[^wiki-loj].
Proposal
I propose to replace steps (2)-(4) with an antijoin operator[^wiki-aj] which is already present in DD^dd-aj.
In pseudo-code, this corresponds to:
let p = |(l, r)| (l.a_1, .., l.a_n) == (r.a_1, .., r.a_n);
let X1 = inner_join(L, R).on(p);
let X2 = L.antijoin(R).on(p);
let X3 = X2.map(|r| (r, null, .., null));
let X6 = union(X1, X3)
X6
This has the benefit that we don't need the arrangement on X1.
Limitations
The propsal works only for outer joins with equality predicate. Joins with non-equality predicates, such as
SELECT
L.x, L.y, R.y, R.z
FROM
L LEFT OUTER JOIN R ON L.y >= R.y
will need to use the old plan.
Appendix: DB Schema
-- database
CREATE DATABASE outer_joins;
-- schema
CREATE TABLE L(x INT NOT NULL, y INT NOT NULL);
CREATE TABLE R(y INT NOT NULL, z INT NOT NULL);
-- data
INSERT INTO L VALUES (1, 2), (2, 3);
INSERT INTO R VALUES (1, 2), (2, 3);
State of the Art
At the moment, the optimized plan of the for the following query:
renders as follows:
The general pattern is as follows:
X1
of the two input relationsL
andR
on the join predicate(a_1, .., a_n) = (b_1, .., b_n)
.a_1, .., a_n
inX1
.L
with the above result ona_1, .., a_n
.X1
to find out all elements ofR
that don't have a match inX1
.X1
.X1
and the above.In pseudo-code, this corresponds to:
In Differential Dataflow (DD), we need to arrange:
L
ona_1, .., a_n
.R
onb_1, .., b_n
.X1
ona_1, .., a_n
.For a full outer join, steps (2)-(5) need to be duplicated with reverse roles, and the final result is the union of X1 with X5 and its symmetric clone. This is very close to the relational algebra definition of outer join given in the Wikipedia article[^wiki-loj].
Proposal
I propose to replace steps (2)-(4) with an
antijoin
operator[^wiki-aj] which is already present in DD^dd-aj.In pseudo-code, this corresponds to:
This has the benefit that we don't need the arrangement on
X1
.Limitations
The propsal works only for outer joins with equality predicate. Joins with non-equality predicates, such as
will need to use the old plan.
Appendix: DB Schema
[^wiki-loj]: Relational algebra: Right outer join (⟖) [^wiki-aj]: Relational algebra: Antijoin (▷)