Open snth opened 2 years ago
I think we can do these with standard INTERSECT
operators — along with https://github.com/prql/prql/issues/656? Or are these not well-supported?
e.g. Postgres https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-intersect/
Leaving this here, for performance reasons. https://www.crunchydata.com/blog/rise-of-the-anti-join
I was thinking how to best implement union
and came to realization, that many of the operations can be translated down a small set of operators in relational algebra.
For example our current approach for distinct
/unique
is:
func unique tbl -> group [identity] (take 1) tbl
This representation with group + take
is just intermediate and is unpacked into actual SELECT DISTINCT
when translating. But the point is that we keep RQ (the intermediate stage) simple so it's debuggable and easy to implement for new backends.
That said union
, intersect
and difference
can all be implemented only by adding concat
operator (dplyr calls it bind_rows
, pandas calls it concatenate
):
concat a b
-> UNION ALL
union a b
= concat a b | unique
-> UNION
difference a b =
= join a [identity] b side:left | filter b.identity == null | select a.*
-> SELECT a.* FROM a LEFT JOIN b ON ... WHERE b.id IS NULL
intersection a b
= join a [identity] b side:inner | select a.*
-> SELECT a.* FROM a INNER JOIN b ON ...
union
, difference
and intersection
can be defined in std lib as normal functions that resolve to using derivations above. The best things is that when someone writes:
from a
select x, y
concat b
group [x, y] (take 1)
... this will translate to UNION
, because it will have the same RQ representation as using union a b
.
Additionally, it gives us the freedom to adapt sql backend to different databases. If difference
of intersection
implementations via JOIN prove to have bad performance on some db engines, we can add preprocessing stage that converts them to EXCLUDE and INTERSECT.
I'm also considering changing representation of JOIN (and maybe its SQL translation):
join a [a.id==b.id] b
~ join a b | filter a.id==b.id
-> SELECT ... FROM a, b WHERE a.id = b.id
But I cannot find a way to express LEFT, RIGHT and OUTER JOIN with only FROM and WHERE.
I was thinking how to best implement
union
and came to realization, that many of the operations can be translated down a small set of operators in relational algebra.
Very elegant! I hadn't thought about it like this before, but I think you're absolutely right. This would be a really nice interface, and can still be translated to SQL.
I'm definitely up for this
I'm also considering changing representation of JOIN (and maybe its SQL translation):
join a [a.id==b.id] b ~ join a b | filter a.id==b.id -> SELECT ... FROM a, b WHERE a.id = b.id
I'm familiar with this approach — I think it was standard in Oracle DBs a while ago (or at least I've seen folks from that world use it)
My understanding is that it's much less common now, since DBs are worse at optimizing it (though this deserves a fact-check, it doesn't seem too difficult to optimize...), and it hides some of the query's intent — first join, then filter — rather than starting with a full cross-product and then filtering.
So iff my assumptions are correct, I would vote to keep JOIN
. OTOH, if we can show that there aren't performance impacts and it simplies the compiler, then +1 from me.
I'm a big fan of the anti-join, I think it's a pity that that SQL makes it syntactically asymmetric with the other kinds of join (especially since NOT IN is so different when NULLs get involved). It very nicely complements the idea of a regular join, which is often a filter on correlated tuple existence, anti-joins are very neatly the inverse.
I really wish I could write "ANTI JOIN" in SQL and use it like any of the other joins. A number of executor people also use the anti-join nomenclature, I can't think of anyone who thinks of it as purely a way of using correlated subqueries, and indeed detecting the anti-join is a popular optimization target.
I think we can probably close this now #894 is closed
(not withstanding the discussion re joins & @fdr 's comment, no intention to cut that short)
No, keep it open. Intesection, difference are not done yet.
I like you approach @aljazerzen . It's very cool that you can detect the patterns in the RQ, regardless of how they were generated, and then produce the SQL from that. I would love to understand how you do that. In which file/module/part of the compilation process do you do that?
One possible problem I see with your proposed SQL implementations though is that where you have [identity]
in the snippets below, in practice this identity
will be *
(i.e. all columns) so how do you generate the required ON
and WHERE
clauses? Even if you had access to the table definition and all column names, how efficient would that be?
Therefore for the SQL translation phase it might still be easier to use EXCEPT / INTERSECT
or NOT EXISTS / EXISTS
. A related blog post Rise of the anti-join.
difference a b = = join a [identity] b side:left | filter b.identity == null | select a. -> SELECT a. FROM a LEFT JOIN b ON ... WHERE b.id IS NULL
intersection a b = join a [identity] b side:inner | select a. -> SELECT a. FROM a INNER JOIN b ON ...
In which file/module/part of the compilation process do you do that?
A bit all over the place, this is the central one.
One possible problem I see
That's a valid concern. I'm not sure yet how i'll do that.
how efficient would that be
Plan is to compare all columns with each other in ON clause. This would work if all columns are known and is the simplest solution rn. I think modern optimizers will kick-in and replace comparison of all columns with comparison of only ids, so it shouldn't be any slower than other approaches.
But for unknown columns this won't work. For that, I'll probably resort to EXCEPT / INTERSECT
, as that are supposedly more performant according to the linked post.
A few questions:
union
intersection
and difference
ok?intersection
and difference
also apply distinct
? In SQL INTERSECT does and I may be ok with that. These are set operations afterall.I was going to raise a question about the names as well.
I think at least for intersection
we could go with the shorter intersect
which is still valid English and a verb.
difference
is also too long for my liking which goes a bit against our ergonomics aim. I was thinking we could maybe go with diff
? Not proper English but at least quite common in programming circles? Otherwise what about remove
? That seems like it would be the inverse operation of append
, both in English and in our usage in PRQL. If you append
and then remove
the same relation you should mostly get back to your original table (some edge cases with duplicates aside).
Off-topic: My other bugbears are aggregate
and derive
but I haven't been able to think of appropriate shorter versions.
intersect
and remove
are ok with me.
Correction: Postgres supports ALL keyword for all three operators. SQLite supports it only for UNION.
If we go with remove
, it is not obvious that this is a set operation. Also, we want operations to be as orthogonal as possible, which means not bundling many operations into single function. So I'd vote for not implicitly applying DISTINCT.
Looks like DuckDB follows SQLite here.
If I understand the Postgres docs correctly, then EXCEPT ALL
does actually the opposite of what I would expect (just from the English), i.e. (pseudocode for brevity) [1,2,2] EXCEPT ALL [2]
returns [1,2]
while [1,2,2] EXCEPT [2]
returns [1]
.
Why don't we just use except
then? And use remove
for EXCEPT ALL
where that is supported? This would complete the following table:
add | subtract | |
---|---|---|
distinct | union |
except |
not distinct | append |
remove |
I like the table a lot.
SQL behavior is (again) a bit confusing, but if I understand correctly, this is the behavior:
a UNION b = a | append b | distinct
a UNION ALL b = a | append b
a EXCEPT b = a | distinct | remove b
a EXCEPT ALL b = a | remove b
a INTERSECT b = a | intersect b | distinct
a INTERSECT ALL b = a | intersect b
note that EXCEPT has distinct before remove
I was curious about the a EXCEPT b
part whether it would also apply the DISTINCT
to the rows not removed. Looks like it does in DuckDB so your map of the behaviour looks correct:
Just to briefly comment that this sounds good, thanks for working through the concepts.
One question is whether we need two functions for each of Adding and Subtracting, or for each case, one is just the other with | distinct
on. Arguments for this:
UNION DISTINCT
removed duplicate rows that existed within a table prior. Only having append | distinct
rather than union
makes this obvious.OTOH:
You make a good point - SQL behavior around "when to apply the distinct" is not the only obvious behavior. And it may not the expected behavior for some problems.
Not having union
&expect
would be in spirit of "only one way to do it" and "few orthogonal operations".
So my vote now is to have only append
& remove
, but provide the docs on how to construct your own distinct
, union
& except
.
I agree with this and also had the thought when we were drawing up the table that we're now adding a lot of keywords. Totally supportive of having fewer, composable, orthogonal operations, especially if the append | distinct
and distinct | remove
combinations can be detected in the RQ translation stage and elided into the equivalent SQL (as was commented on here https://github.com/PRQL/prql/issues/761#issuecomment-1333519245 and here https://github.com/PRQL/prql/issues/761#issuecomment-1375742612).
Ok, it looks we have an agreement. This is now a matter of implementation.
We have a problem: the implementation of remove (and intersect) I posted above does not handle duplicates correctly:
func remove bottom top -> (
top
join bottom top.* == bottom.*
filter bottom.* == null
select top.*
)
If top = [1, 2, 2, 3]
and bottom = [1, 2]
, value in the pipeline after join is:
| top | bottom |
|-----|--------|
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 3 | null |
... which means that the result contains only [3]
, but our definition of remove
says it should be [2, 3]
.
If we wanted to do it correctly, this would be the implementation:
func remove bottom top -> (
top
# add row numbers within each group
group top.* (derive _rn = row_number)
# add row numbers within each group
join (from bottom | group top.* (derive _rn = row_number)) (top.* == bottom.*)
filter bottom.* == null
select top.*
)
Problem with this is that I'll have a hard time robustly detecting this pattern and compile to EXCEPT ALL. Additionally some RDMBS don't even support EXCEPT ALL, but only EXCEPT DISTINCT (or implicit DISTINCT).
Thinking about it, behavior of EXCEPT DISTINCT is more useful than EXCEPT ALL, so it's a bit inconvenient for us to have remove
which is equivalent to EXCEPT ALL. So I propose that we either:
remove
to work like EXCEPT DISTINCT,remove
, but keep the docs for how to define exclude yourself and SQL backend code to translate to EXCEPT DISTINCT.There is a similar problem with intersect, but let's talk about that later.
Hmm, I'm trying to think what we could do.
From what I understand, we want to limit the number of keywords in PRQL so we opted for distinct | remove b
instead of except
so that we don't have to introduce the except
keyword. What if we still allow EXCEPT in RQ though and just not in PRQL? So when the compiler finds a distinct | remove b
in PL it translates that into EXCEPT
in RQ, when it finds a remove b
without distinct
then it produces EXCEPT ALL
in RQ (which will produce an error on a backend that doesn't support EXCEPT ALL).
I know we had previously discussed this beautiful mathematical unification of all these operations in terms of simple relational algebra operators but it now appears that this is not so easy to implement in practice so maybe we can adapt our approach a little?
Haven't read this yet but looks like it might be relevant: https://www.crunchydata.com/blog/postgres-query-optimization-left-join-vs-union-all
+1 for ALL keywords support, very useful for data validation during ETL steps
@jangorecki What do you mean?
append
does already compile to UNION ALL
.
Except all and intersect all.
https://rdatatable.gitlab.io/data.table/reference/setops.html
Description of all
argument in this manual page may be useful. I wrote it so if anything is not clear I can elaborate more. Moreover another page linked from references section in this manual provides a lot of information on the subject.
As requested in #656, it would be nice if PRQL could provide
intersect
anddifference
operators along with theunion
operator.intersect
anddifference
can be achieved withsemi joins
andanti joins
. I'm not sure how widely supported those are but there are correlated subquery patterns with EXISTS that can work around this. This will require backend adaptor specific code though so I think we should rather address this in a separate issue in order to keep the union implementation simple. I will open an issue for those.References:
https://en.wikipedia.org/wiki/Relational_algebra#Semijoin_(%E2%8B%89)(%E2%8B%8A)
https://en.wikipedia.org/wiki/Relational_algebra#Antijoin_.28.E2.96.B7.29
https://link.springer.com/chapter/10.1007/978-1-4302-3229-2_11