PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.65k stars 208 forks source link

Intersect and Difference Operators #761

Open snth opened 2 years ago

snth commented 2 years ago

As requested in #656, it would be nice if PRQL could provide intersect and difference operators along with the union operator.

intersect and difference can be achieved with semi joins and anti joins. I'm not sure how widely supported those are but there are correlated subquery patterns with EXISTS that can work around this. This will require backend adaptor specific code though so I think we should rather address this in a separate issue in order to keep the union implementation simple. I will open an issue for those.

References:

https://en.wikipedia.org/wiki/Relational_algebra#Semijoin_(%E2%8B%89)(%E2%8B%8A)

https://en.wikipedia.org/wiki/Relational_algebra#Antijoin_.28.E2.96.B7.29

https://link.springer.com/chapter/10.1007/978-1-4302-3229-2_11

max-sixty commented 2 years ago

I think we can do these with standard INTERSECT operators — along with https://github.com/prql/prql/issues/656? Or are these not well-supported?

e.g. Postgres https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-intersect/

pbsds commented 1 year ago

Leaving this here, for performance reasons. https://www.crunchydata.com/blog/rise-of-the-anti-join

aljazerzen commented 1 year ago

I was thinking how to best implement union and came to realization, that many of the operations can be translated down a small set of operators in relational algebra.

For example our current approach for distinct/unique is:

func unique tbl -> group [identity] (take 1) tbl

This representation with group + take is just intermediate and is unpacked into actual SELECT DISTINCT when translating. But the point is that we keep RQ (the intermediate stage) simple so it's debuggable and easy to implement for new backends.

That said union, intersect and difference can all be implemented only by adding concat operator (dplyr calls it bind_rows, pandas calls it concatenate):

concat a b
-> UNION ALL

union a b
 = concat a b | unique
-> UNION

difference a b =
 = join a [identity] b side:left | filter b.identity == null | select a.*
-> SELECT a.* FROM a LEFT JOIN b ON ... WHERE b.id IS NULL

intersection a b
 = join a [identity] b side:inner | select a.*
-> SELECT a.* FROM a INNER JOIN b ON ...

union, difference and intersection can be defined in std lib as normal functions that resolve to using derivations above. The best things is that when someone writes:

from a
select x, y
concat b
group [x, y] (take 1)

... this will translate to UNION, because it will have the same RQ representation as using union a b.

Additionally, it gives us the freedom to adapt sql backend to different databases. If difference of intersection implementations via JOIN prove to have bad performance on some db engines, we can add preprocessing stage that converts them to EXCLUDE and INTERSECT.


I'm also considering changing representation of JOIN (and maybe its SQL translation):

join a [a.id==b.id] b
 ~ join a b | filter a.id==b.id
-> SELECT ... FROM a, b WHERE a.id = b.id

But I cannot find a way to express LEFT, RIGHT and OUTER JOIN with only FROM and WHERE.

max-sixty commented 1 year ago

I was thinking how to best implement union and came to realization, that many of the operations can be translated down a small set of operators in relational algebra.

Very elegant! I hadn't thought about it like this before, but I think you're absolutely right. This would be a really nice interface, and can still be translated to SQL.

I'm definitely up for this

max-sixty commented 1 year ago

I'm also considering changing representation of JOIN (and maybe its SQL translation):

join a [a.id==b.id] b
 ~ join a b | filter a.id==b.id
-> SELECT ... FROM a, b WHERE a.id = b.id

I'm familiar with this approach — I think it was standard in Oracle DBs a while ago (or at least I've seen folks from that world use it)

My understanding is that it's much less common now, since DBs are worse at optimizing it (though this deserves a fact-check, it doesn't seem too difficult to optimize...), and it hides some of the query's intent — first join, then filter — rather than starting with a full cross-product and then filtering.

So iff my assumptions are correct, I would vote to keep JOIN. OTOH, if we can show that there aren't performance impacts and it simplies the compiler, then +1 from me.

fdr commented 1 year ago

I'm a big fan of the anti-join, I think it's a pity that that SQL makes it syntactically asymmetric with the other kinds of join (especially since NOT IN is so different when NULLs get involved). It very nicely complements the idea of a regular join, which is often a filter on correlated tuple existence, anti-joins are very neatly the inverse.

I really wish I could write "ANTI JOIN" in SQL and use it like any of the other joins. A number of executor people also use the anti-join nomenclature, I can't think of anyone who thinks of it as purely a way of using correlated subqueries, and indeed detecting the anti-join is a popular optimization target.

max-sixty commented 1 year ago

I think we can probably close this now #894 is closed

(not withstanding the discussion re joins & @fdr 's comment, no intention to cut that short)

aljazerzen commented 1 year ago

No, keep it open. Intesection, difference are not done yet.

snth commented 1 year ago

I like you approach @aljazerzen . It's very cool that you can detect the patterns in the RQ, regardless of how they were generated, and then produce the SQL from that. I would love to understand how you do that. In which file/module/part of the compilation process do you do that?


One possible problem I see with your proposed SQL implementations though is that where you have [identity] in the snippets below, in practice this identity will be * (i.e. all columns) so how do you generate the required ON and WHERE clauses? Even if you had access to the table definition and all column names, how efficient would that be?

Therefore for the SQL translation phase it might still be easier to use EXCEPT / INTERSECT or NOT EXISTS / EXISTS. A related blog post Rise of the anti-join.

difference a b = = join a [identity] b side:left | filter b.identity == null | select a. -> SELECT a. FROM a LEFT JOIN b ON ... WHERE b.id IS NULL

intersection a b = join a [identity] b side:inner | select a. -> SELECT a. FROM a INNER JOIN b ON ...

aljazerzen commented 1 year ago

In which file/module/part of the compilation process do you do that?

A bit all over the place, this is the central one.


One possible problem I see

That's a valid concern. I'm not sure yet how i'll do that.

how efficient would that be

Plan is to compare all columns with each other in ON clause. This would work if all columns are known and is the simplest solution rn. I think modern optimizers will kick-in and replace comparison of all columns with comparison of only ids, so it shouldn't be any slower than other approaches.

But for unknown columns this won't work. For that, I'll probably resort to EXCEPT / INTERSECT, as that are supposedly more performant according to the linked post.

aljazerzen commented 1 year ago

A few questions:

snth commented 1 year ago

I was going to raise a question about the names as well.

I think at least for intersection we could go with the shorter intersect which is still valid English and a verb.

difference is also too long for my liking which goes a bit against our ergonomics aim. I was thinking we could maybe go with diff? Not proper English but at least quite common in programming circles? Otherwise what about remove? That seems like it would be the inverse operation of append, both in English and in our usage in PRQL. If you append and then remove the same relation you should mostly get back to your original table (some edge cases with duplicates aside).

Off-topic: My other bugbears are aggregate and derive but I haven't been able to think of appropriate shorter versions.

aljazerzen commented 1 year ago

intersect and remove are ok with me.

Correction: Postgres supports ALL keyword for all three operators. SQLite supports it only for UNION.

If we go with remove, it is not obvious that this is a set operation. Also, we want operations to be as orthogonal as possible, which means not bundling many operations into single function. So I'd vote for not implicitly applying DISTINCT.

snth commented 1 year ago

Looks like DuckDB follows SQLite here.

If I understand the Postgres docs correctly, then EXCEPT ALL does actually the opposite of what I would expect (just from the English), i.e. (pseudocode for brevity) [1,2,2] EXCEPT ALL [2] returns [1,2] while [1,2,2] EXCEPT [2] returns [1].

Why don't we just use except then? And use remove for EXCEPT ALL where that is supported? This would complete the following table:

add subtract
distinct union except
not distinct append remove
aljazerzen commented 1 year ago

I like the table a lot.

SQL behavior is (again) a bit confusing, but if I understand correctly, this is the behavior:

a UNION b         = a | append b | distinct
a UNION ALL b     = a | append b
a EXCEPT b        = a | distinct | remove b
a EXCEPT ALL b    = a | remove b
a INTERSECT b     = a | intersect b | distinct
a INTERSECT ALL b = a | intersect b

note that EXCEPT has distinct before remove

snth commented 1 year ago

I was curious about the a EXCEPT b part whether it would also apply the DISTINCT to the rows not removed. Looks like it does in DuckDB so your map of the behaviour looks correct:

image

max-sixty commented 1 year ago

Just to briefly comment that this sounds good, thanks for working through the concepts.

One question is whether we need two functions for each of Adding and Subtracting, or for each case, one is just the other with | distinct on. Arguments for this:

OTOH:

aljazerzen commented 1 year ago

You make a good point - SQL behavior around "when to apply the distinct" is not the only obvious behavior. And it may not the expected behavior for some problems.

Not having union&expect would be in spirit of "only one way to do it" and "few orthogonal operations".

So my vote now is to have only append & remove, but provide the docs on how to construct your own distinct, union & except.

snth commented 1 year ago

I agree with this and also had the thought when we were drawing up the table that we're now adding a lot of keywords. Totally supportive of having fewer, composable, orthogonal operations, especially if the append | distinct and distinct | remove combinations can be detected in the RQ translation stage and elided into the equivalent SQL (as was commented on here https://github.com/PRQL/prql/issues/761#issuecomment-1333519245 and here https://github.com/PRQL/prql/issues/761#issuecomment-1375742612).

aljazerzen commented 1 year ago

Ok, it looks we have an agreement. This is now a matter of implementation.

aljazerzen commented 1 year ago

We have a problem: the implementation of remove (and intersect) I posted above does not handle duplicates correctly:

func remove bottom top -> (
  top
  join bottom top.* == bottom.*
  filter bottom.* == null
  select top.*
)

If top = [1, 2, 2, 3] and bottom = [1, 2], value in the pipeline after join is:

| top | bottom |
|-----|--------|
| 1   | 1      |
| 2   | 2      |
| 2   | 2      |
| 3   | null   |

... which means that the result contains only [3], but our definition of remove says it should be [2, 3].

If we wanted to do it correctly, this would be the implementation:

func remove bottom top -> (
  top

  # add row numbers within each group
  group top.* (derive _rn = row_number)

  # add row numbers within each group
  join (from bottom | group top.* (derive _rn = row_number)) (top.* == bottom.*)

  filter bottom.* == null
  select top.*
)

Problem with this is that I'll have a hard time robustly detecting this pattern and compile to EXCEPT ALL. Additionally some RDMBS don't even support EXCEPT ALL, but only EXCEPT DISTINCT (or implicit DISTINCT).

Thinking about it, behavior of EXCEPT DISTINCT is more useful than EXCEPT ALL, so it's a bit inconvenient for us to have remove which is equivalent to EXCEPT ALL. So I propose that we either:

There is a similar problem with intersect, but let's talk about that later.

snth commented 1 year ago

Hmm, I'm trying to think what we could do.

From what I understand, we want to limit the number of keywords in PRQL so we opted for distinct | remove b instead of except so that we don't have to introduce the except keyword. What if we still allow EXCEPT in RQ though and just not in PRQL? So when the compiler finds a distinct | remove b in PL it translates that into EXCEPT in RQ, when it finds a remove b without distinct then it produces EXCEPT ALL in RQ (which will produce an error on a backend that doesn't support EXCEPT ALL).

I know we had previously discussed this beautiful mathematical unification of all these operations in terms of simple relational algebra operators but it now appears that this is not so easy to implement in practice so maybe we can adapt our approach a little?

snth commented 1 year ago

Haven't read this yet but looks like it might be relevant: https://www.crunchydata.com/blog/postgres-query-optimization-left-join-vs-union-all

jangorecki commented 9 months ago

+1 for ALL keywords support, very useful for data validation during ETL steps

aljazerzen commented 9 months ago

@jangorecki What do you mean?

append does already compile to UNION ALL.

jangorecki commented 9 months ago

Except all and intersect all. https://rdatatable.gitlab.io/data.table/reference/setops.html Description of all argument in this manual page may be useful. I wrote it so if anything is not clear I can elaborate more. Moreover another page linked from references section in this manual provides a lot of information on the subject.