PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.94k stars 218 forks source link

How to represent `DISTINCT` #292

Closed max-sixty closed 2 years ago

max-sixty commented 2 years ago

How should we best represent distinct?

I would probably vote for it as part of an aggregate transform:

from employees
aggregate first_name by:first_name

which is equivalent to:

select first_name
from employees
group by first_name

But is there something that an actual DISTINCT does that isn't covered by this approach?

aljazerzen commented 2 years ago

I think this covers all cases (in SQL spec at least).

In practice, in some databases (I believe MySQL) GROUP BY will return ordered rows. Additionally, some databases can optimize DISTINCT queries better than GROUP BY, but in others it the other way around. That's why I think we will eventually have to support some way of expressing both of them.

Regarding the language: while in effect this is very similar to aggregation/grouping, the thought process is quite different. But I think its better that we have only one way of doing something, so we should not add new keywords for operation our language already supports.

Another example of a distinct query: getting list of families from list of users:

from users
aggregate by:[last_name, address]

or

from users
select [last_name, address]
aggregate by:*

Why do we have aggregate instead of group?

from users
group by:[last_name, address]

... would make more sense. We can think of arguments as:

from salaries
group by:employee_id aggregate:[sum salary]

... but omit named arg for aggregate.

PS: You could actually drop one first_name in aggregate in your query:

from employees
aggregate by:first_name
max-sixty commented 2 years ago

Agree with this!

On the one question:

Why do we have aggregate instead of group?

Mostly because group by doesn't contain the actual aggregation in SQL, and I wanted to ensure that wasn't confusing. (And I think that distinction is a great thing about PRQL; Hadley Wickham called it one of his greatest regrets about dplyr in the HN discussion).

max-sixty commented 2 years ago

I just tried adding docs for this, but I think recently this changed:

from employees
aggregate by:first_name [first_name]
Error: 
   ╭─[:2:26]
   │
 2 │ aggregate by:first_name [first_name]
   ·                          ─────┬────  
   ·                               ╰────── Ambiguous variable. Could be from either of {"employees", "$"}
───╯

I think this should work.

And without the positional arg (which I think with the existing definition is still required, because positional args are not optional:

from employees
aggregate by:first_name 
Error: `aggregate` expected only one argument, but found none

WDYT @aljazerzen ?

aljazerzen commented 2 years ago

I think that aggregate should include by parameter in the frame (as it does now), so there is no need to specify it again in the list. So it would be like your second example:

from employees
aggregate by:first_name

Note that, when we introduce typing (even for functions), we should also constrain the positional of the aggregate to allow only "multi-variate functions" (many values to one), so specifying just a column would be invalid type.

Actions needed:

max-sixty commented 2 years ago

remove the requirement of aggregate to have the positional, default to empty list

I'm fine trying this for a while. But it violates the rules for user functions (https://lang.prql.builders/functions.html), and it's more coherent if we don't widen the split between user & compiler functions...

aljazerzen commented 2 years ago

With proposed changes from #300, I think that distinct should look like:

from users
group [last_name, address] (
  take 1
)

and for case of #9 (for each customer return order with the highest total):

from purchases
group customer_id [
  sort desc:total
  take 1
]
max-sixty commented 2 years ago

and for case of #9 (for each customer return order with the highest total):

from purchases
group customer_id [
  sort desc:total
  take 1
]

Definitely. This is so nice.

With proposed changes from #300, I think that distinct should look like:

from users
group [last_name, address] (
  take 1
)

I think this could work. It's quite imperative, relative to declarative — i.e. if someone said "get me the unique last names & addresses", I could tell them "group all the values into last name & address, and then select one from each group". But it's less simple than their request.

It does fit with the proposed model nicely, though. And there are going to be lots of times where we have to trade off Breadth — more abstractions (like a distinct transformation) — and Simplicity — fewer abstractions (like having to describe the transformation here).

aljazerzen commented 2 years ago

That is true, PRQL as a whole is getting more imperative than declarative.

declarative: "get me this and that from there groupped by this"

imperative: "from there, select this and that and group by this"

We also have only one way of doing this - in SQL you can use DISTINCT or do a GROUP BY and this makes queries harder to understand, because you have to look for the same operation at two different places. C++ has similar problem of "let's support everything", which adopting a codebase in C++ a hard task, because things like initialization is done in a different way than what you are used to. But both are correct!

So I think it is beneficial if we strive to have only one way of expressing any operation.

Even though PRQL does not read like a sentence, I do believe it is already more readable than SQL.

max-sixty commented 2 years ago

Closed by #503 🙏