Closed max-sixty closed 2 years ago
I think this covers all cases (in SQL spec at least).
In practice, in some databases (I believe MySQL) GROUP BY will return ordered rows. Additionally, some databases can optimize DISTINCT queries better than GROUP BY, but in others it the other way around. That's why I think we will eventually have to support some way of expressing both of them.
Regarding the language: while in effect this is very similar to aggregation/grouping, the thought process is quite different. But I think its better that we have only one way of doing something, so we should not add new keywords for operation our language already supports.
Another example of a distinct query: getting list of families from list of users:
from users
aggregate by:[last_name, address]
or
from users
select [last_name, address]
aggregate by:*
Why do we have aggregate
instead of group
?
from users
group by:[last_name, address]
... would make more sense. We can think of arguments as:
from salaries
group by:employee_id aggregate:[sum salary]
... but omit named arg for aggregate
.
PS: You could actually drop one first_name
in aggregate
in your query:
from employees
aggregate by:first_name
Agree with this!
On the one question:
Why do we have
aggregate
instead ofgroup
?
Mostly because group by
doesn't contain the actual aggregation in SQL, and I wanted to ensure that wasn't confusing. (And I think that distinction is a great thing about PRQL; Hadley Wickham called it one of his greatest regrets about dplyr in the HN discussion).
I just tried adding docs for this, but I think recently this changed:
from employees
aggregate by:first_name [first_name]
Error:
╭─[:2:26]
│
2 │ aggregate by:first_name [first_name]
· ─────┬────
· ╰────── Ambiguous variable. Could be from either of {"employees", "$"}
───╯
I think this should work.
And without the positional arg (which I think with the existing definition is still required, because positional args are not optional:
from employees
aggregate by:first_name
Error: `aggregate` expected only one argument, but found none
WDYT @aljazerzen ?
I think that aggregate
should include by
parameter in the frame (as it does now), so there is no need to specify it again in the list.
So it would be like your second example:
from employees
aggregate by:first_name
Note that, when we introduce typing (even for functions), we should also constrain the positional of the aggregate to allow only "multi-variate functions" (many values to one), so specifying just a column would be invalid type.
Actions needed:
remove the requirement of aggregate to have the positional, default to empty list
I'm fine trying this for a while. But it violates the rules for user functions (https://lang.prql.builders/functions.html), and it's more coherent if we don't widen the split between user & compiler functions...
With proposed changes from #300, I think that distinct should look like:
from users
group [last_name, address] (
take 1
)
and for case of #9 (for each customer return order with the highest total):
from purchases
group customer_id [
sort desc:total
take 1
]
and for case of #9 (for each customer return order with the highest total):
from purchases group customer_id [ sort desc:total take 1 ]
Definitely. This is so nice.
With proposed changes from #300, I think that distinct should look like:
from users group [last_name, address] ( take 1 )
I think this could work. It's quite imperative, relative to declarative — i.e. if someone said "get me the unique last names & addresses", I could tell them "group all the values into last name & address, and then select one from each group". But it's less simple than their request.
It does fit with the proposed model nicely, though. And there are going to be lots of times where we have to trade off Breadth — more abstractions (like a distinct
transformation) — and Simplicity — fewer abstractions (like having to describe the transformation here).
That is true, PRQL as a whole is getting more imperative than declarative.
declarative: "get me this and that from there groupped by this"
imperative: "from there, select this and that and group by this"
We also have only one way of doing this - in SQL you can use DISTINCT or do a GROUP BY and this makes queries harder to understand, because you have to look for the same operation at two different places. C++ has similar problem of "let's support everything", which adopting a codebase in C++ a hard task, because things like initialization is done in a different way than what you are used to. But both are correct!
So I think it is beneficial if we strive to have only one way of expressing any operation.
Even though PRQL does not read like a sentence, I do believe it is already more readable than SQL.
Closed by #503 🙏
How should we best represent
distinct
?I would probably vote for it as part of an
aggregate
transform:which is equivalent to:
But is there something that an actual
DISTINCT
does that isn't covered by this approach?