Closed bkamins closed 4 years ago
So I was surprised to see the good old by
functionality deprecated — largely because it had been such a central point in the prior docs. I do really like the new strategy here, but I miss the simplicity of having a single function that did what I most commonly wanted.
It's hard to compete with the terseness of by(df, :x, :y=>sum)
during explorations — and even in scripts, where I'd just add in explicit output names. Given that by
is so easily supported by the straightforward composition of combine ∘ groupby
, I'd lean towards keeping it around.
If we kept by
then do you prefer:
by(df, :x, :y=>sum)
or something like
by(df, :y=>sum, cols=:x)
(the latter is more consistent with the general design strategy we have to separate key operations form metadata in args/kwargs, but for exploratory work maybe the former is preferred)
I think the simplest option is to just introduce some simple aliases for the functions as they exist now in 0.21. To this end, I suggestion the following
combineby
, selectby
and transformby
.combineby(df, cols, args...; kwargs...) = combine(groupby(df, cols; kwargs...), args...; kwargs...)
(of course you'd need to split the kwargs
.by
, since it seems less explicit for no good reason, and might obfuscate the fact that there are other ways to do it. However, if the consensus is to bring back by
also, my suggestion would be to bring it back in addition to combineby
, selectby
, transformby
and then just do const by = combineby
.(the latter is more consistent with the general design strategy we have to separate key operations form metadata in args/kwargs, but for exploratory work maybe the former is preferred)
If we think of this as a simple alias or "shortcut" like I'm suggesting here, than the former would be consistent. The verbosity of the latter option seems undesirable to me.
I don't think we should add combineby
, selectby
and transformby
. We have managed to simplify the API by dropping aggregate
, we shouldn't add so many functions. OTOH I wouldn't be opposed to keeping by
, as groupby+combine
is the most natural combination. Though the name isn't very explicit, and dplyr has managed to live without it even though it's used a lot interactively.
So I think we have three options:
by
combineby
, selectby
, selectby!
, transfrormby
, and transfrormby!
Given the feedback I vote for :+1:. Can you also please vote on the options using the associated emojis under this post? (just to keep the volume of posts lower as I guess all was said regarding the options)
Another alternative would be to only introduce combineby
. Given that we now have 5 reducer functions, this would make it much clearer which one of those by
actually is.
I think I agree that the 5 functions is too much (in my earlier comment I forgot about the mutating versions), but I also think having combine
, select
and transform
and then calling this function by
is a little confusing. Unless the plan was for it to combine functionality of all of those? (If this is even possible, I suspect it would also be rather confusing, so perhaps best not to.)
Thoughts on the arguments? Personally I would prefer by(df, cols, :x=>f)
to by(df, :x=>f, cols=cols)
, which I think is consistent if thought of as an alias for combine \circ groupby
. What's the rationale for the latter?
I prefer transformby
, combineby
etc, over a keyword argument, but I am happy with the current groupby
syntax.
by
so much. by
this group
, do
these things. transform(df,
:x => mean,
[:z, :g, :col] => make_index,
Between(:a, :z) => ByRow(sum),
by = :group)
It's frustrating to read all that only to find out that the operations listed are by group. The grouping is important and you should let the reader know at the start that it is being grouped.
@transform(df, a = :y)
keyword argument syntax. My view is that it’s worth keeping the API as simple as possible, ie we should not add by back. IMO, what’s important is to add better support for piping, which will be provided in DatFramesMeta.
My initial reaction was simply a combination of two things:
It wasn't a big enough of a deal to open an issue (in my mind), but it was enough that I threw an offhand comment at the #data slack. I was also curious of the rationale — I can get behind most any change if I understand the reasoning. Initially, this appeared to just be longer for verbosity's sake (i.e., it's good to be more explicit), but I now see how much simpler this new design is (for both usage and teaching).
I really don't like the idea of adding all five of combineby
and friends. If anything, I'd advocate for eventually repurposing the newly freed by
to become the new name for groupby
:
combine(by(df, :x), :y=>sum)
Glad you agree :). I also like the groupby -> by renaming.
Yeah, I do think that eventually renaming groupby
to by
might be nice.
Alright, now that I've been using this much of the day, I think I'm getting it and I don't think I'm in favor of introducing any aliases anymore.
The major issue for me is that I now understand why @bkamins did not want to do by(df, cols, args...)
... perhaps it makes some kind of sense as an alias, but the inconsistency with calls to the other functions just seems bizarre, this would be very confusing indeed. By the time you get around to introducing by(df, args...; cols=cols)
it just doesn't seem worth it any more.
I think maybe we should (at most) eventually change groupby
to by
, but I see now that adding aliases doesn't make a whole lot of sense.
I'm in favor of the status quo:
gdf = groupby(df, :a)
combine(gdf, :x => sum)
This shows my dplyr bias. As @matthieugomez mentioned, this reads nicely when you have a pipe:
using Lazy: @>
@> begin
df
groupby(:a)
combine(:x => sum)
end
which is essentially equivalent to the dplyr syntax:
df %>%
group_by(a) %>%
summarize(x_sum = sum(x))
I like the let
and take
flavors from https://github.com/gdemin/maditr. I regard it as the best tabular data wrangling in R (data.table + tidyverse + extras).
I agree that the grouping in a transform
call should not be hidden away in the last keyword argument, as I also favor syntax that reads naturally like a sentence. In my opinion that reduces the cognitive load of the users. I like typing few characters, but I always found by
a little too terse, especially because it felt like this sort of groupby operation was somehow encouraged, while I personally often need the transform
style. So I don't think the by
function should give special treatment to combine
.
I have a different proposal. How about shifting the by
keyword forward in transform
, combine
, etc. like this:
struct GroupbySpecifier
columns # stores column names
end
# by creates a GroupbySpecifier
by(columns...) = GroupbySpecifier(columns...)
# now we make dispatch versions of `transform` etc. where a groupby specifier
# in second position does `groupby` then `transform`
transform(df, spec::GroupbySpecifier, args...; kwargs...) =
transform(groupby(df, spec.columns), args...; kwargs...)
This would give us these syntax examples:
transform(df, by(:x), :y => sum)
combine(df, by(:x), :y => sum)
select(df, by(:x), :y => sum)
How about shifting the
by
keyword forward
It is an interesting idea, and it would not require us to change anything in the current design (we would just unwarp it in pre-processing). It could even allow to perform grouping by transformed values, as noting would stop us from allowing e.g. transform(df, by([:x, :y] => ByRow(==) => :g), ...)
. Then probably it should be called By
not by
, but this is a detail.
The only problem is handling of keyword arguments: sort
, skipmissing
, copycols
, keepkeys
, and ungroup
, but probably we could allow all in the general signatures, and just throw an error if they are passed in the context where they are disallowed.
A small detail to be decided is that now when we process group-by we always put grouping columns first in the output. But probably we could keep it as-is, even after introducing By
.
I have tried out a couple more ideas, and maybe it could be perceived as weird to have another function by
when it's so close to groupby
. Of course it's less to type. Always the question what's worth more, saving keystrokes or saving mental space by using fewer keywords.
But the following would also be an option, where we introduce a groupby
variant without a DataFrame
as the first argument, which would create a closure. Then we'd just dispatch on Function
s as the second argument which are supposed to return GroupedDataFrame
s.
@inline DataFrames.groupby(args...; kwargs...) = df::DataFrame -> groupby(df, args...; kwargs...)
DataFrames.combine(df::DataFrame, groupbyfunc::Function, args...; kwargs...) =
DataFrames.combine(
groupbyfunc(df)::GroupedDataFrame,
args...; kwargs...)
DataFrames.select(df::DataFrame, groupbyfunc::Function, args...; kwargs...) =
DataFrames.select(
groupbyfunc(df)::GroupedDataFrame,
args...; kwargs...)
DataFrames.transform(df::DataFrame, groupbyfunc::Function, args...; kwargs...) =
DataFrames.transform(
groupbyfunc(df)::GroupedDataFrame,
args...; kwargs...)
That syntax would look like this in use:
df = DataFrame(team = rand('a':'c', 200), color = rand([:blue, :red], 200), score = randn(200))
combine(df, groupby(:team), :score => sum)
select(df, groupby(:team), :score => sum)
transform(df, groupby(:team), :score => sum)
combine(df, groupby([:team, :color], sort = true), :score .=> [sum, mean, median])
Actually, now that I'm writing this down, this idea extends to all the other DataFrame functions. Wouldn't it be really easy to skip the Pipe.jl
macro syntax and just create closures out of all transformation functions that don't have a DataFrame
or some other specifically overloaded type as their first argument?
groupby(args...; kwargs...) = df -> groupby(df, args...; kwargs...)
combine(args...; kwargs...) = df -> combine(df, args...; kwargs...)
transform(args...; kwargs...) = df -> transform(df, args...; kwargs...)
select(args...; kwargs...) = df -> select(df, args...; kwargs...)
Then pipes would become a little cleaner, although I don't know if it's bad to create so many anonymous functions.
df |>
groupby(:x) |>
transform(:y => sum) |>
groupby(:z) |>
combine(:q => median)
My feeling that the last proposal would go too far for two reasons:
By
wrapperHowever, the idea in general is interesting and it would be worth to experiment with this approach in a companion package (this is exactly why we want to decouple DataFramesBase.jl so that different high-level APIs can be introduced on top of low-level API provided by it).
just to comment. The essential difference between
combine(df, By(:x), ...)
and the current
combine(groupby(df, :x), ...)
is two-fold:
groupby
to by
in the future) - I will comment on this below in detailsBy
- I am not 100% sure if it is worth to add this option as it would put an additional mental burden on the users (and as I have said - we want to keep the "reference" querying framework accessible to entry-level users)Now as for the proposal to replace groupby
by by
as a name:
by
was used in the past for something else, so we should give legacy code users some time to adjust)groupby
is and by
is not (we could potentially rename groupby
to group
if we wanted to follow the verb-based rule here, but group
seems to be a far too common variable name to use; similarly by removing by
in the long run we free up by
as a name that users could use for their variable names, which I think might be useful)So all in all - maybe we should just keep what we have now in 0.21.0 release in the reference querying framework? (and even resist the temptation of shortening groupby
to by
as a name as it really saves only a few keystrokes -> the run-time of the query will be longer than the time to type these additional keystrokes for sure anyway :smile:).
I agree with @bkamins that the status quo is the best solution. groupby
is nice and explicit, and incidentally it's the same name as dplyr for the same operation. Renaming it to by
would also confuse users who bump on old docs which mention the old by
, and users who haven't ported their code in time to 0.21.
I agree going back and forth on the functionality for a deprecated keyword is bound to confuse people. It's a good choice to keep the base package simple. I do sometimes worry that most people will just stick to the base package. If that only offers a more verbose syntax, that will become the default how the package is seen, even if there are more elegant solutions in companion packages.
On the other hand, I also feel that my second proposal is too far-reaching. However this turns out, I think that the changes in 0.21 already improved the package by a lot, and will make most of the meta packages unnecessary, which I think is great. (And how nice is it, that in Julia I could just inject my proposed syntax into the package in five lines or so, and everything immediately worked perfectly?)
The option with the smallest footprint and biggest readability improvement is probably an infix operator for groupby
. lets say we have
julia> ×(df, x) = groupby(df, x) # × kind of reads like "by" maybe?
julia> transform(df × :a, :b => mean) # transform df by :a...
This probably shouldn't live in DataFrames but it's pretty convenient.
×(df, x)
This is a nice idea. It would probably cover 99% of cases (as most of the time people do not use skipmissing
and sort
kwargs).
If we add it it should be added it DataFrames.jl due to type piracy reasons. Actually a natural operator to use is |
as it would be similar to formula interface. What do you think?
I agree infix is nice here, but I'd recommend against punning on an already-defined operator, and ×
is quite strongly the crossproduct in my mind.
If we add it it should be added it DataFrames.jl due to type piracy reasons.
Not necessarily — another package could happily introduce a new (unused) name for this. Some less well trod territory could be /̶
⧷
or ⨸
— which could be seen as splitting or dividing into groups.
If there is a way to define \by[TAB]
and have it produce the infix operator we want, then it wouldn't matter as much exactly which one to use. Is that feasible?
Hmm, I'm not too big a fan of the infix operator approach, to be honest. I prefer
combine(df, By(:a), :b => mean)
even though it requires adding methods to transform
et al.
even though it requires adding methods to transform et al.
I does not require adding methods (but it requires additional internal logic of course).
the infix operator approach
And I like it (but this is just a personal preference so I do not want to enforce it, as I have haid df | cols
reads very cleanly to me - it is exactly the same as it would read in statistics). Also df | cols
would be most easy to type of all options discussed.
If there is a way to define \by[TAB] and have it produce the infix operator we want, then it wouldn't matter as much exactly which one to use. Is that feasible?
Hm, I wonder if any other packages have ever added their own completions; it's as easy as REPL.REPLCompletions.latex_symbols["\\by"] = "/̶"
. The downside is that it requires a dependency on REPL
and runtime initialization of a global, which I really wouldn't want to see in DataFrames. We could potentially petition for a builtin \by
completion (it'd have to be 1.6 now), but it'd have to pass the Unicoders gauntlet. As a non-ASCII alias, it'd just be an alias.
combine(df, By(:a), :b => mean)
I really don't find this any more compelling than a kwarg — note that you can put kwargs at any point in a method call, including before other positional args.
julia> combine(df, fs...; by) = println("combine(groupby($df, $(sprint(show, by))), $((fs...)))")
combine (generic function with 1 method)
julia> combine("df", by=:y, :x=>sum)
combine(groupby(df, :y), :x => sum)
Edit: but to be clear, I'm quite content with the status quo.
note that you can put kwargs at any point in a method call, including before other positional args.
Wow, I really thought that you couldn't do that! :joy:
I guess I could get used to
combine(df | :a, :b => mean)
Though for folks who are less exposed to statistics notation, I'm not sure that it's intuitively obvious what df | :a
means.
However, given @mbauman's revelation about keyword arguments, it seems like simply adding the by
keyword argument is the best solution.
Ok, now to advocate for what I actually would like to see (for what it's worth as a non-DataFrames dev), as opposed to just talking about what's technically possible:
|
for groupby
; it goes against the docstring for |
.by
keyword in the combine/select[!]/transform[!] operator functions. I'm marginally against it existing at all; as I've worked more with this I've liked these new thought patterns better.groupby
, especially if we can find a good symbol and manage to get a \by
alias into the REPL stdlib.I kind of like the | idea, of course it's different from what the docstring says but I think it's not a big deal. Shouldn’t the heuristic be that operators should not suggest that they do something different than they actually do? And I don’t see how you could think that this does an or operation in a data frame context. I'd say humans are very good at context dependent thinking, as all the word doublings suggest that don't really trip people up.
There are tons of unicode characters to bikeshed over if we decide to go that route. More unicode characters if we decide to declare one as infix with \by[TAB]
. We should probably make a separate issue for that.
I love unicode characters, but I'm not in favor of abusing existing operations like |
, since the existing methods for that have a totally different meaning. It very much goes against the spirit of multiple dispatch.
I'm also very much against adding custom unicode aliases like \by
. The whole reason that unicode works so nicely is that it's consistent with LaTeX, if we break with that, things can get real ugly real quick. (I suspect most people over at the julia reop would be on my side with this anyway.) There are tons of existing options for a unicode alias for a function.
I haven't thought about it that much, but at first glance, I think I'd be ok with a by
keyword for the transformation functions.
Anyway, I agree with @mbauman in that the new API seems really well thought out, and I think @bkamins and @nalimilan probably spent a lot of time giving it careful consideration, and much of what we're saying here is probably a knee-jerk reaction to change. I'm in favor of minor changes to make common operations more convenient, but I think we should be careful to leave the carefully considered rules of the current API very much intact.
Fwiw, I like ÷
. It's like you are dividing the data frame in to it's groups.
Fwiw, I like ÷. It's like you are dividing the data frame in to it's groups.
Why not just use /
for the groupby infix operator? If we identify a column name with an equivalence relation, groupby is essentially the same as computing the quotient set.
To me, this looks nice
combine(df / :a, :b => mean)
A slight variation to the infix idea is to getindex, like df[|, :a]
, df[/, :a]
or df[~, :a]
. DataFrames.jl already uses !
in a bit unconventional way. Maybe it's kind of nice to call df[|, end-1:end]
to group by last two columns.
The issue raised here gets mixed opinions - different people would find a bit different things best. Fortunately the additions are not contradicting the current design and it is easy enough to add "one lines" to user's code to get some more flexibility.
Therefore I close this issue to keep the API exposed by DataFrames.jl to minimum and let extensions define additional behaviours.
If someone disagrees please comment or reopen.
Given that people read 1/4
as "one by four", I don't think /
is a bad idea at all. Thanks @CarloLucibello for the suggestion
Now we have two separate operations:
groupby
creates aGroupedDataFrame
combine
/select
/select!
/transform
/transform!
that operate onGroupedDataFrame
to produce a result.Old
by
hasgroupby
+combine
.Please add your opinions here if we should add the shorthand functions for the operation done in sequence and if yes what is the proposed API. Thank you!