JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.74k stars 367 forks source link

add a shorthand for groupby + function combo #2253

Closed bkamins closed 4 years ago

bkamins commented 4 years ago

Now we have two separate operations:

Old by has groupby+combine.

Please add your opinions here if we should add the shorthand functions for the operation done in sequence and if yes what is the proposed API. Thank you!

mbauman commented 4 years ago

So I was surprised to see the good old by functionality deprecated — largely because it had been such a central point in the prior docs. I do really like the new strategy here, but I miss the simplicity of having a single function that did what I most commonly wanted.

It's hard to compete with the terseness of by(df, :x, :y=>sum) during explorations — and even in scripts, where I'd just add in explicit output names. Given that by is so easily supported by the straightforward composition of combine ∘ groupby, I'd lean towards keeping it around.

bkamins commented 4 years ago

If we kept by then do you prefer:

by(df, :x, :y=>sum)

or something like

by(df, :y=>sum, cols=:x)

(the latter is more consistent with the general design strategy we have to separate key operations form metadata in args/kwargs, but for exploratory work maybe the former is preferred)

ExpandingMan commented 4 years ago

I think the simplest option is to just introduce some simple aliases for the functions as they exist now in 0.21. To this end, I suggestion the following

(the latter is more consistent with the general design strategy we have to separate key operations form metadata in args/kwargs, but for exploratory work maybe the former is preferred)

If we think of this as a simple alias or "shortcut" like I'm suggesting here, than the former would be consistent. The verbosity of the latter option seems undesirable to me.

nalimilan commented 4 years ago

I don't think we should add combineby, selectby and transformby. We have managed to simplify the API by dropping aggregate, we shouldn't add so many functions. OTOH I wouldn't be opposed to keeping by, as groupby+combine is the most natural combination. Though the name isn't very explicit, and dplyr has managed to live without it even though it's used a lot interactively.

bkamins commented 4 years ago

So I think we have three options:

  1. :+1: re-introduce by
  2. :rocket: introduce 5 functions combineby, selectby, selectby!, transfrormby, and transfrormby!
  3. :-1: do nothing

Given the feedback I vote for :+1:. Can you also please vote on the options using the associated emojis under this post? (just to keep the volume of posts lower as I guess all was said regarding the options)

ExpandingMan commented 4 years ago

Another alternative would be to only introduce combineby. Given that we now have 5 reducer functions, this would make it much clearer which one of those by actually is.

I think I agree that the 5 functions is too much (in my earlier comment I forgot about the mutating versions), but I also think having combine, select and transform and then calling this function by is a little confusing. Unless the plan was for it to combine functionality of all of those? (If this is even possible, I suspect it would also be rather confusing, so perhaps best not to.)

Thoughts on the arguments? Personally I would prefer by(df, cols, :x=>f) to by(df, :x=>f, cols=cols), which I think is consistent if thought of as an alias for combine \circ groupby. What's the rationale for the latter?

pdeffebach commented 4 years ago

I prefer transformby, combineby etc, over a keyword argument, but I am happy with the current groupby syntax.

  1. It reads like a sentence, which is why I think people like by so much. by this group, do these things.
  2. Adding a keyword argument will make long commands hard to read. You will run into things like this
transform(df, 
   :x => mean, 
   [:z, :g, :col] => make_index, 
   Between(:a, :z) => ByRow(sum), 
   by = :group)

It's frustrating to read all that only to find out that the operations listed are by group. The grouping is important and you should let the reader know at the start that it is being grouped.

  1. Selfishly, adding a keyword argument might make DataFramesMeta hard, especially if we don't deprecate the @transform(df, a = :y) keyword argument syntax.
matthieugomez commented 4 years ago

My view is that it’s worth keeping the API as simple as possible, ie we should not add by back. IMO, what’s important is to add better support for piping, which will be provided in DatFramesMeta.

mbauman commented 4 years ago

My initial reaction was simply a combination of two things:

It wasn't a big enough of a deal to open an issue (in my mind), but it was enough that I threw an offhand comment at the #data slack. I was also curious of the rationale — I can get behind most any change if I understand the reasoning. Initially, this appeared to just be longer for verbosity's sake (i.e., it's good to be more explicit), but I now see how much simpler this new design is (for both usage and teaching).

I really don't like the idea of adding all five of combineby and friends. If anything, I'd advocate for eventually repurposing the newly freed by to become the new name for groupby:

combine(by(df, :x), :y=>sum)
matthieugomez commented 4 years ago

Glad you agree :). I also like the groupby -> by renaming.

ExpandingMan commented 4 years ago

Yeah, I do think that eventually renaming groupby to by might be nice.

ExpandingMan commented 4 years ago

Alright, now that I've been using this much of the day, I think I'm getting it and I don't think I'm in favor of introducing any aliases anymore.

The major issue for me is that I now understand why @bkamins did not want to do by(df, cols, args...)... perhaps it makes some kind of sense as an alias, but the inconsistency with calls to the other functions just seems bizarre, this would be very confusing indeed. By the time you get around to introducing by(df, args...; cols=cols) it just doesn't seem worth it any more.

I think maybe we should (at most) eventually change groupby to by, but I see now that adding aliases doesn't make a whole lot of sense.

CameronBieganek commented 4 years ago

I'm in favor of the status quo:

gdf = groupby(df, :a)
combine(gdf, :x => sum)

This shows my dplyr bias. As @matthieugomez mentioned, this reads nicely when you have a pipe:

using Lazy: @>

@> begin
    df
    groupby(:a)
    combine(:x => sum)
end

which is essentially equivalent to the dplyr syntax:

df %>%
    group_by(a) %>%
    summarize(x_sum = sum(x))
Nosferican commented 4 years ago

I like the let and take flavors from https://github.com/gdemin/maditr. I regard it as the best tabular data wrangling in R (data.table + tidyverse + extras).

jkrumbiegel commented 4 years ago

I agree that the grouping in a transform call should not be hidden away in the last keyword argument, as I also favor syntax that reads naturally like a sentence. In my opinion that reduces the cognitive load of the users. I like typing few characters, but I always found by a little too terse, especially because it felt like this sort of groupby operation was somehow encouraged, while I personally often need the transform style. So I don't think the by function should give special treatment to combine.

I have a different proposal. How about shifting the by keyword forward in transform, combine, etc. like this:

struct GroupbySpecifier
    columns # stores column names
end

# by creates a GroupbySpecifier
by(columns...) = GroupbySpecifier(columns...)

# now we make dispatch versions of `transform` etc. where a groupby specifier
# in second position does `groupby` then `transform`
transform(df, spec::GroupbySpecifier, args...; kwargs...) =
    transform(groupby(df, spec.columns), args...; kwargs...)

This would give us these syntax examples:

transform(df, by(:x), :y => sum)
combine(df, by(:x), :y => sum)
select(df, by(:x), :y => sum)
bkamins commented 4 years ago

How about shifting the by keyword forward

It is an interesting idea, and it would not require us to change anything in the current design (we would just unwarp it in pre-processing). It could even allow to perform grouping by transformed values, as noting would stop us from allowing e.g. transform(df, by([:x, :y] => ByRow(==) => :g), ...). Then probably it should be called By not by, but this is a detail.

The only problem is handling of keyword arguments: sort, skipmissing, copycols, keepkeys, and ungroup, but probably we could allow all in the general signatures, and just throw an error if they are passed in the context where they are disallowed.

A small detail to be decided is that now when we process group-by we always put grouping columns first in the output. But probably we could keep it as-is, even after introducing By.

jkrumbiegel commented 4 years ago

I have tried out a couple more ideas, and maybe it could be perceived as weird to have another function by when it's so close to groupby. Of course it's less to type. Always the question what's worth more, saving keystrokes or saving mental space by using fewer keywords.

But the following would also be an option, where we introduce a groupby variant without a DataFrame as the first argument, which would create a closure. Then we'd just dispatch on Functions as the second argument which are supposed to return GroupedDataFrames.

@inline DataFrames.groupby(args...; kwargs...) = df::DataFrame -> groupby(df, args...; kwargs...)

DataFrames.combine(df::DataFrame, groupbyfunc::Function, args...; kwargs...) =
    DataFrames.combine(
        groupbyfunc(df)::GroupedDataFrame,
        args...; kwargs...)

DataFrames.select(df::DataFrame, groupbyfunc::Function, args...; kwargs...) =
    DataFrames.select(
        groupbyfunc(df)::GroupedDataFrame,
        args...; kwargs...)

DataFrames.transform(df::DataFrame, groupbyfunc::Function, args...; kwargs...) =
    DataFrames.transform(
        groupbyfunc(df)::GroupedDataFrame,
        args...; kwargs...)

That syntax would look like this in use:

df = DataFrame(team = rand('a':'c', 200), color = rand([:blue, :red], 200), score = randn(200))

combine(df, groupby(:team), :score => sum)
select(df, groupby(:team), :score => sum)
transform(df, groupby(:team), :score => sum)

combine(df, groupby([:team, :color], sort = true), :score .=> [sum, mean, median])

Actually, now that I'm writing this down, this idea extends to all the other DataFrame functions. Wouldn't it be really easy to skip the Pipe.jl macro syntax and just create closures out of all transformation functions that don't have a DataFrame or some other specifically overloaded type as their first argument?

groupby(args...; kwargs...) = df -> groupby(df, args...; kwargs...)
combine(args...; kwargs...) = df -> combine(df, args...; kwargs...)
transform(args...; kwargs...) = df -> transform(df, args...; kwargs...)
select(args...; kwargs...) = df -> select(df, args...; kwargs...)

Then pipes would become a little cleaner, although I don't know if it's bad to create so many anonymous functions.

df |>
    groupby(:x) |>
    transform(:y => sum) |>
    groupby(:z) |>
    combine(:q => median)
bkamins commented 4 years ago

My feeling that the last proposal would go too far for two reasons:

  1. many people do not use piping style
  2. DataFrames.jl is intended as entry-level package, we should not expose too many advanced concepts to the users (I know this is restrictive but this is the way I feel)
  3. I do not see (but please comment if I missed something) much benefit over the earlier syntax you have proposed with By wrapper

However, the idea in general is interesting and it would be worth to experiment with this approach in a companion package (this is exactly why we want to decouple DataFramesBase.jl so that different high-level APIs can be introduced on top of low-level API provided by it).

bkamins commented 4 years ago

just to comment. The essential difference between

combine(df, By(:x), ...)

and the current

combine(groupby(df, :x), ...)

is two-fold:

  1. it saves some typing (but this can be worked around by renaming groupby to by in the future) - I will comment on this below in details
  2. it would allow to use transformations oin By - I am not 100% sure if it is worth to add this option as it would put an additional mental burden on the users (and as I have said - we want to keep the "reference" querying framework accessible to entry-level users)

Now as for the proposal to replace groupby by by as a name:

  1. It would require a deprecation cycle (it is not the end of the world, but in this case the cycle should be long, as by was used in the past for something else, so we should give legacy code users some time to adjust)
  2. there was a lot of discussions that function names should be preferably "verb"-based, which groupby is and by is not (we could potentially rename groupby to group if we wanted to follow the verb-based rule here, but group seems to be a far too common variable name to use; similarly by removing by in the long run we free up by as a name that users could use for their variable names, which I think might be useful)

So all in all - maybe we should just keep what we have now in 0.21.0 release in the reference querying framework? (and even resist the temptation of shortening groupby to by as a name as it really saves only a few keystrokes -> the run-time of the query will be longer than the time to type these additional keystrokes for sure anyway :smile:).

nalimilan commented 4 years ago

I agree with @bkamins that the status quo is the best solution. groupby is nice and explicit, and incidentally it's the same name as dplyr for the same operation. Renaming it to by would also confuse users who bump on old docs which mention the old by, and users who haven't ported their code in time to 0.21.

jkrumbiegel commented 4 years ago

I agree going back and forth on the functionality for a deprecated keyword is bound to confuse people. It's a good choice to keep the base package simple. I do sometimes worry that most people will just stick to the base package. If that only offers a more verbose syntax, that will become the default how the package is seen, even if there are more elegant solutions in companion packages.

On the other hand, I also feel that my second proposal is too far-reaching. However this turns out, I think that the changes in 0.21 already improved the package by a lot, and will make most of the meta packages unnecessary, which I think is great. (And how nice is it, that in Julia I could just inject my proposed syntax into the package in five lines or so, and everything immediately worked perfectly?)

pdeffebach commented 4 years ago

The option with the smallest footprint and biggest readability improvement is probably an infix operator for groupby. lets say we have

julia> ×(df, x) = groupby(df, x) # × kind of reads like "by" maybe?
julia> transform(df × :a, :b => mean) # transform df by :a...

This probably shouldn't live in DataFrames but it's pretty convenient.

bkamins commented 4 years ago

×(df, x)

This is a nice idea. It would probably cover 99% of cases (as most of the time people do not use skipmissing and sort kwargs).

If we add it it should be added it DataFrames.jl due to type piracy reasons. Actually a natural operator to use is | as it would be similar to formula interface. What do you think?

mbauman commented 4 years ago

I agree infix is nice here, but I'd recommend against punning on an already-defined operator, and × is quite strongly the crossproduct in my mind.

If we add it it should be added it DataFrames.jl due to type piracy reasons.

Not necessarily — another package could happily introduce a new (unused) name for this. Some less well trod territory could be or — which could be seen as splitting or dividing into groups.

pdeffebach commented 4 years ago

If there is a way to define \by[TAB] and have it produce the infix operator we want, then it wouldn't matter as much exactly which one to use. Is that feasible?

CameronBieganek commented 4 years ago

Hmm, I'm not too big a fan of the infix operator approach, to be honest. I prefer

combine(df, By(:a), :b => mean)

even though it requires adding methods to transform et al.

bkamins commented 4 years ago

even though it requires adding methods to transform et al.

I does not require adding methods (but it requires additional internal logic of course).

the infix operator approach

And I like it (but this is just a personal preference so I do not want to enforce it, as I have haid df | cols reads very cleanly to me - it is exactly the same as it would read in statistics). Also df | cols would be most easy to type of all options discussed.

mbauman commented 4 years ago

If there is a way to define \by[TAB] and have it produce the infix operator we want, then it wouldn't matter as much exactly which one to use. Is that feasible?

Hm, I wonder if any other packages have ever added their own completions; it's as easy as REPL.REPLCompletions.latex_symbols["\\by"] = "/̶". The downside is that it requires a dependency on REPL and runtime initialization of a global, which I really wouldn't want to see in DataFrames. We could potentially petition for a builtin \by completion (it'd have to be 1.6 now), but it'd have to pass the Unicoders gauntlet. As a non-ASCII alias, it'd just be an alias.

combine(df, By(:a), :b => mean)

I really don't find this any more compelling than a kwarg — note that you can put kwargs at any point in a method call, including before other positional args.

julia> combine(df, fs...; by) = println("combine(groupby($df, $(sprint(show, by))), $((fs...)))")
combine (generic function with 1 method)

julia> combine("df", by=:y, :x=>sum)
combine(groupby(df, :y), :x => sum)

Edit: but to be clear, I'm quite content with the status quo.

CameronBieganek commented 4 years ago

note that you can put kwargs at any point in a method call, including before other positional args.

Wow, I really thought that you couldn't do that! :joy:

CameronBieganek commented 4 years ago

I guess I could get used to

combine(df | :a, :b => mean)

Though for folks who are less exposed to statistics notation, I'm not sure that it's intuitively obvious what df | :a means.

However, given @mbauman's revelation about keyword arguments, it seems like simply adding the by keyword argument is the best solution.

mbauman commented 4 years ago

Ok, now to advocate for what I actually would like to see (for what it's worth as a non-DataFrames dev), as opposed to just talking about what's technically possible:

jkrumbiegel commented 4 years ago

I kind of like the | idea, of course it's different from what the docstring says but I think it's not a big deal. Shouldn’t the heuristic be that operators should not suggest that they do something different than they actually do? And I don’t see how you could think that this does an or operation in a data frame context. I'd say humans are very good at context dependent thinking, as all the word doublings suggest that don't really trip people up.

pdeffebach commented 4 years ago

There are tons of unicode characters to bikeshed over if we decide to go that route. More unicode characters if we decide to declare one as infix with \by[TAB]. We should probably make a separate issue for that.

ExpandingMan commented 4 years ago

I love unicode characters, but I'm not in favor of abusing existing operations like |, since the existing methods for that have a totally different meaning. It very much goes against the spirit of multiple dispatch.

I'm also very much against adding custom unicode aliases like \by. The whole reason that unicode works so nicely is that it's consistent with LaTeX, if we break with that, things can get real ugly real quick. (I suspect most people over at the julia reop would be on my side with this anyway.) There are tons of existing options for a unicode alias for a function.

I haven't thought about it that much, but at first glance, I think I'd be ok with a by keyword for the transformation functions.

Anyway, I agree with @mbauman in that the new API seems really well thought out, and I think @bkamins and @nalimilan probably spent a lot of time giving it careful consideration, and much of what we're saying here is probably a knee-jerk reaction to change. I'm in favor of minor changes to make common operations more convenient, but I think we should be careful to leave the carefully considered rules of the current API very much intact.

pdeffebach commented 4 years ago

Fwiw, I like ÷. It's like you are dividing the data frame in to it's groups.

CarloLucibello commented 4 years ago

Fwiw, I like ÷. It's like you are dividing the data frame in to it's groups.

Why not just use / for the groupby infix operator? If we identify a column name with an equivalence relation, groupby is essentially the same as computing the quotient set. To me, this looks nice

combine(df / :a, :b => mean)
tkf commented 4 years ago

A slight variation to the infix idea is to getindex, like df[|, :a], df[/, :a] or df[~, :a]. DataFrames.jl already uses ! in a bit unconventional way. Maybe it's kind of nice to call df[|, end-1:end] to group by last two columns.

bkamins commented 4 years ago

The issue raised here gets mixed opinions - different people would find a bit different things best. Fortunately the additions are not contradicting the current design and it is easy enough to add "one lines" to user's code to get some more flexibility.

Therefore I close this issue to keep the API exposed by DataFrames.jl to minimum and let extensions define additional behaviours.

If someone disagrees please comment or reopen.

sa- commented 3 years ago

Given that people read 1/4 as "one by four", I don't think / is a bad idea at all. Thanks @CarloLucibello for the suggestion