Closed matthieugomez closed 4 years ago
This is essentially https://github.com/JuliaData/DataFrames.jl/issues/2172, which will be added later (as it is non-breaking). We leave `by` mainly for backward compatibility reasons.
As for `mutate` vs `summarize` - why do you see that this distinction is needed? If I understand the functionality there correctly we already provide it with a single function (currently `select`/`transform` and `combine`/`by`).
EDIT
Just to expand: `transform(groupby(df, :col1), :col2 => mean)` will keep the result a `GroupedDataFrame` (this is also what dplyr does), while `by` will transform it to a `DataFrame` as it does now. So the API will be minimally different.
So - to summarize - unless you have some comment on why we should distinguish "single row" vs "all rows" functionality (we currently do not and dplyr does, but I do not see a huge benefit of this distinction), this issue can be closed. Please comment in #2172 on the functionality `select`/`transform` should have for `GroupedDataFrame` (the key discussion is the row order of the output, as you will note there).
Yes, I think different methods should do single rows vs all rows. The current syntax for the top right quadrant vs bottom right quadrant is not good. It's very important to separate between the two when doing data wrangling.
Also, even looking only at the top quadrant, `by` vs `transform` is not satisfying. I would rather have:
- `transform(df::DataFrame, :x => mean)` -> `DataFrame`
- `transform(df::DataFrame, :x => mean, by = :id)` -> `DataFrame`
- `transform(df::GroupedDataFrame, :x => mean)` -> `GroupedDataFrame`
> It's very important to separate between the two when doing data wrangling.

Can you please comment on this more? Why is it so important? My understanding is that what the function you apply returns determines the shape of the output.
EDIT: in particular this function can return neither 1 row nor as many rows as were in the original group, but something completely different.

> I would rather have:

This is doable and essentially depends on whether we want to allow a `by` argument in `transform`; it can be added in the future.
See this discourse for instance: https://discourse.julialang.org/t/dataframesmeta-jl-and-the-state-of-the-dataframes-ecosystem/36221/7
So essentially you would want to extend the `keepkeys` kwarg functionality in the following way:
- `true`: keep grouping columns in the original shape (this is what we do not have now and can be easily added)
- `false`: drop grouping columns
- `nothing`: current behavior and the default (and currently this happens for `keepkeys=true` but we could change it)

Do I understand your request correctly?
I would like `by` to be deprecated and split into two functions, `aggregate(..., by)` vs `transform(..., by)`. I don't think I have anything to say wrt `keepkeys`.
but `by(...)` would not be the same as `transform(..., by)` but rather the same as `select(..., by)`.
Also currently we cannot use `aggregate`, because we have to go through a deprecation period.
Finally then `aggregate` should be two functions - one pairing `select` and another pairing `transform`. What names would you use for both functionalities?
So that is why I am asking if it is OK with you if we achieve the functionality you describe using only kwargs that would be added to `select` and `transform`:
- [...] `keepkeys` works (this should be done before 0.21 release)
- [...] `by` keyword argument to them and support for passing `GroupedDataFrame` (this is for the future), also then probably `combine` kwarg should be added to them (with different default depending on if `AbstractDataFrame` or `GroupedDataFrame` is passed)

If we went this way then `by` and also `combine` should get deprecated. But adding `by` kwarg and the consequences probably will not happen in 0.21 release.
Just to add - what you write you want is currently available by writing (disregarding some corner cases for a while not to complicate the discussion):
`by(df, :key, :key, :some_col => mean)`
We would simply make `by(df, :key, :some_col => mean, keepkeys=true)` expand to that form.
I tried to make my point without talking about whether or not you want to keep all other columns. With this added complication, the full table in dplyr is:
transmute(df, mean(x)) | transmute(group_by(df, id), mean(x)) |
---|---|
mutate(df, mean(x)) | mutate(group_by(df, id), mean(x)) |
summarize(df, mean(x)) | summarize(group_by(df, id), mean(x)) |
Note that there is no need for summarize that keeps all existing columns (not sure what you mean by "aggregate should be two functions...").
I am suggesting:
select(df, :x => mean) | select(df, :x => mean, by = :id) |
---|---|
transform(df, :x => mean) | transform(df, :x => mean, by = :id) |
aggregate(df, :x => mean) | aggregate(df, :x => mean, by = :id) |
So in short, I would like `by` to be deprecated, and split between `select(..., by)` and `aggregate(..., by)` (the name `aggregate` is just an example, it can be any other name). As you say, `transform` can get a `by` argument later.
I have to stop now, I will comment on it later, but the crucial part is `select(df, :x => mean)` which currently in DataFrames.jl will be the same as `aggregate(df, :x => mean)`, not `transmute(df, mean(x))`.
The key thing is that we accept any length of the output (not only 1 row or `nrow(df)` rows). I will have to think about it and comment.
Ok. Yes, sorry, I do not completely understand the current syntax in master. Hopefully, though, I have communicated the kind of syntax I am hoping for.
Yes - I think it is clear. Let me summarize how I now think we can handle all the cases (also from the https://github.com/JuliaData/DataFrames.jl/issues/2172 discussion). I describe `select` but `transform` is analogous:
- deprecate `by` and `combine` (as they will not be needed)
- `select(df::AbstractDataFrame, ...; keeprows=true)` => make sure that the returned data frame has number of rows exactly equal to `nrow(df)`
- `select(df::AbstractDataFrame, ...; keeprows=false)` => returned data frame can have any number of rows - determined by the return values; this is the default
- `select(gdf::GroupedDataFrame, ...; combine=true, keeprows=true)` => return a data frame and make sure that each group has exactly the same number of rows and in the same order as in `parent(gdf)`; if `keeprows=true` then `sort` argument in `groupby` is ignored
- `select(gdf::GroupedDataFrame, ...; combine=false, keeprows=true)` => return a grouped data frame and make sure that each group has exactly the same number of rows and in the same order as in `parent(gdf)`; if `keeprows=true` then `sort` argument in `groupby` is ignored
- `select(gdf::GroupedDataFrame, ...; combine=true, keeprows=false)` => return a data frame with row order determined by the return values of `...` arguments and then `sort` argument in `groupby` is respected
- `select(gdf::GroupedDataFrame, ...; combine=false, keeprows=false)` => return a grouped data frame with row order determined by the return values of `...` arguments and then `sort` argument in `groupby` is respected; this is the default for `GroupedDataFrame` option
- [...] `DataFrame(::GroupedDataFrame; keeprows)` also add `keeprows` keyword argument that will decide what the order of groups should be in the result

@nalimilan - I can quickly implement it if we go for this. I guess all needs of @matthieugomez are satisfied with this design and also the requirement of tracking of row order after producing a data frame. For 0.21 it would be a simple implementation (not the fastest possible), but we can improve performance after 0.21 release.
Thank you for your input on this. I think there are definitely changes to make here.
[...] `bysort(id) egen x = mean(x)` in Stata instead of a keyword argument. As a result, I find `by` very readable and intuitive. I think it reads well: "by this variable, do these operations". I would be sad to see it deprecated. `select` will automatically collapse if it can. Think of `select` as having a force pushing down on the data frame that only meets resistance if it sees a column with the full length. As a result, it does what you describe `aggregate` to be. Consequently, the current implementation of your table looks like this:

not available | not available |
---|---|
select(df, :, :x => mean) | by(df, :id, :, :x => mean) |
select(df, :x => mean) | by(df, :id, :x => mean) |

We do not have an operation to "spread" the mean of a variable across the full data frame but drop other columns. This is a genuine lapse and I thank you for pointing it out.
I like `by`. I think it is readable. Here is my attempt at a totally new syntax that I think would avoid all confusion. Here `select` does not collapse by default. Rather it "spreads" results to match the input data frame length.

select(df, :x => mean) | selectby(df, :id, :x => mean) |
---|---|
transform(df, :x => mean) | transformby(df, :id, :, :x => mean) |
collapse(df, :x => mean) | collapseby(df, :id, :x => mean) |

As you mentioned, there is a whole other dimension to this when we think about putting in grouped data frames. Below is my attempt at a full list.
- data frame, keep cols, non-grouped operations, return dataframe with same size:
  - `transform(df, :x => mean)`
  - `select(df, :, :x => mean)`
- data frame, drop cols, non-grouped operations, return dataframe with same size:
  - empty
  - `select(df, :x => (t -> fill(mean(t), length(t))))`, which is a mouthful.
- data frame, keep cols, grouped operations, return dataframe with same size:
  - `by(df, :, :id, :x => mean)`
  - `transform(df, :x => mean; by = :id)` (I don't like this because you have to read a lot)
  - `transformby(df, :id, :x => mean)` (Would work if we renamed `transform` to `gen`)
- data frame, drop cols, grouped operations, return dataframe with collapsed size:
  - `by(df, :id, :x => mean)`
  - `aggregate(df, :id, :x => mean)`
- data frame, drop cols, grouped operations, return dataframe with same size:
  - empty (see above)
  - `select(df, :x => mean, by = :id)`
  - `selectby(df, :id, :x => mean)`
- grouped data frame, keep cols, grouped operations, return grouped dataframe with same size:
  - `transform(gd, :x => mean)`
  - `select(gd, :, :x => mean)`
- grouped data frame, drop cols, grouped operations, return grouped data frame with collapsed size:
  - `select(gd, :x => mean)`
- grouped data frame, keep cols, grouped operations, return dataframe with same size:
  - `combine(gd, :, :x => mean)`
- grouped data frame, drop cols, grouped operations, return dataframe with collapsed size:
  - `combine(gd, :, :x => mean)`

I don't think I entirely agree with @bkamins's synthesis of my proposal. My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations. Moreover dataframes remain aligned: the row number 10 in the previous dataframe refers to the same observation as the row number 10 in the next dataframe.
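To make the two contracts concrete, here is a minimal sketch in plain Julia. It deliberately does not use DataFrames.jl: plain vectors stand in for columns, and the helper names `group_mean_keep_rows` and `group_mean_collapse` are hypothetical, introduced only for illustration.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical toy data: plain vectors stand in for data frame columns.
id = [1, 2, 1, 2, 1]
x  = [1.0, 10.0, 2.0, 20.0, 3.0]

# "Keep rows" contract: one output value per input row, in input order,
# so row i of the result describes the same observation as row i of the input.
function group_mean_keep_rows(id, x)
    out = Vector{Float64}(undef, length(x))
    for g in unique(id)
        rows = findall(==(g), id)
        out[rows] .= mean(x[rows])
    end
    return out
end

# "Collapse" contract: one output value per group, so observations are lost.
function group_mean_collapse(id, x)
    groups = unique(id)
    return groups, [mean(x[id .== g]) for g in groups]
end

kept = group_mean_keep_rows(id, x)
groups, collapsed = group_mean_collapse(id, x)
```

With the first helper the result can be placed next to `x` row by row; with the second it cannot, which is the alignment guarantee argued for above.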
To be honest, I'm not sure what the main difference between transform and select is in your proposal?
@pdeffebach I'm agnostic on whether we should have a `transmute` actually. It can simply be obtained by using `transform` (i.e. a function that keeps all rows and columns) and then `select`, so it's fine to not have it.
For `transformby`, I think it's a bit inelegant to change the method name depending on whether the operation should be done by group or not. I don't get why a `by` kwarg does not work for you. But if people do hate this, I'm happy with just `transform(x::GroupedDataFrame, ....)`, that returns a grouped data frame, which can be `combine`'d afterwards, as long as row order remains the same.
> My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations.

I fully agree with this. I think that `select` shouldn't collapse by default and we should have a separate function name for `collapse`ing. Too many keyword arguments will get complicated. I propose `collapse` if we don't get sued by Stata for using it.
> I think it's a bit inelegant to change the method name depending on whether the operation should be done by group or not. I don't get why a `by` kwarg does not work for you.

My opposition to keyword arguments is
- [...] `by` is easy to read. `transformby(df, :id, ...)` reads like a sentence: "by the column `:id`, transform the data frame...".
- [...] `transform(df, :income => mean, :income => std, :educ => first, :startage => first, by = :id)` only to find out the operations are performed by group at the end.
I agree that `transformby` is hard to type. I wouldn't mind `gen` being used instead of `transform`, i.e. `gen` and `genby`. But I don't know how much support I would get for my idea.
> I'm happy with just `transform(x::GroupedDataFrame, ....)`, that returns a groupeddataframe, and can be `combine`'d afterwards

I'm happy with this, too. But you have convinced me of the need for `collapse` and `select` to be separate, at the very least.
I had a long discussion with @nalimilan about it but also having read your comments I have updated my thinking a bit and have the following conclusion.
We should have the following functions:
- `select` for data frames => like current `select` but ensuring that number of rows in the output is `nrow` of input
- `transform` for data frames => like current `transform` but ensuring that number of rows in the output is `nrow` of input (now it is not guaranteed)
- `collapse` for data frames (I like the name as it is explicit) => this is exactly current `select` (i.e. it takes everything as is and collapses as much as possible, but in particular it can also produce `nrow` of input rows if transformations requested produce such a number of rows)
- `select` and `transform` should not work on `GroupedDataFrame` - the reason is that in Julia `GroupedDataFrame` is not just a data frame with information about grouping columns, such an object can have dropped groups or reordered them, so the basic contract of `select` and `transform` the way you propose them cannot be kept
- `select` and `transform` can get a `by` keyword argument (or be named `selectby` or `transformby` - which approach is better can be discussed); in this case no groups are dropped (so we do not provide `sort` nor `skipmissing` etc. functionalities) as we want to guarantee that the result still has `nrow` of input rows; in this case we guarantee that the output is a `DataFrame` that has the same order and number of rows as the input
- `map`, `combine` and `by` would work as they are working now (they are versions of `collapse` but for `GroupedDataFrame`); why are they needed? - because `GroupedDataFrame` can be subsetted, reordered etc. - if someone does this then probably one wants to preserve the consequences of these operations in the result; we can discuss the names and in particular if `by` is really needed (it is just a convenience for the `groupby`+`combine` combo)
- `DataFrame` for grouped data frames should get a `keepkeys` keyword argument for consistency with `combine`

So in summary - the major difference from `dplyr` is that in DataFrames.jl `groupby` has a slightly different functionality than there and the consequence is that the table you have proposed would look like (assuming `by` kwarg just to keep one convention):
select(df, :x => mean) | select(df, :x => mean, by=:id) |
---|---|
transform(df, :x => mean) | transform(df, :x => mean, by=:id) |
collapse(df, :x => mean) | by(df, :id, :x => mean) |
With the only caveat that `collapse` could produce more than one row depending on what the transformations request. Similarly `by`.
Then on top of this table we have `combine` and `map` on `GroupedDataFrame` that are an independent functionality serving a different purpose as `GroupedDataFrame` itself can be transformed as I noted.
Well, that would be awesome...
I guess my only question is: will it complicate the package too much? I realize that you and @nalimilan are basically the only maintainers these days, so I get that we want to avoid a multiplication of methods.
On a related note, I guess I don't really understand why `GroupedDataFrame` is needed then, if `select`, `transform`, `collapse` get a `by` argument. Would it not dramatically simplify things to remove it or put it in a separate package?
`GroupedDataFrame` is needed because it also:
- [...] `GroupedDataFrame`, subset it, work on groups using e.g. loop; like a recent discussion here: https://github.com/JuliaData/DataFrames.jl/issues/2194 which was about the need for such a functionality)
- [...] `SubDataFrame` matching this key) - this is something that is needed very often, e.g. JuliaDB.jl by design allows to set a key but `DataFrame` does not

The basic building blocks of what is proposed above are something we have now:
- `combine` (which would stay unaltered)
- `select` (which would be renamed to `collapse`)

Now (technical implementation might differ a bit):
- `select` and `transform` without `by` (so not grouping) would simply ensure that all output has `nrow(df)` rows, which is very simple (you just need to check if the first column you produce has this number of rows, and the rest is the same as we already make this check anyway but just allowed any number of rows)
- `select` and `transform` with `by` (so grouping) would just use the `by` function but adding one virtual column, call it `:__dummy__`, to the data frame that would contain `axes(df, 1)` values, then we would add just one operation `:__dummy__` (i.e. retain the `__dummy__` column) to `by` processing. In this way we ensure: a) each group may not change its length, b) after `by` finishes we use the `idx` field of `GroupedDataFrame` to recover the original order of rows in linear time

So in summary - the change is relatively small. Fortunately none of this functionality exists in 0.20 so we can do whatever we want and the current design for 0.21 is flexible enough to cover all that you request here relatively simply.
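The row-index idea in the last bullet can be sketched in plain Julia as follows. This is only an illustration under simplifying assumptions, not the DataFrames.jl implementation: vectors stand in for columns, the names `rowidx`, `grouped_idx` and `restored` are made up for the example, and the grouped order is produced with `sortperm` rather than with the package's internal `idx` field.

```julia
using Statistics  # standard library, provides `mean`

id = ["b", "a", "b", "a", "c"]
x  = [1.0, 2.0, 3.0, 4.0, 5.0]

# Step 1: remember every row's position, like a virtual `:__dummy__` column
# that holds the `axes(df, 1)` values.
rowidx = collect(eachindex(x))

# Step 2: process rows group by group. After this step the results are in
# "grouped" order, not in the original row order.
perm = sortperm(id)
grouped_idx = rowidx[perm]      # the retained row-index column
grouped_id = id[perm]
grouped_val = Vector{Float64}(undef, length(x))
for g in unique(grouped_id)
    rows = findall(==(g), grouped_id)
    grouped_val[rows] .= mean(x[grouped_idx[rows]])
end

# Step 3: scatter the results back through the retained indices.
# This is a single pass over the rows, so it is linear in the number of rows.
restored = Vector{Float64}(undef, length(x))
restored[grouped_idx] = grouped_val
```

Because every group has to hand back its row indices alongside its result, a group that changed its length would be detected, which corresponds to point a) above.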
An additional comment based on the discussion with @nalimilan on Slack (this is my understanding of things based on the input from @matthieugomez, @pdeffebach and @nalimilan + my own opinions):
Q1: Why do we not allow passing `GroupedDataFrame` to `select` and `transform`?
A1: Because `select` and `transform` have an invariant that they always return `nrow(df)` rows and they are in the order of rows in `df`. The problem is that `GroupedDataFrame` has `sort` and `skipmissing` kwargs that respectively reorder and remove rows from `df`. Also `GroupedDataFrame` can be subsetted/reordered. Theoretically we could define what consequences reordering and subsetting should have on the result of `select` taking such a `GroupedDataFrame` but such rules would be complex, and for sure not easily grasped by the regular user of DataFrames.jl. Also - if we really find it useful in the future, support for `GroupedDataFrame` passed to `select` can always be added as it will be non-breaking.
Q2: Why do we need `collapse` as a separate function?
A2: Since `select` and `transform` have a very clear contract "always return `nrow(df)` rows and they are in the order of rows in `df`", adding a kwarg to them that would allow to choose if we want all rows or any number of rows would be a mental overload. It is much better to have a separate function that clearly signals that it can change number of rows. Now it is enough to have `combine` that is similar to `select`, as for `transform` it does not really make sense to have a "collapsing" behaviour (as we want to keep existing columns which have `nrow(df)` rows anyway)
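A rough way to picture the difference between the two contracts is the sketch below, written in plain Julia on a single column. The helpers `apply_keep_rows` and `apply_collapse` are hypothetical and not part of DataFrames.jl, and the choice to spread a scalar result over all rows is an assumption taken from the tables in this thread.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical "keep rows" helper: the result must have one value per input row.
function apply_keep_rows(f, col::AbstractVector)
    res = f(col)
    # Assumption: a scalar result is spread over all rows.
    res isa AbstractVector || return fill(res, length(col))
    length(res) == length(col) ||
        throw(DimensionMismatch("expected $(length(col)) rows, got $(length(res))"))
    return res
end

# Hypothetical "collapse" helper: any number of rows is accepted.
function apply_collapse(f, col::AbstractVector)
    res = f(col)
    return res isa AbstractVector ? res : [res]
end

col = [1.0, 2.0, 3.0]
spread = apply_keep_rows(mean, col)               # three rows, all equal to the mean
shifted = apply_keep_rows(v -> v .+ 1, col)       # three rows
collapsed = apply_collapse(mean, col)             # one row
truncated = apply_collapse(v -> v[1:2], col)      # two rows, allowed here
```

Passing `v -> v[1:2]` to `apply_keep_rows` would throw instead, which is the kind of clear signal a separate function name is meant to give.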
Q3: Why do I prefer `select` and `transform` to have a `by` keyword argument rather than define `selectby` and `transformby` functions?
A3. This is a tough decision. I was thinking about it and I think that it is better not to pollute the namespace with too many functions. `by` will be a kwarg, so it should be clear enough what the behaviour is (this is like with `select` in SQL, and actually better than syntax in `dplyr` where you have to know what was the type of the input to know how the function will work. Here we will have a clear visual signal of the `by` keyword that we are working in groups).
Q4. Why do we keep `combine` and `map` for `GroupedDataFrame`?
A4. First of all - they are useful. You first "tweak" your `GroupedDataFrame` (like subsetting, reordering etc.) and then call them to get a desired result. Also it is better to be backward compatible than be breaking if there is no clear benefit of being breaking. Also `map` and `combine` have a bit different API that in particular allows for `do`-syntax, which again is useful if we are in "collapse" mode.
Q5. Should we keep `by` or deprecate it and instead define `collapse` with a `by` keyword argument?
A5. After thinking my preference is to deprecate `by`. The reason is to reduce the name-space pollution, especially by such a short function name. The deprecation would go:
- `by(arg, df, key, kwargs...)` is deprecated to be `combine(arg, groupby(df, key, kwargs...), kwargs...)`
- `by(df, key, arg, kwargs...)` is deprecated to be `combine(groupby(df, key, kwargs...), arg, kwargs...)`
- `by(df, key, args..., kwargs...)` is deprecated to be `collapse(df, args..., by=key, kwargs...)`

In particular this will resolve the problem I have always had that `by(df, :x1, :x1, :x2)` is a bit hard to parse - you have to remember that the second positional argument is the key, but in the context `by(fun, df, :x1)` actually the third positional argument is the key.
In summary the table we discuss would be

expected result | ungrouped operation | grouped operation |
---|---|---|
retain number and order of rows, drop old columns | select(df, :x => mean) | select(df, :x => mean, by=:id) |
retain number and order of rows, keep old columns | transform(df, :x => mean) | transform(df, :x => mean, by=:id) |
any number of rows, order of rows determined by the way you perform grouping (sort and skipmissing kwargs in particular), drop old columns | collapse(df, :x => mean) | collapse(df, :x => mean, by=:id) + old combine and map applied to GroupedDataFrame + deprecate by |

Additionally we support the `keepkeys` kwarg for all functions (determining if grouping columns should be retained or not). This kwarg should also be added to `DataFrame` working on `GroupedDataFrame`.
Now the important thing is that these definitions have a very nice feature that the left column (ungrouped operation) is exactly the right column (grouped operation) when `by=[]` (which will be the default) as `by=[]` essentially creates one group containing a whole data frame. I think it is a nice symmetry showing that the design is consistent.
Please up- or down-vote this proposal. I am now personally convinced to go this way so if no opposing voices are raised I will make a PR implementing this (fortunately it is relatively easy) when #2199 is merged so that we can include it in 0.21 release.
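The symmetry can be illustrated with a small plain-Julia sketch. `spread_mean` is a hypothetical helper written for this example (it is not DataFrames.jl API), and an empty tuple plays the role of `by=[]`.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical helper: spread the per-group mean of `x` over all rows.
# `by` is a tuple of zero or more grouping vectors.
function spread_mean(x::AbstractVector; by::Tuple = ())
    # A row's group label is the tuple of its key values. With no keys every
    # row gets the empty tuple `()`, so the whole input forms a single group.
    labels = [map(k -> k[i], by) for i in eachindex(x)]
    out = Vector{Float64}(undef, length(x))
    for g in unique(labels)
        rows = findall(==(g), labels)
        out[rows] .= mean(x[rows])
    end
    return out
end

id = [1, 2, 1, 2, 1]
x = [1.0, 10.0, 2.0, 20.0, 3.0]

ungrouped = spread_mean(x)                 # same code path, one group
grouped = spread_mean(x; by = (id,))       # one group per distinct id
```

The ungrouped call needs no special case: it is the grouped code running with zero keys.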
Well I think it's terrific. Thanks a lot @bkamins. (also @jmboehm may be interested).
I have limited experience with the current interface, but for what it's worth, I think both the request and the proposal make a lot of sense.
- [...] `by` function, mainly because it's not clear what it does by itself. All other functions that come to mind here are verbs. Having `by` as kwarg reads more naturally to me.
- [...] `keepkeys` be `true` by default, at least for those where the contract is that columns are being preserved (but my preference would be to keep them everywhere by default, it's more prudent).
- [...] `select` does more than selecting columns. When I looked at the package for the first time, I was very confused what `select(df, :a => :c)` means. In my view, the most obvious interpretation when you have only a high school math background is that I'm selecting something based on the logical condition "a implies c". It's a more minor thing for me because I can just use `select` for the things I think it should do, but my concern would be that newcomers find the syntax hard to read. (Or perhaps I'm just generally not very fond of "=>". What does that mean? "implies", "is mapped to", or an assignment operation?)

I'm not entirely sure how we are meant to perform operations on subsets of the rows, without deleting the other rows. This is one of the issues I've been facing when implementing a Stata-like interface. But perhaps that's an issue for another day.
Thanks a lot to everyone involved!
@bkamins I think this is a great idea. I still maintain a dislike for keyword arguments.
I have implemented, in the crudest way possible, a non-keyword argument version of all this in a PR here. I use Stata-esque names to avoid name conflicts with existing functions. I think everyone in this thread will be able to understand what each function does.
I think that we should deprecate `by` with the `Pairs` argument. However I think
by(df, :a) do sdf
    f(sdf)
end
is still a very powerful and convenient syntax.
For reference, here is my list of new functions implemented in #2210
- `gen`: makes new column, keeps old columns, preserves `nrow`
- `genby`: makes new column by group, keeps old columns, preserves `nrow`
- `keep`: makes new column, deletes old columns, preserves `nrow`
- `keepby`: makes new column by group, deletes old columns, preserves `nrow`
- `collapse`: makes new column, deletes old columns, `nrow` is not preserved (returns 1 row data frame)
- `collapseby`: makes new column by group, deletes old columns, `nrow` is not preserved
- `agggen`: takes in grouped data frame, makes new columns, keeps old columns, returns data frame with same `nrow`
- `aggkeep`: same as `agggen` but deletes old columns
- `aggcollapse`: takes in grouped dataframe, collapses.

@jmboehm Can you file an issue about replicating Stata's `if` syntax? I have some ideas, but we should keep discussion focused on this subject.
Just a few comments regarding @bkamins' detailed proposal. Overall I like it, I just have a few reservations:
- [...] `combine` and `collapse`, with quite similar behaviors. Keeping the API simple and without ambiguous names is essential. I suggest we use a single function, which could be `combine` or use another name. The name "collapse" implies you retain a single summary statistic (like in Stata), so I don't really like it for a function which allows returning any number of rows. dplyr's `summarize` is better I think (you can "summarize" a group of rows into a different group of rows), though probably not ideal (but note that dplyr now allows returning multiple rows). `combine` has the advantage of not breaking existing code, even if it sounds a bit weird for a single `DataFrame`. Anyway the name can be discussed after choosing the design.
- `select` and `transform` should work on `GroupedDataFrame`, so that you can write operations in steps (like in dplyr). It's better in general to provide composable objects and functions. We can restrict them to the case where no groups have been skipped and they are not sorted (at least as a first step).
- [...] `by` keyword arguments to `select` and `transform` sounds convenient, if we allow these operations on `GroupedDataFrame` they are just syntactic sugar to avoid calling `groupby`, so I'd be inclined to leave these out for now until we have settled the rest of the API. They can be added at any time, contrary to breaking changes.

@jmboehm thank you for your thoughts. Here are my comments to the questions you have raised.
> Or perhaps I'm just generally not very fond of "=>". What does that mean? "implies", "is mapped to", or an assignment operation?

So `:a => :b` is a shorthand for `:a => identity => :b`.
And `:a => fun => :b` means pass column `:a` to `fun` and store the result in `:b`.
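In plain Julia, `=>` simply builds a `Pair` and is right-associative, so the three-part form is a pair whose second element is itself a pair. The sketch below shows this and a toy interpreter for such a specification; `apply_spec` is a hypothetical helper operating on a `NamedTuple` of columns, not the DataFrames.jl implementation, and it handles only the full three-part form.

```julia
# `=>` is right-associative: this is `:a => (identity => :b)`.
spec = :a => identity => :b

source_col = spec.first          # :a
fun, target_col = spec.second    # a Pair can be destructured like a tuple

# Hypothetical mini-interpreter: read column `source`, apply `fun`,
# store the result under `target`.
function apply_spec(cols::NamedTuple, spec::Pair)
    source = spec.first
    fun, target = spec.second
    newcol = fun(cols[source])
    return merge(cols, NamedTuple{(target,)}((newcol,)))
end

cols = (a = [1, 2, 3],)
copied = apply_spec(cols, spec)
doubled = apply_spec(cols, :a => (v -> 2 .* v) => :b)
```

Note the parentheses around the anonymous function in the last line: without them the `->` body would swallow the trailing `=> :b`.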
> how we are meant to perform operations on subsets of the rows, without deleting the other rows

use a view
EDIT - oh - I understand you want to do some operation conditionally on some other column? Something like `[:a, :b] => (a, b) -> ifelse.(a .< 0, b, 0)`?
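Both suggestions can be tried on plain vectors, without any data frame. This is only a sketch of the idea; the variable names are illustrative.

```julia
a = [-1, 2, -3]
b = [10, 20, 30]

# Option 1: a conditional function. Every row is kept; rows that fail the
# condition get a fallback value (here 0).
cond = (a, b) -> ifelse.(a .< 0, b, 0)
masked = cond(a, b)

# Option 2: a view onto the selected rows. Writes go through to the parent
# vector, so the other rows are left untouched rather than deleted.
c = copy(b)
sel = view(c, a .< 0)
sel .*= 2
```

In both cases the number of rows is unchanged, which is what distinguishes this from filtering.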
@nalimilan thank you for this feedback. I think there is a consensus around this kind of proposal, unless there are truly insurmountable edge cases.
Do you imagine `transform(gd::GroupedDataFrame)` to return a grouped data frame? Perhaps a keyword argument to all 3 functions (`keep`, `gen`, and `collapse` for now), which preserves grouping and does not return a data frame.
I have created #2211 for discussion about `if` syntax and subsetting.
I'll just give my two cents about kwargs vs type, but, in any case, I'm happy with either proposal.
On the one hand, I like @nalimilan's proposal because it keeps everything super simple.
On the other hand, I'm worried that `GroupedDataFrame` is so different from `DataFrame` that it makes things too complicated. If you end up, in a future version of DataFrames, allowing a `by` kwarg in `transform`, then the whole syntax becomes more complicated in the end.
Is it worth thinking about defining a `GroupedDataFrame` type that is much more similar to `DataFrame`, as in `dplyr`? The current `GroupedDataFrame` type could then be renamed to `GroupedDataFrameIterator`, with a function `eachgroup(df::GroupedDataFrame) -> GroupedDataFrameIterator`
I guess something I don't understand in @nalimilan's proposal is whether the output of `transform`/`select` of a GroupedDataFrame is a GroupedDataFrame or a DataFrame.
@nalimilan - thank you for the comment. I will summarize the possible design under these circumstances in the post that follows (including @pdeffebach's comment about returning a `DataFrame` or a `GroupedDataFrame`).
Here let me add some more general comments.
@matthieugomez - I think @nalimilan wants to give an option to choose what should be the output. I also comment on this below.
Just as a note to @pdeffebach (rephrased what I commented in #2210). In DataFrames.jl we want to provide a minimal set of functions that provide the required functionality. Therefore if we end up with the design in which:
by(df, :a) do sdf
    f(sdf)
end
will be possible to achieve in a different and easy enough way we will probably remove it. I know that in the past DataFrames.jl did not always follow this rule strictly and everywhere but as we go for 1.0 this is needed:
For example in #2211 I assume we will clarify what functionality is needed. If it is easily achievable when composing current functionality then probably we will not add it, but if we decide that it is very hard to achieve it without adding something to the "core" of the package then we will add it.
The issue about what `GroupedDataFrame` is is a focal point here.
It is also related to https://github.com/JuliaData/DataFrames.jl/issues/2106 as this issue should be resolved also (e.g. in theory someone might want to expand a 0-row group into something).
Given the number of decisions that are to be made I start getting a feeling that we will not be able to resolve all issues in 0.21 release in such a way that we will not be breaking later. So maybe we will have to decide to allow for breaking changes between 0.21 release and 1.0 release.
What I think the consensus is is what we need for operating on a `DataFrame`. And this for sure can "go into" 0.21 release. If we decided to stay with this we would for the time being leave `by` and `combine` in 0.21 release as legacy and announce that after 0.21 release a redesign of split-apply-combine infrastructure will be done. But maybe (and hopefully) we will quickly settle on the functionality for split-apply-combine part. Then it can also go into 0.21 release.
So the functionality for data frame seems to require three functions, all of them take a data frame and return a `DataFrame`:
- `select` (and I assume there is a consensus on this)
- `transform` (and I assume there is a consensus on this)
- [...] `collapse`, `combine`, `summarize`, `apply`

In this thread of the discussion I think we can concentrate on finding an appropriate name for the third function.
The functionality of the grouped data frame is more complex. The starting point is what @matthieugomez said about adding another type (a la dplyr). The good thing is that currently we do not allow to change `GroupedDataFrame` in-place (but only to generate a new `GroupedDataFrame` based on it). If we kept this approach then we could add a second parameter to a `GroupedDataFrame` that would signal if it is in canonical form (not reordered nor subsetted) or modified in some way (this should be enough for our purposes I think - but maybe I do not see something so please comment). The benefit of such an approach is that we would not redesign everything from scratch.
Below I assume that we will settle on the `combine` name in the "data frame passed" case.
Then following the proposal of @nalimilan:
select
, transform
and combine
with functionalities as defined above (actually combine
then can be left as it is implemented now).select
, transform
(at least for now) will accept only GroupedDataFrame
in cannonical form (i.e. not subsetted nor reordered), and combine
will accept any GroupedDataFrame
.by
keyword argumentby
function will be also removedkeepkeys
keyword argument with the current meaningkeepgrouped
keyword argument (name is tentative) which if true
means that they return a cannonical GroupedDataFrame backed by a freshly allocated DataFrame
and otherwise they return a DataFrame
DataFrame
will get a keepkeys
keyword argumentSo essentially the difference in comparison to the earlier proposal is dropping by
keyword argument and instead dispatching on type and adding one extra keyword argument that will govern the type of the return value.
I have updated #2210 with the proposal described by Bogumil above, still with the names `gen`, `keep`, and `collapse`. I think it's a very good proposal! I would be very happy to work with it and would trust myself to promote it to new users!
I don't want to complicate the issue, but at some point, I guess, there will be `!` versions of `select`, `transform`, `collapse` (after 1.0 ;)). Could you expand on how this would work? For instance, could we make sure that the proposal would, in the end, allow creating a new column with the mean of `:x` within groups defined by `:id` without having to duplicate the original DataFrame?
Tbh, I don't really see an issue with using `transform`/`select` on a non-canonical grouped DataFrame. The thing that is important, I think, is that `transform`/`select` do not sort or remove rows between the argument and the output.
What I was suggesting is to create a type `GroupedDataFrame` that inherits from `AbstractDataFrame`. This would retain the composability of @nalimilan's proposal without having to transform DataFrames into things that are not DataFrames when computing means by group. The current behavior of `GroupedDataFrame` could still be obtained by `eachgroup(df::GroupedDataFrame)`, the same way `graphemes(x::String)` allows for a different kind of iteration on `String`s in Base.
> I guess, there will be `!` versions of `select`, `transform`, `collapse`

For a `DataFrame` argument yes - and this is already implemented.
For `GroupedDataFrame` this is problematic. Let me give a quick example of an R session:
> df <- data.frame(g=c(1,1,1,2,2),x=1:5)
> gdf <- group_by(df, g)
> summarize(gdf, mean(g), mean(x))
# A tibble: 2 x 3
g `mean(g)` `mean(x)`
<dbl> <dbl> <dbl>
1 1 1 2
2 2 2 4.5
> gdf$g[1] <- 3
> summarize(gdf, mean(g), mean(x))
# A tibble: 2 x 3
g `mean(g)` `mean(x)`
<dbl> <dbl> <dbl>
1 1 1.67 2
2 2 2 4.5
I do not like this design.
In DataFrames.jl we are clear that `GroupedDataFrame` is a view, and views by definition assume that their parent is not mutated. Also note that we have `SubDataFrame`s in DataFrames.jl and a `GroupedDataFrame` could be defined on such an object.
So - in short:
- with the `by` argument these `!` functions can be provided as we do not expose `GroupedDataFrame` to the end user.
- in the proposal by @nalimilan with `GroupedDataFrame` passed around these functions are problematic to provide. What you would probably do is generate a 1-column data frame and in the second step add this column to the original data frame

> I don't really see an issue with using transform/select on a non-canonical grouped `DataFrame`.

Well take this `GroupedDataFrame`:
julia> df = DataFrame(g=[3,1,2,3,1,2], x=1:6)
6×2 DataFrame
│ Row │ g │ x │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 1 │
│ 2 │ 1 │ 2 │
│ 3 │ 2 │ 3 │
│ 4 │ 3 │ 4 │
│ 5 │ 1 │ 5 │
│ 6 │ 2 │ 6 │
julia> gdf = groupby(df, :g)
GroupedDataFrame with 3 groups based on key: g
First Group (2 rows): g = 3
│ Row │ g │ x │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 1 │
│ 2 │ 3 │ 4 │
⋮
Last Group (2 rows): g = 2
│ Row │ g │ x │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 2 │ 3 │
│ 2 │ 2 │ 6 │
julia> gdf2 = gdf[[3,2]]
GroupedDataFrame with 2 groups based on key: g
First Group (2 rows): g = 2
│ Row │ g │ x │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 2 │ 3 │
│ 2 │ 2 │ 6 │
⋮
Last Group (2 rows): g = 1
│ Row │ g │ x │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 1 │ 5 │
What should be the result of `select(gdf, :x => identity => :x)` and `select(gdf2, :x => identity => :x)`?
> What I was suggesting is to create a type `GroupedDataFrame` that inherits from `AbstractDataFrame`.

I understand this, but could you please specify exactly how this type should behave (should it be a view or a copy, should it precompute groups on creation and store them or only carry over information about grouping columns but not store them, should it have a special column type for grouping columns that makes them read only, should you be allowed to add rows to it, should you be allowed to add columns to it, should you be allowed to remove columns from it, etc. - probably the list here is longer, but these are probably the major questions). Maybe you have answers for these questions and then we can quickly move forward. If not, and if you find the current design (trying to slightly modify the current `GroupedDataFrame` only, without designing a new type) insufficient, then what you propose should be discussed, but it is probably several months of heavy design work (which is OK, but it means that we go for the option: release 0.21 without significantly modifying the split-apply-combine ecosystem, and work on the breaking changes after that release).
> in the proposal by @nalimilan with GroupedDataFrame passed around these functions are problematic to provide. What you would probably do is generate a 1-column data frame and in the second step add this column to the original data frame

This does not sound good. I think it's a good reason to prefer kwargs.

Well, technically such a `select!` could work on a canonical `GroupedDataFrame{DataFrame}` but would have to be disallowed for `GroupedDataFrame{SubDataFrame}`. However, I found it a bit confusing (but now that I think of it, maybe this could be allowed with this distinction).
Let me summarize the pros and cons of both proposals:
Using kwarg `by`:
- […] `DataFrame` which you are normally sure is a type that is stand-alone and "owns" its columns (you could start with a `SubDataFrame`, but this is not a problem I think, as you will simply be notified that `!` methods do not work for it, and after the first step of transformations you have a `DataFrame` anyway); in particular `!` methods can be provided here without a problem
- `GroupedDataFrame` can be left "as is", there is no need to add a third special type
- […] `DataFrame` metadata, but this is a separate issue and it will be only a convenience you have to manually decide to use)
- […] `GroupedDataFrame` you can perform grouping once, and then do 100 `combine`s without having to pay for grouping); of course this only applies to functions with the `by` kwarg; you will still be able to do this with `combine` (and just remember that the other functions essentially will be wrappers for `combine` with some extra constraints)
- `combine` works on `GroupedDataFrame` as it currently does, but we need some other function name that takes a data frame and a `by` argument (currently this is our `by`, but it should be replaced by a name that matches the same operation without `by`)
- […] `DataFrame`s that is fully featured and one "small" flow for `GroupedDataFrame`s
- […] `GroupedDataFrame` by non mutating functions; so we would have both `select(::DataFrame, ..., by=...)` and `select(::GroupedDataFrame, ...)` and `select!(::DataFrame, ..., by=...)` but no `select!` for `GroupedDataFrame` (or only for `GroupedDataFrame{DataFrame}` as commented above); but it would seem to be a duplication of functionality.

Not using `by` kwarg, but allowing passing a grouped object to functions:
- […] `AbstractDataFrame`; I will not repeat the arguments from above - maybe using a `GroupedDataFrame` is enough if we introduce the distinction between "canonical" and "non-canonical" and differentiate the behavior between `GroupedDataFrame{DataFrame}` and `GroupedDataFrame{SubDataFrame}`. But the major CON here is that there are a lot of decisions to be made if we go this path

As I have written it down I feel I would prefer option 2 in the long term (as proposed by @nalimilan), but for it to work we need to work out the "grouped data frame" case.
You have not commented on the problems I see with `select` on "non-canonical" data frames and the problems I see with the current design of the result of `group_by` in dplyr. It would be valuable if you gave your opinion on it. But assuming you agree with me and we stay with the current `GroupedDataFrame` (which is easiest implementation-wise) then we could have:
- `select` and `transform` accept any "canonical" `GroupedDataFrame`;
- `select!` and `transform!` accept any "canonical" `GroupedDataFrame{DataFrame}`;
- `combine` (or other name if we decide to change it) accepts any `GroupedDataFrame`;
- `combine!` (a companion method to `combine` mutating the `parent`) would not be defined, as `combine` changes the number and order of rows so it does not make sense to mutate the `parent` with the output;

then the `select!` and `transform!` methods would mutate the `parent` of the `GroupedDataFrame` (which would automatically be reflected in the `GroupedDataFrame` as it always takes all its columns).
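As a minimal sketch of the "mutate the parent" idea, the snippet below assumes a released DataFrames.jl 1.x, where `transform!` on a `GroupedDataFrame` backed by a `DataFrame` adds the new column to the parent in place. The column names are made up for illustration.

```julia
using DataFrames, Statistics

# Parent data frame with groups interleaved on purpose.
df = DataFrame(id = [2, 1, 2, 1], x = [1.0, 2.0, 3.0, 4.0])
gdf = groupby(df, :id)

# Adds the per-group mean of x as a new column of `df` itself:
# no copy of the parent is made, and rows are neither reordered nor dropped.
transform!(gdf, :x => mean => :x_mean)
```

After the call `df` has three columns, and the original row order is untouched, which is what makes in-place mutation through the grouped view well defined.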
I was thinking about it today some more and my conclusion is the following:
- […] `select` or `transform`, which should be free to use by extension packages.

Here I write what conclusion I have for this high level API currently (but bear in mind that, given what I have just written, I think it is actually encouraged to design other high-level APIs that could be built on top of the "core" low level API):
We should provide 5 functions (in the scope of this issue):
- `select`, `select!`, `transform`, `transform!` and `combine`.
- they take either an `AbstractDataFrame` or a `GroupedDataFrame`
- a `GroupedDataFrame` can be in a "canonical" state or a "non-canonical" state; for now only `combine` would allow processing a `GroupedDataFrame` in non-canonical state (maybe in the future we can let other functions accept a non-canonical `GroupedDataFrame`, but this can be decided later - even if we do, I do not think it is crucially useful)

The signatures would be:
- `select(::AbstractDataFrame, args...; copycols)`: ensure we keep all rows, drop old columns; create a new `DataFrame`
- `select!(::AbstractDataFrame, args...)`: the same as above but modify the source data frame
- `transform(::AbstractDataFrame, args...; copycols)`: ensure we keep all rows, keep old columns; create a new `DataFrame`
- `transform!(::AbstractDataFrame, args...)`: the same as above but modify the source data frame
- `combine(::AbstractDataFrame, args...)`: any transformations, always copy cols, drop old columns; create a new `DataFrame`
- `combine(arg, ::AbstractDataFrame)`: the same but allow `arg` to be a function also (for consistency with the `GroupedDataFrame` version)
- `select(::GroupedDataFrame, args...; copycols, keepkeys, regroup)`: ensure we keep all rows in order, drop old columns; accept only a canonical `GroupedDataFrame`
- `select!(::GroupedDataFrame{DataFrame}, args...; keepkeys, regroup)`: the same as above but modify the parent `DataFrame`
- `transform(::GroupedDataFrame, args...; copycols, keepkeys, regroup)`: ensure we keep all rows in order, keep old columns; accept only a canonical `GroupedDataFrame`
- `transform!(::GroupedDataFrame{DataFrame}, args...; keepkeys, regroup)`: the same as above but modify the parent `DataFrame`
- `combine(::GroupedDataFrame, args...; keepkeys, regroup)`: current `combine`, except that if one `args` is passed we are not flexible (we thought it is OK to be flexible and allow the same as in `combine(arg, ::GroupedDataFrame; keepkeys, regroup)`, but it would be inconsistent with `select` and `transform` so we disallow it - still the functionality is available with `arg` in first position)
- `combine(arg, ::GroupedDataFrame; keepkeys, regroup)`: current `combine`

`regroup` can be `true`, in which case a canonical `GroupedDataFrame` is returned; if `false` (the default) a `DataFrame` is returned. We allow for this kwarg because it is more efficient to regroup immediately (we know how to group without having to compute the grouping again). If `regroup=true` we throw an error if `keepkeys=false` (it does not make much sense otherwise).
Other kwargs have current meaning.
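For readers comparing this proposal with what eventually shipped, here is a small sketch assuming a released DataFrames.jl 1.x. Note the assumption: there the keyword that controls the return type is `ungroup` (default `true`, returning a `DataFrame`), not the `regroup` name proposed above, so `ungroup=false` plays the role of `regroup=true`. Column names are illustrative.

```julia
using DataFrames, Statistics

df = DataFrame(g = [3, 1, 2, 3, 1, 2], x = 1:6)
gdf = groupby(df, :g)

# combine: one row per group, returned as a DataFrame by default.
per_group = combine(gdf, :x => mean => :x_mean)

# Same computation, but keep the result grouped
# (the counterpart of `regroup=true` in the proposal).
still_grouped = combine(gdf, :x => mean => :x_mean; ungroup = false)

# keepkeys=false drops the grouping column from the result.
no_keys = combine(gdf, :x => mean => :x_mean; keepkeys = false)

# select: keeps every row of the parent, in the parent's order;
# the per-group mean is repeated within each group.
same_rows = select(gdf, :x => mean => :x_mean)
```

The order of groups in `per_group` depends on how `groupby` ordered them, so code that relies on it should sort explicitly.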
Note that for `select!` and `transform!` on a `GroupedDataFrame` it does not matter much what we set for `regroup`, as both the parent `DataFrame` and the `GroupedDataFrame` passed will be modified.
Now, why I opt for the `combine` name for the last operation. First - it is non-breaking. Second, we will read it as "combine rows", which makes some sense (none of the functions we considered were ideal).
Given this design both `by` and `map` for `GroupedDataFrame` would be deprecated (`by` duplicates functionality that is easy to get otherwise, `map` will be just `combine` with `regroup=true`). In this way we are free to decide what the use of `map` would be.
Finally, `DataFrame` on a `GroupedDataFrame` would get a `keepkeys` kwarg.
Sorry that the posts are lengthy, but we are designing a complex ecosystem and details matter; I hope I have not mixed up something in the descriptions. I hope this design is something you find acceptable and useful. If yes, after #2199 I would go forward to implement it.
Also as I have noted above after 0.21 I would go forward to decouple all this from the low-level API (that would be moved to DataFramesBase.jl) to allow other high-level APIs to be implemented (this one would be just a reference implementation - still with the aim to be useful).
Honestly I don’t know enough about how people use groupeddataframes to bring substantive points to this discussion (kwarg vs gdf). That being said, the final proposal looks good to me. Thanks a lot — especially for considering making transform! work with grouped data frames.
+1 to all of this. I can't comment on the contract about mutating the parent of a grouped data frame, but if that's surmountable it would be great. I went over your R example and also find the behavior unintuitive. Perhaps we can disallow modifying a key column.
Thank you everyone for their detailed proposals and thoughtful comments.
> the contract about mutating the parent of a grouped data frame

It is "surmountable", as you say, for a "canonical" `GroupedDataFrame`, as it is a 1 to 1 mapping to the parent data frame (no rows are removed/reordered) and fortunately `GroupedDataFrame` does not subset columns. Finally, `select!` and `transform!` guarantee not to reorder/subset rows. This means that if we modify the parent the `GroupedDataFrame` will still be valid (this prompts me that probably `keepkeys` should be disallowed in `select!` and `transform!` and it should always be `true` (in other words - if you want to do a `!` operation on a `GroupedDataFrame` you are not allowed to remove grouping columns from the parent), as otherwise the derived `GroupedDataFrame` would be invalidated - I guess it is not a problem and is an intuitive case).
Taking all this into consideration, the only risk is that some other `GroupedDataFrame`s would be backed by the same parent but with different grouping columns - and that other `GroupedDataFrame` would get invalidated, but I think we can be explicit enough in the documentation to warn users about this case.
Note that this is a different situation than the one we discuss in #2211. The problem is that if you add a column to a `SubDataFrame` with some name, this column can potentially exist in the parent data frame, so there would be a conflict (still this also is fixable, but first I would like to understand in #2211 if this is really needed to be added).
The general thinking is that a `view` should not modify its `parent` (if possible), as there are potentially other `view`s based on the same `parent` that might get invalidated and it is easy to forget about it (but as I have said - for `GroupedDataFrame` I think we can allow this as this is a very specific use case - as opposed to `setindex!` in #2211 which is a very fundamental operation).

> Perhaps we can disallow modifying a key column.

In the proposed design the problem that dplyr has does not exist (we explicitly check if the grouping column has remained unchanged in `combine`, which will still be the workhorse of the whole solution). We would have this problem if we created a new type `DataFrameWithGroups` (call it tentatively) that would be an `AbstractDataFrame` but with information about grouping columns. If we went this way (we currently do not, but some "extra" package is free to define it) then grouping columns should probably be stored using https://github.com/bkamins/ReadOnlyArrays.jl. But this design has some problems: when we make an array read only we lose its type information, Julia does not currently allow multiple inheritance, and most array types are not trait based. Chiefly `CategoricalArray` and `PooledArray` would be a problem, and they are important for performance reasons. Again - these cases could be worked around, but I did not want to overly complicate things on top of the current design.
Thank you all for discussing this.
Sounds like a good plan! I think discussions about special cases can be handled later, like what to do with a "non-canonical" `GroupedDataFrame` (in which groups have been dropped or reordered).
Regarding the idea of having `GroupedDataFrame <: AbstractDataFrame`, that was my goal originally. But I asked Hadley Wickham and he said he would have liked to change this in dplyr so that `group_by` returns an object which is not a `data.frame`. Then I realized that we don't really need `GroupedDataFrame` to behave like a `DataFrame` for most basic operations: for example, it's not very useful to have `gdf.col` return its parent's column since that doesn't help you to perform by-group operations. OTOH what is definitely useful is to have `select` and `transform` work like for `DataFrame`, but operating by groups (what we are discussing here).
@matthieugomez Do you see something in particular that having `GroupedDataFrame <: AbstractDataFrame` would allow that wouldn't be possible otherwise?
@nalimilan I'm not sure. My initial reaction is that I find it confusing to do `gdf=groupby(df)`, see something in the REPL that looks completely scrambled, but then have `transform!(gdf)` return the original `df` with an added column. But maybe I'm wrong and it is not that confusing.
As a side note, `groupby(df)` is not allowed.
Now - the issue you report is printing related. The original thinking (not mine - it was implemented long before I started working on DataFrames.jl) was that it is more useful to show groups in a `GroupedDataFrame` rather than one table just with e.g. information about grouping columns and the number of groups. But we can change it. If you have some better proposal for `show` please open a separate issue as it is only output related.
@bkamins My point about printing was really not intended to be a proposal, just an answer to @nalimilan's question. I am really not knowledgeable enough to have a well formed proposal about GroupedDataFrames.
That being said, if we are all still hesitant about how to work with `GroupedDataFrame`s, and you want `DataFrames` to be 1.0 soon, maybe you could consider separating the type `GroupedDataFrame` in a different package (together with its methods for `select`/`transform`/`combine`).

> maybe you could consider separating the type `GroupedDataFrame` in a different package

This is exactly the plan, these functions would not go to DataFramesBase.jl.

> if we are all still hesitant

Well - I am not hesitant, though I understand that different users might find different things useful. Note however, that there is little value added in making `GroupedDataFrame` a subtype of `AbstractDataFrame` (you can always use its `parent` to do whatever you like). Meanwhile, the key benefit of `GroupedDataFrame` is a fast lookup of groups with a convenient interface for it:
julia> df = DataFrame(a=repeat(1:3, 4), b=repeat(1:2, 6), c = 1:12)
12×3 DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │ 2 │
│ 3 │ 3 │ 1 │ 3 │
│ 4 │ 1 │ 2 │ 4 │
│ 5 │ 2 │ 1 │ 5 │
│ 6 │ 3 │ 2 │ 6 │
│ 7 │ 1 │ 1 │ 7 │
│ 8 │ 2 │ 2 │ 8 │
│ 9 │ 3 │ 1 │ 9 │
│ 10 │ 1 │ 2 │ 10 │
│ 11 │ 2 │ 1 │ 11 │
│ 12 │ 3 │ 2 │ 12 │
julia> gdf = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = 1, b = 1
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
│ 2 │ 1 │ 1 │ 7 │
⋮
Last Group (2 rows): a = 3, b = 2
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 3 │ 2 │ 6 │
│ 2 │ 3 │ 2 │ 12 │
julia> gdf[(a=2, b=1)]
2×3 SubDataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 1 │ 5 │
│ 2 │ 2 │ 1 │ 11 │
If we wanted `GroupedDataFrame` to be a subtype of `AbstractDataFrame` this interface would be problematic, and a fast lookup by key is a functionality that is very useful in many contexts so it should be easily accessible.
As a note, `gdf[(a=2, b=1)]` is not only convenient (as this is subjective) but it is also very fast (as fast as `Dict` lookup in Base). To be honest - I am not really clear how you can achieve it quickly (i.e. in a way that is fast) in dplyr - if you know and could comment on it, it would be an interesting comparison.
I've followed a bit the recent updates of this package. This is very impressive — thanks @bkamins and @nalimilan for all your work.
I have one comment about the syntax. An operation such as computing the mean of a variable in a dataframe can be classified along two dimensions: (i) whether the new dataframe has the same size as the original dataframe, and (ii) whether it is a by operation or not.
dplyr and stata make it very easy to alternate across these two dimensions.
in dplyr
In stata:
This is very neat. There is a symmetry between top vs bottom, and left vs right. People can understand what these commands do just by reading them
The current syntax of Dataframes.jl is not as neat IMO. On the current master, we have:
I wish DataFrames.jl would follow the example of dplyr and stata here. For instance, a nice syntax could be the following:
In short, my suggestion would be to (i) remove the function `by`, (ii) allow a `by` kwarg in `transform`, and (iii) define a new function that would be the Julia equivalent of summarize (dplyr) / collapse (stata).