JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.74k stars 367 forks source link

Cleaner syntax #2206

Closed matthieugomez closed 4 years ago

matthieugomez commented 4 years ago

I've followed a bit the recent updates of this package. This is very impressive — thanks @bkamins and @nalimilan for all your work.

I have one comment about the syntax. An operation such as computing the mean of a variable in a dataframe can be classified along two dimensions (i) whether the new dataframe as the same size as the original dataframe (ii) whether it is a by operation or not.

dplyr and stata make it very easy to alternate across these two dimensions.

in dplyr

mutate(df, mean(x)) mutate(group_by(df, id), mean(x))
summarize(d, mean(x)) summarize(group_by(df, id), mean(x))

In stata:

egen x = mean(x) egen x = mean(x), by(id)
collapse (mean) x collapse (mean) x, by(id)

This is very neat. There is a symmetry between top vs bottom, and left vs right. People can understand what these commands do just by reading them

The current syntax of Dataframes.jl is not as neat IMO. On the current master, we have:

select(df, :, :x => mean) by(df, :id, :, :x => mean)
select(df, :x => mean) by(df, :id, :x => mean)

I wish DataFrames.jl would follow the example of dplyr and stata here. For instance, a nice syntax could be the following:

transform(df, :x => mean) transform(df, :x => mean, by = :id)
aggregate(df, :x => mean) aggregate(df, :x => mean, by = :id)

In short, my suggestion would be to (i) remove the function by (ii) allow by kwarg in transform (ii) define a new function that would be the Julia equivalent of summarize (dplyr)/ collapse (stata).

bkamins commented 4 years ago

This is essentially https://github.com/JuliaData/DataFrames.jl/issues/2172, which will be added later (as it is non-breaking). We leave by mainly for backward compatibility reasons.

As for mutate vs summarize - why do you see that this distinction is needed? If I understand the functionality there correctly we already provide it with a single function (currently select/transform and combine/by).

EDIT

Just to expand transform(groupby(df, :col1), :col2 => mean) will keep the result a GroupedDataFrame (this is also what dplyr does), while by will transform it to DataFrame as it does now. So the API will be minimally different.

bkamins commented 4 years ago

So - to summarize - unless you have some comment why we should distinguish "single row" vs "all rows" functionality (we currently do not and dplyr does, but I do not see a huge benefit of this distinction) then this issue can be closed and please comment in #2172 on the functionality select/transform should have for GroupedDataFrame (they key discussion is row order of the output as you will note there).

matthieugomez commented 4 years ago

Yes, I think different methods should do single rows vs all rows. The current syntax for the top right quadrant vs bottom right quadrant is not good. It’s very important to separate between the two when doing data wangling.

matthieugomez commented 4 years ago

Also, even looking only at the top quadrant. by vs transform is not satisfying. I would rather have: transform(df::DataFrame, :x => mean) -> Dataframe transform(df::DataFrame, :x => mean, by = :id) -> Dataframe transform(df::GroupedDataFrame, :x => mean) -> GroupedDataframe

bkamins commented 4 years ago

It’s very important to separate between the two when doing data wangling.

Can you please comment on this more? Why it is so important? My understanding is that it is what the function you apply returns determines the shape of the output.

EDIT in particular this function can neither return 1 row nor that many rows as were in the original group but something completely different.

bkamins commented 4 years ago

I would rather have:

This is doable and essentially depends on whether we want to allow by argument in transform and can be added in the future.

matthieugomez commented 4 years ago

See this discourse for instance: https://discourse.julialang.org/t/dataframesmeta-jl-and-the-state-of-the-dataframes-ecosystem/36221/7

bkamins commented 4 years ago

So essentially you would want to extend keepkeys kwarg functionality in the following way:

Do I understand your request correctly?

matthieugomez commented 4 years ago

I would like by to be deprecated and split into two functions aggregate(..., by) vs tranform(..., by).. I dont think I have anything to say wrt keepkeys.

bkamins commented 4 years ago

but by(...) would not be the same as transform(..., by) but rather the same as select(..., by).

Also currently we cannot use aggregate, because we have to go through a deprecation period.

Finally then aggregate should be two functions - one pairing select and another pairing transform. What names would you use for both functionalities?

So that is why I am asking if it is OK with you if we achieve the functionality you describe using kwargs only that would be added to select and transform:

If we went this way then by and also combine should get deprecated. But adding by kwarg and the consequences probably will not happen in 0.21 release.

bkamins commented 4 years ago

just to add - what you write you want is currently available by writing (disregarding some corner cases for a while not to complicate the discussion):

by(df, :key, :key, :some_col => mean)

simply we would make by(df, :key, :some_col => mean, keepkeys=true) to expand to that form.

matthieugomez commented 4 years ago

I tried to make my point without talking about whether or not you want to keep all other columns. With this added complication, the full table in dplyr is:

transmute(df, mean(x)) transmute(group_by(df, id), mean(x))
mutate(df, mean(x)) mutate(group_by(df, id), mean(x))
summarize(d, mean(x)) summarize(group_by(df, id), mean(x))

Note that there is no need for summarize that keeps all existing columns (not sure what you mean by "aggregate should be two functions...").

I am suggesting:

select(df, :x => mean) select(df, :x => mean, by:id)
transform(df, :x => mean) transform(df, :x => mean, by = :id)
aggregate(df, :x => mean) aggregate(df, :x => mean, by = :id)

So in short, I would like by to deprecated, and split between select(..., by) and aggregate(.., by) (the name aggregate is just an example, it can be any other name.). As you say, transform can get a by argument later.

bkamins commented 4 years ago

I have to stop now, I will comment on it later, but the crucial part is select(df, :x => mean) which currently in DataFrames.jl will be the same as aggregate(df, :x => mean) not transmute(df, mean(x)).

The key thing is that we accept any length of the output (not only 1 row or nrow(df) rows). I will have to think about it and comment.

matthieugomez commented 4 years ago

Ok. Yes, sorry, I do not completely understand the current syntax in master. Hopefully, though, I have communicated the kind of syntax I am hoping for.

bkamins commented 4 years ago

Yes - I think it is clear. Let me summarize how I now think we can handle all the cases (also from the https://github.com/JuliaData/DataFrames.jl/issues/2172 discussion). I describe select but transform is analogous:

@nalimilan - I can quickly implement it if we go for this. I guess all needs of @matthieugomez are satisfied with this design and also the requirement of tracking of row order after producing a data frame. For 0.21 it would be a simple implementation (not the fastest possible), but we can improve performance after 0.21 release.

pdeffebach commented 4 years ago

Thank you for your input on this. I think there are definitely changes to make here.

not available not available
select(df, :, :x => mean) by(df, :id, :, :x => mean)
select(df, :x => mean) by(df, :id, :x => mean)

We do not have an operation to "spread" the mean of a variable across the full data frame, but drop other columns. This is a good lapse and I thank you for pointing it out.

I like by. I think it is readable. Here is my attempt at a totally new syntax that I think would avoid all confusion. Here select does not collapse by default. Rather it "spreads" results to match the input data frame length.

select(df, :x => mean) selectby(df, :id, :x => mean)
transform(df, :x => mean) transformby(df, :id, :, :x => mean)
collapse(df, :x => mean) collapseby(df, :id, :x => mean)
matthieugomez commented 4 years ago

I don't think I entirely agree with @bkamins's synthesis of my proposal. My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations. Moreover dataframes remain aligned: the row number 10 in the previous dataframe refers to the same observation as the row number 10 in the next dataframe.

To be honest, I'm not sure what the main difference between transform and select is in your proposal?

matthieugomez commented 4 years ago

@pdeffebach I'm agnostic on whether we should have a transmute actually. It can simply be obtained by using transform (i.e. a function that keeps all rows and colums) and then select so it's fine to not have it.

For transformby, I think it's a bit inelegant to change the method name depending on whether the operation should be done by group or not. I don't get why a by kwarg does not work for you. But if people do hate this, I'm happy with just transform(x::GroupedDataFrame, ....), that returns a groupeddataframe, which can be combine'd afterwards, as long as row order remains the same.

pdeffebach commented 4 years ago

My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations.

I fully agree with this. I think that select shouldn't collapse by default and we should have a separate function name for collapseing. Too many keyword arguments will get complicated. I propose collapse if we don't get sued by Stata for using it.

I think it's a bit inelegant to change the method name depending on whether the operation should be done by group or not. I don't get why a by kwarg does not work for you.

My opposition to keyword arguments is

  1. by is easy to read. transformby(df, :id, ...) reads like a sentence. "by the column :id: transform the data frame...".
  2. These function calls might get very long. I don't want to read
transform(df, 
    :income => mean, 
    :income => std, 
    :educ => first, 
    :startage => first, by = :id)

only to find out the operations are performed by group at the end.

I agree that transformby is hard to type. I wouldn't mind gen being used instead of transform, i.e. gen and genby. But I don't know how much support I would get for my idea.

I'm happy with just transform(x::GroupedDataFrame, ....), that returns a groupeddataframe, and can be combine'd afterwards

I'm happy with this, too. But you have convinced me of the need for collapse and select to be separate, at the very least.

bkamins commented 4 years ago

I had a long discussion with @nalimilan about it but also having read your comments I have updated my thinking a bit and have the following conclusion.

We should have the following functions

So in summary - the major difference from dplyr is that in DataFrames.jl groupby has a slightly different functionality than there and the consequence is that the table you have proposed would look like (assuming by kwarg just to keep one convention):

select(df, :x => mean) select(df, :x => mean, by=:id)
transform(df, :x => mean) transform(df, :x => mean, by=:id)
collapse(df, :x => mean) by(df, :id, :x => mean)

With the only caveat that collapse could produce more than one row depending on what the transformations request. Similarly by.

Then on top of this table we have combine and map on GroupedDataFrame that are an independent functionality serving a different purpose as GroupedDataFrame itself can be transformed as I noted.

matthieugomez commented 4 years ago

Well, that would be awesome...

I guess my only question is: will it complicate the package too much? I realize that you and @nalimilan are basically the only maintainers these days, so I get that we want to avoid a multiplication of methods.

On a related note, I guess I don't really understand why GroupedDataFrame is needed then, if select, transform, collapse get a by argument. Would not it dramatically simplify things to remove it or put it in a separate package?

bkamins commented 4 years ago

GroupedDataFrame is needed because it also:

The basic building blocks of what is proposed above is something we have now:

  1. current combine (which would stay unaltered)
  2. current select (which would be renamed to collapse)

Now (technical implementation migtht differ a bit):

  1. future select and transform without by (so not grouping) would be simply ensure that all output has nrow(df) rows which is very simple (you just need to check if the first column you produce has this number of rows, and the rest is the same as we already make this check anyway but just allowed any number of rows)
  2. future select and transform with by (so grouping) would just use by function but adding one virtual column axes(df, 1), call it :__dummy__ to a data frame that would contain axes(df, 1) values, then we would add just one operation :__dummy__ (i.e. retain __dummy__) column to by processing. In this way we ensure: a) each group may not change its length, b) after by finishes we use idx field of GrupedDataFrame to recover the original order of rows in linear time

So in summary - the change is relatively small. Fortunately none of this functionality exists in 0.20 so we can do whatever we want and the current design for 0.21 is flexible enough to cover all that you request here relatively simply.

bkamins commented 4 years ago

An additional comment based on the discussion with @nalimilan on Slack (this is my understanding of things based on the input from @matthieugomez, @pdeffebach and @nalimilan + my own opinions):

Q1: Why we do do not allow passing GroupedDataFrame to select and transform?

A1: Because select and transform have an invariant that they always return nrow(df) rows and they are in the order of rows in df. The problem is that GroupedDataFrame has sort and skipmissing kwargs that respectively reorder and remove rows from df. Aslo GroupedDataFrame can be subsetted/reordered. Theoretically we could define what consequences reordering and subsetting should have on the result of select taking such a GroupedDataFrame but such rules would be complex, and for sure not easily grasped by the regular user of DataFrames.jl. Also - if we really find it useful in the future adding support for GroupedDataFrame passed to select can be always added as it will be non-breaking.

Q2: Why do we need collapse as a separate function.

A2: since select and transform have a very clear contract "always return nrow(df) rows and they are in the order of rows in df" adding a kwarg to them that would allow to choose if we want all rows or any number of rows would be a mental overload. It is much better to have a separate function that clearly signals that it can change number of rows. Now it is enough to have combine that is similar to select as for transform it does not really make sense to have a "collapsing" behaviour (as we want to keep existing columns which have nrow(df) rows anyway)

Q3: Why do I prefer select and tansform to have by keyword argument rather than define selectby and transformby functions.

A3. This is a tough decision. I was thinking about it and I think that it is better not to pollute the namespace with too many functions. by will be a kwarg, so it should be clear enough what the behaviour is (this is like with select in SQL, and actually better than syntax in dplyr where you have to know what was the type of the input to know how the function will work. Here we will have a clear visual signal of by keyword that we are working in groups).

Q4. Why do we keep combine and map for GroupedDataFrame.

A4. First of all - they are useful. You firs "tweak" your GroupedDataFrame (like subsetting, reordering etc.) and then call them to get a desired result. Also it is better to be backward compatible than be breaking if there is no clear benefit of being breaking. Also map and combine have a bit different API that in particular allows for do-syntax, which again is useful if we are in "collapse" mode.

Q5. Should we keep by or deprecate it and instead define collapse with by keyword argument

A5. After thinking my preference is to deprecate by. The reason is to reduce the name-space pollution, especially by such a short function name. The deprecation would go:

In particular this will resolve the problem I have always had that by(df, :x1, :x1, :x2) is a bit hard to parse - you have to remember that the second positional argument is key, but in the context `by(fun, df, :x1) actually third positional argument is key.

In summary the table we discuss would be expected result ungrouped operation grouped operation
retain number and order of rows, drop old columns select(df, :x => mean) select(df, :x => mean, by=:id)
retain number and order of rows, keep old columns transform(df, :x => mean) transform(df, :x => mean, by=:id)
any number of rows, order of rows determined by the way you perform grouping (sort and skipmissing kwargs in particular), drop old columns collapse(df, :x => mean) collapse(df, :x => mean, by=:id) + old combine and map applied to GroupedDataFrame + deprecate by

additionally we support keepkeys kwarg for all functions (determining if grouping columns should be retained or not). This kwarg should be also added to DataFrame working on GroupedDataFrame.

Now the important thing is that these definitions have a very nice feature that left column (ungrouped operation) is exactly right column (grouped operation) when by=[] (which will be the default) as by=[] essentially creates one group containing a whole data frame. I think it is a nice symmetry showing that the design is consistent.

Please up- or down- vote this proposal. I am now personally convinced to go this way so if no opposing voices will be made I will make a PR implementing this (fortunately it is relatively easy) when #2199 is merged so that we can include it in 0.21 release.

matthieugomez commented 4 years ago

Well I think it's terrific. Thanks a lot @bkamins. (also @jmboehm may be interested).

jmboehm commented 4 years ago

I have limited experience with the current interface, but for what it's worth, I think both the request and the proposal make a lot of sense.

I'm not entirely sure how we are meant to perform operations on subsets of the rows, without deleting the other rows. This is one of the issues I've been facing when implementing a Stata-like interface. But perhaps that's an issue for another day.

Thanks a lot to everyone involved!

pdeffebach commented 4 years ago

@bkamins I think this is a great idea. I still maintain a dislike for keyword arguments.

I have implemented, in the crudest way possible, a non-keyword argument version of all this in a PR here. I use Stata-esque names to avoid name conflicts with existing functions. I think everyone in this thread will be able to understand what each function does.

I think that we should deprecate by with the Pairs argument. However I think

by(df, :a) do sdf
   f(sdf)

is still a very powerful and convenient syntax.

For reference, here is my list of new functions implemented in #2210

@jmboehm Can you file an issue about replicating stata's if syntax? I have some ideas, but we should keep discussion focused on this subject.

nalimilan commented 4 years ago

Just a few comments regarding @bkamins' detailed proposal. Overall I like it, I just have a few reservations:

bkamins commented 4 years ago

@jmboehm thank you for your thoughts. Here are my comments to the questions you have raised.

Or perhaps I'm just generally not very fond of "=>". What does that mean? "implies", "is mapped to", or an assignment operation?

So :a => :b is a shorthand for :a => identity => :b. And :a => fun => :b means pass column :a to fun and store the result in :b.

how we are meant to perform operations on subsets of the rows, without deleting the other rows

use a view

EDIT - oh - I understand you want to do some operation conditionally on some other column?

something like [:a, :b] => (a, b) -> ifelse.(a .< 0, b, 0)?

pdeffebach commented 4 years ago

@nalimilan thank you for this feedback. I think there is a consensus around this kind of proposal, unless there are truly insurmountable edge cases.

Do you imagine transform(gd::GroupedDataFrame) to return a grouped dataframe. Perhaps a keyword argument to all 3 functions (keep, gen, and collapse for now), which preserves grouping and does not return a data frame.

pdeffebach commented 4 years ago

I have created #2211 for discussion about if syntax and subsetting.

matthieugomez commented 4 years ago

I'll just give my two cents about kwargs vs type, but, in any case, I'm happy with either proposal.

On the one hand, I like @nalimilan's proposal because it keeps everything super simple. On the other hand, I'm worried that GroupedDataFrames is so different from DataFrame than it makes too complicated. If you end up, in a future version of DataFrames, allowing a by kwarg in transform, then the whole syntax becomes more complicated in the end.

Is it worth thinking about defining a GroupedDataFrame type that is much more similar to DataFrame, as in dplyr? The current GroupedDataFrame type could then be renamed to GroupedDataFrameIterator, with a function eachgroup(df::GroupedDataFrame) -> GroupedDataFrameIterator

matthieugomez commented 4 years ago

I guess something I don't understand in @nalimilan's proposal is whether the output of transform/select of a GroupedDataFrame is a GroupedDataFrame or a DataFrame.

bkamins commented 4 years ago

@nalimilan - thank you for the comment. I will summarize the possible design under these circumstances in the post that follows (including @pdeffebach's comment about returning a DataFrame or a GroupedDataFrame).

Here let me add some more general comments.

@matthieugomez - I think @nalimilan wants to give an option to choose what should be the output. i also comment on this below.

Just as a note to @pdeffebach (rephrased what I commented in #2210). In DataFrames.jl we want to provide a minimal set of functions that provide the required functionality. Therefore if we end up with the design in which:

by(df, :a) do sdf
   f(sdf)
end

will be possible to achieve in a different and easy enough way we will probably remove it. I know that in the past DataFrames.jl did not always follow this rule strictly and everywhere but as we go for 1.0 this is needed:

For example in #2211 I assume we will clarify what functionality is needed. If is is easily achievable when composing current functionality then probably we will not add it, but if we decide that it is very hard to achieve it without adding something to the "core" of the package then we will add it.

bkamins commented 4 years ago

The issue about what GroupedDataFrame is a focal point here. It is also related to https://github.com/JuliaData/DataFrames.jl/issues/2106 as this issue should be resolved also (e.g. in theory someone might want to expand a 0-row group into something).

Given the amount of decisions that are to be made I start getting a feeling that we will not be able to resolve all issues in a way in 0.21 release, so that we ill not be breaking later. So maybe we will have to decide on allow for breaking changes between 0.21 release and 1.0 release.

What I think the consensus is is what we need for operating on a DataFrame. And this for sure can "go into" 0.21 release. If we decided to stay with this we would for the time being leave by and combine in 0.21 release as legacy and announce that after 0.21 release a redesign of split-apply-combine infrastructure will be done. But maybe (and hopefully) we will quickly settle on the functionality for split-apply-combine part. Then it can also go into 0.21 release.


So the functionality for data frame seems to require three functions, all of them take a a data frame and return a DataFrame:

In this thread of the discussion I think we can concentrate on finding an appropriate name for the third function.


The functionality of the grouped data frame is more complex. The starting point is what @matthieugomez said about adding another type (a la dplyr). The good thing is that currently we do not allow to change GroupedDataFrame in-place (but only to generate a new GroupedDataFrame based on it). If we kept this approach then we could add a second parameter to a GroupedDataFrame that would signal if it in cannonical form (not reordered nor subsetted) or modified in some way (this should be enough for our purposes I think - but maybe I do not see something so please comment). The benefit of such an approach is that we would not redesign everything from scratch.

Below I assume that we will settle on combine name in the "data frame passed" case.

Then following the proposal of @nalimilan:

So essentially the difference in comparison to the earlier proposal is dropping by keyword argument and instead dispatching on type and adding one extra keyword argument that will govern the type of the return value.

pdeffebach commented 4 years ago

I have updated #2210 with the proposal described by Bogumil above, still with the names gen, keep, and collapse. I think it's a very good proposal! I would be very happy to work with it and would trust myself to promote it to new users!

matthieugomez commented 4 years ago
bkamins commented 4 years ago

I guess, there will be a ! versions of select, transform, collapse

for a DataFrame argument yes - and this is already implemented.

For GroupeDataFrame this is problematic. Let me give a quick example of an R session:

> df <- data.frame(g=c(1,1,1,2,2),x=1:5)
> gdf <- group_by(df, g)
> summarize(gdf, mean(g), mean(x))
# A tibble: 2 x 3
      g `mean(g)` `mean(x)`
  <dbl>     <dbl>     <dbl>
1     1         1       2  
2     2         2       4.5
> gdf$g[1] <- 3
> summarize(gdf, mean(g), mean(x))
# A tibble: 2 x 3
      g `mean(g)` `mean(x)`
  <dbl>     <dbl>     <dbl>
1     1      1.67       2  
2     2      2          4.5

I do not like this design.

In DataFrames.jl we are clear that GroupedDataFrame is a view and views are by definition assuming that their parent is not mutated. Also note that we have SubDataFrames in DataFrames.jl and GroupedDataFrame could be defined on such object.

So - in short:

I don't really see an issue with using transform/select on a non-canonical grouped DataFrame.

Well take this GroupedDataFrame:

julia> df = DataFrame(g=[3,1,2,3,1,2], x=1:6)
6×2 DataFrame
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 1     │
│ 2   │ 1     │ 2     │
│ 3   │ 2     │ 3     │
│ 4   │ 3     │ 4     │
│ 5   │ 1     │ 5     │
│ 6   │ 2     │ 6     │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 3 groups based on key: g
First Group (2 rows): g = 3
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 1     │
│ 2   │ 3     │ 4     │
⋮
Last Group (2 rows): g = 2
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 3     │
│ 2   │ 2     │ 6     │

julia> gdf2 = gdf[[3,2]]
GroupedDataFrame with 2 groups based on key: g
First Group (2 rows): g = 2
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 3     │
│ 2   │ 2     │ 6     │
⋮
Last Group (2 rows): g = 1
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 1     │ 5     │

what should be the result of select(gdf, :x => identity => :x) and select(gdf2, :x => identity => :x)?

What I was suggesting is to create a type GroupedDataFrame that inherits from AbstractDataFrame.

I understand this, but could you please specify exactly how this type should behave exactly (should it be a view or a copy, should it precompute groups on creation and store them or only carry over information about grouping columns but do not store them, should it have a special column type for grouping columns that make them read only, should you be allowed to add rows to it, should you be allowed to add columns to it, should you be allowed to remove columns from it, etc. - probably the list here is longer, but these are probably major questions). Maybe you have answers for these questions and then we can quickly move forward. If not, and if you find the current design (trying to slightly modify current GroupedDataFrame only without designing a new type) insufficient, then what you propose should be discussed but it is probably several months of heavy design work (which is OK, but it means that we go for the option: release 0.21 without significantly modifying split-apply-combine ecosystem, and work on the breaking changes after that release).

matthieugomez commented 4 years ago

in the proposal by @nalimilan with GroupedDataFrame passed around these functions are problematic to provide. What you would probably do is generate a 1-column data frame and in the second step add this column to the original data frame

This does not sound good. I think it's a good reason to prefer kwargs.

bkamins commented 4 years ago

Well technically such select! could work on a cannonical GroupedDataFrame{DataFrame} but would have to be disallowed for GroupedDataFrame{SubDataFrame}. However, I found it a bit confusing (but now that I think of it maybe this could be allowed with this distinction).

Let me summarize the pros and cons of both proposals:

Using kwarg by:

Not using by kwarg, but allowing passing a grouped object to functions:

As I have written it down I feel I would prefer option 2 in the long term (as proposed by @nalimilan), but for it to work we need to work out the "grouped data frame" case.

You have not commented on the problems I see with select on "non-cannonical" data frames and the problems I see with the current design of the result of group_by in dplyr. It would be valuable if you gave your opinion for it. But assuming you agree with me and we stay with the current GroupedDataFrame (which is easiest implementation-wise) then we could have:

then select!, transform! methods would mutate the parent of the GroupedDataFrame (which would automatically be reflected in GroupedDataFrame as it always takes all its columns).

bkamins commented 4 years ago

I was thinking about it today some more and my conclusion is the following:

  1. we should split "core" of DataFrames.jl (low level API) into a separate package (as commented in #1764). The reason is that things we discuss here are convenience functionalities and there are many possible competing (and compelling) frameworks for it. In particular we reserve names like select or transform which should be free to use by extension packages.
  2. Still in DataFrames.jl we should provide a reference implementation of high-level API.

Here I write what conclusion I have for this high level API currently (but bear in mind that given what I have just written actually I think it is encouraged to design other high-level APIs that could be built on top of the "core" low level API):

We should provide 5 functions (in the scope of this issue):

The signatures would be:

regroup can be true in which case a cannonical GroupedDataFrame is returned, if false (the default) a DataFrame is returned. We allow for this kwarg because it is more efficient to regroupimmediately (we know how to group without having to compute grouping again). If regroup=true we throw an error if keepkeys=false (it does not make much sense otherwise)

Other kwargs have current meaning.

Note that for select! and transform! for GroupedDataFrame it does not matter much what we set for regroup as both parent DataFrame and GroupedDataFrame passed will be modified.

Now why I opt for combine name for the last operation. First - it is non breaking. Second, we will read it as "combine rows", which makes some sense (none of the functions we considered were ideal).

Given this design both by and map for GroupedDataFrame would be deprecated (by duplicates functionality that is easy to get otherwise, map will be just combine with regroup=true). In this way we are free to decide what would be the use of map.

Finally DataFrame on GroupedDataFrame would get keepkeys kwarg.

Sorry that the posts are lengthly, but we are desinging a complex ecosystem and details matter; I hope I have not mixed up something in the descriptions. I hope this design is something you find acceptable and useful. If yes. After #2199 I would go forward to implement it.

Also as I have noted above after 0.21 I would go forward to decouple all this from the low-level API (that would be moved to DataFramesBase.jl) to allow other high-level APIs to be implemented (this one would be just a reference implementation - still with the aim to be useful).

matthieugomez commented 4 years ago

Honestly I don’t know enough about how people use groupeddataframes to bring substantive points to this discussion (kwarg vs gdf). That being said, the final proposal looks good to me. Thanks a lot — especially for considering making transform! work with grouped data frames.

pdeffebach commented 4 years ago

+1 to all of this. I can't comment on the contract about mutating the parent of a grouped data frame, but if that's surmountable it would be great. I went over your R example and also find the behavior unintuitive. Perhaps we can disallow modifying a key column.

Thank you everyone for their detailed proposals and thoughtful comments.

bkamins commented 4 years ago

the contract about mutating the parent of a grouped data frame

It is "surmonutable" as you say for "cannonical" GroupedDataFrame as it is 1 to 1 mapping to the parent data frame (no rows are removed/reordered) and fortunately GroupedDataFrame does not subset columns. Finally select! and transform! guarantee not to reorder/subset rows. This means that if we modify the parent the GroupedDataFrame will still be valid (this prompts me that probably keepkeys should be disallowed in select! and transform! and it should be always true (in other words - if you want to do a ! operation on GroupedDataFrame you are not allowed to remove grouping columns from the parent), as otherwise derived GroupedDataFrame would be invalidated - I guess it is not a problem and is an intuitive case).

Taken all this into consideration the only risk is that some other GroupedDataFrames would be backed by the same parent but with different grouping columns - and that other GroupedDataFrame would get invalidated, but I think we can be explicit enough in the documentation to warn users about this case.

Note that this is a different situation than the one we discuss in #2211. The problem is that if you add a column to a SubDataFrame with some name, this column potentially can exist in the parent data frame so there would be a conflict (still this also is fixable but first I would like to understand in #2211 if this is really needed to be added).

The general thinking is that view should not modify parent (if possible) as there are potentially other views based on the same parent that might get invalidated and it is easy to forget about it (but as I have said - for GroupedDataFrame I think we can allow this as this is a very specific use case - as opposed to setindex! in #2211 which is a very fundamental operation).

Perhaps we can disallow modifying a key column.

In the proposed design the problem that dplyr has does not exist (we explicitly check if grouping column has remained unchanged in combine which will be still a workhorse of the whole solution). We would have this problem if we created a new type DataFrameWithGroups (call it tentatively) that would be an AbstractDataFrame but with information about grouping columns. If we went this way (we currently do not, but some "extra" package is free to define it) then grouping columns should probably be stored as https://github.com/bkamins/ReadOnlyArrays.jl. But this design has some problems as when we make an array read only we lose its type information and Julia does not allow multiple inheritance currently and most of array types are not trait based, chiefly CategoricalArray and PooledArray would be a problem and they are important for performance reasons. Again - these cases could be worked around, but I did not want to overly complicate things on top of the current design.

Thank you all for discussing this.

nalimilan commented 4 years ago

Sounds like a good plan! I think discussions about special cases can be handled later, like what to do with "non-canonical" GroupedDataFrame (in which groups have been dropped or reordered).

Regarding the idea of having GroupedDataFrame <: AbstractDataFrame, that was my goal originall. But I asked Hadley Wickham and he said he would have liked to change this in dplyr so that group_by returns an object which is not a data.frame. Then I realized that we don't really need GroupedDataFrame to behave like a DataFrame for most basic operations: for example, it's not very useful to have gdf.col return its parent's column since that doesn't help you to perform by-group operations. OTOH what is definitely useful is to have select and transform work like for DataFrame, but operating by groups (what we are discussing here).

@matthieugomez Do you see something in particular that having GroupedDataFrame <: AbstractDataFrame would allow that wouldn't be possible otherwise?

matthieugomez commented 4 years ago

@nalimilan I'm not sure. My initial reaction is that I find it confusing to do gdf=groupby(df), see something in the REPL that looks completely scrambled, but then have tranform!(gdf) returns the original df with an added column. But maybe I'm wrong and it is not that confusing.

bkamins commented 4 years ago

As a side note groupby(df) is not allowed.

Now - the issue you report is printing related. The original thinking (not mine - it was implemented long before I started working on DataFrames.jl) was that it is more useful to show groups in GroupedDataFrame rather than one table just with e.g. information about grouping columns and number of groups. But we can change it. If you have some better proposal for show please open a separate issue as it is only output related.

matthieugomez commented 4 years ago

@bkamins My point about printing was really not intended to be a proposal, just an answer to @nalimilan's question. I am really not knowledgeable enough to have a well formed proposal about GroupedDataFrames.

That being said, if we are all still hesitant about how to work with GroupedDataFrames, and you want DataFrames to be 1.0 soon, maybe you could consider separating the type GroupedDataFrame in a different package (together with its methods for select/transform/combine).

bkamins commented 4 years ago

maybe you could consider separating the type GroupedDataFrame in a different package

This is exactly the plan, these functions would not go to DataFramesBase.jl.

if we are all still hesitant

Well - I am not hesitant, though I understand that different users might find different things useful. Note however, that there is little value added of making GroupedDataFrame a subtype of AbstractDataFrame (you can always use its parent to do whatever you like). While the key benefit of GroupedDataFrame is a fast lookup of groups with a convenient interface for it:

julia> df = DataFrame(a=repeat(1:3, 4), b=repeat(1:2, 6), c = 1:12)
12×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 1     │ 3     │
│ 4   │ 1     │ 2     │ 4     │
│ 5   │ 2     │ 1     │ 5     │
│ 6   │ 3     │ 2     │ 6     │
│ 7   │ 1     │ 1     │ 7     │
│ 8   │ 2     │ 2     │ 8     │
│ 9   │ 3     │ 1     │ 9     │
│ 10  │ 1     │ 2     │ 10    │
│ 11  │ 2     │ 1     │ 11    │
│ 12  │ 3     │ 2     │ 12    │

julia> gdf = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = 1, b = 1
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 1     │ 1     │ 7     │
⋮
Last Group (2 rows): a = 3, b = 2
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 6     │
│ 2   │ 3     │ 2     │ 12    │

julia> gdf[(a=2, b=1)]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 5     │
│ 2   │ 2     │ 1     │ 11    │

If we wanted GroupedDataFrame a subtype of AbstractDataFrame this interface would be problematic and a fast lookup by key is a functionality that is very useful in many contexts so it should be easily accessible.

As a note gdf[(a=2, b=1)] is not only convenient (as this is subjective) but it is also very fast (as fast as Dict lookup in Base). To be honest - I am not really clear how you can achieve it quickly (i.e. in a way that is fast) in dplyr - if you know and could comment on it it would be an interesting comparison.