JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Grouping API consistency and improvements #1256

Closed nalimilan closed 4 years ago

nalimilan commented 7 years ago

Our grouping API is currently lacking, as it does not allow performing the most common operations in the most efficient way (see this discussion and this one), and it is not very user-friendly in all cases.

Comparing with other software, our API is quite similar to Pandas, which provides three different functions (see also this summary):

dplyr takes a quite different approach:

Overall, our API is quite similar to Pandas, though we do not provide an equivalent of their aggregate for the simplest case where a function always returns a scalar. It is not clear whether we need it for performance, or whether we could use inference in our aggregate to automatically use a fast path when possible (see above).

On the contrary, our API is quite different from dplyr's, which provides convenient functions to create new columns based on per-group operations (aggregation or arbitrary transformations), using a syntax similar to what Query or DataFramesMeta provide. Of course our by allows this kind of operation, but it makes it hard or impossible to do this efficiently, since inference and specialization are problematic (see above). This kind of problem can only be solved using macros, in order to have access to the names of the columns which need to be used (and specialized on), so it is probably better left to Query and DataFramesMeta. I'd say we should concentrate on ensuring our limited Pandas-like API is consistent and efficient, and point to other packages in our documentation for use cases which cannot be handled efficiently through it.

One big limitation we currently have compared with both Pandas and dplyr is that a simple case like computing a grouped mean for a single column is inconvenient and slow. Pandas allows accessing the pseudo-column of a grouped data frame and calling mean() on it. dplyr allows a similar operation by calling mutate(grouped_df, m = mean(col)). On the contrary, we require either using by, which is slow and not so convenient, or using aggregate, which applies the operation to all columns (and fails when some are not numeric, which should be fixed).
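To make the comparison concrete, here is a sketch of the single-column grouped mean in each API; the Julia calls use the by/aggregate signatures from the time of this discussion, and the column names are illustrative:

```julia
using DataFrames, Statistics

df = DataFrame(g = [1, 1, 2, 2], x = [1.0, 2.0, 3.0, 4.0])

# Pandas:  df.groupby("g")["x"].mean()
# dplyr:   summarize(group_by(df, g), m = mean(x))

# DataFrames, via `by`: slow, since the anonymous function cannot be
# specialized on the column types, and fairly verbose:
by(df, :g, d -> DataFrame(m = mean(d[:x])))

# or via `aggregate`, which applies `mean` to *all* columns and fails
# if any of them is non-numeric:
aggregate(df, :g, mean)
```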

Ideas for possible improvements:

xiaodaigh commented 7 years ago

I did some research and managed to beat data.table in some groupby cases.

Key areas to work on are string hashing and better parallelization. There may be a way to merge the work in FastGroupBy.jl into DataFrames by rewriting the grouping backend to be radix-sort based for bit types.

See post https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell

See code here https://github.com/xiaodaigh/FastGroupBy.jl

nalimilan commented 7 years ago

Interesting. Let's discuss this elsewhere, as this issue was supposed to deal with APIs, not implementations.

nalimilan commented 6 years ago

A good solution to make by faster would be to add an AbstractDataFrame type which would encode column type information, and which would be used to pass sub-data frames to the user-provided function. That would allow specializing the function on the types of the inputs. I have done some experiments by wrapping a NamedTuple.

pdeffebach commented 6 years ago

Bumping this. I have been playing around with aggregate recently and found something that made my computer stall:

df = DataFrame(rand(50_000, 6000));
df[:id] = rand(1:25_000, 50_000); # 2 observations per group, on average
aggregate(df, :id, mean) # takes a very long time

I am not sure how contrived an example this is, but it's one we are more or less running into in our work in Stata, and Stata isn't handling it that well either. I think the problem is that the GroupedDataFrame operation isn't using all the type information it could be using.

https://github.com/JuliaData/DataFrames.jl/blob/1c468d6dd6d7e090306628173d436ab66b8387f7/src/groupeddataframe/grouping.jl#L370

You can see an Any[] there.

bkamins commented 5 years ago

@nalimilan what is left from this issue to be done?

nalimilan commented 5 years ago

I think several points still need to be addressed, e.g. regarding aggregate. I'd also like to check that the API is OK before closing this.

bkamins commented 5 years ago

👍 Thank you!

nalimilan commented 5 years ago

Just leaving a note that it could make sense to replace aggregate(gd, f) with something like combine(gd, All() => f) or combine(gd, : => f), with All or : a special selection object similar to JuliaDB selectors. That would make the vocabulary much easier to grasp than using arbitrary terms like aggregate (combine and aggregate mean the same thing). dplyr uses summarize_all for this. This could be extended using other kinds of selectors like in JuliaDB.

The only question is whether we also need a replacement for aggregate(df, f). I guess we could support combine(df, ...) for all variants of combine, making it work as if df was the only group in a GroupedDataFrame (and without grouping columns).
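A sketch of the proposed vocabulary (hypothetical at the time of writing; All is the JuliaDB-style selector mentioned above):

```julia
# Replace aggregate(gd, f) with a selector-based spelling:
combine(gd, All() => f)   # apply f to every non-grouping column

# And support plain DataFrames by treating them as a single group
# (without grouping columns):
combine(df, All() => f)   # would replace aggregate(df, f)
```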

bkamins commented 5 years ago
  1. I am in favor of removing aggregate for df. We have mapcols, which is cleaner in this respect (it does not cover all the functionality, but for me everything above what mapcols does is too much magic).
  2. With combine I have a mental problem that the => syntax is reserved for applying f to what is on the LHS.

So maybe:

nalimilan commented 5 years ago

I am in favor of removing aggregate for df. We have mapcols, which is cleaner in this respect (it does not cover all the functionality, but for me everything above what mapcols does is too much magic).

That's tempting, but both dplyr's summarize and Pandas' aggregate support data frames and grouped data frames, so I think we also need a function working on both.

With combine I have a mental problem that the => syntax is reserved for applying f to what is on the LHS.

Good point, so indeed : => f is not really correct. Conceptually that would be : .=> f (along the lines of https://github.com/JuliaData/DataFrames.jl/pull/1620), which of course cannot work. combine(gd, All(collection) => f) would be OK. Though a problem is that All(collection) => f could also mean we want to pass all the columns in collection to f at once. That's the same thing as collection => f for All, but we could support other selectors like Between and Not as in JuliaDB, in which case the meaning is less obvious. Of course it's not very likely that you would want f to take a list of variables that you cannot list one by one, but would it be OK to not even support that use case?
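The ambiguity can be spelled out explicitly; all of the syntax below is hypothetical:

```julia
# Columnwise reading: f applied to each selected column separately,
combine(gd, All(:a, :b) => f)   # ~ combine(gd, :a => f, :b => f)

# Multi-argument reading: all selected columns passed to f at once,
combine(gd, All(:a, :b) => f)   # ~ f(a_column, b_column)

# When the columns can be listed one by one, the second meaning
# already has a spelling that does not need a selector:
combine(gd, (:a, :b) => f)
```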

bkamins commented 5 years ago

so I think we also need a function working on both.

I am OK with going the way you want (I just wanted to hint that mapcols does a very similar thing).

By All(collection) I meant something like All([:a, :b]) or All([[:a,:b], [:c,:d]]) (to pass two columns, two times).

Now I actually see that it can be even simpler, like All(:a, :b) or All([:a, :b], [:c, :d]), and even (if we want to be extra fancy) All(targetcolname1=[:a, :b], targetcolname2=[:c, :d]). And then All() could still be a shorthand for all columns.

Edit: by All(:a, :b) I mean applying f to column :a and to column :b, producing two results.

nalimilan commented 5 years ago

The last step to complete to close this issue is to deprecate aggregate. For GroupedDataFrame, it can be replaced with combine(gdf, All() .=> f) (thanks to https://github.com/JuliaData/DataAPI.jl/pull/10). But there's also a DataFrame method, for which combine isn't the best term. A more natural deprecation for it would be select(gdf, All() .=> f). We could imagine supporting combine on DataFrame too, treating them as a single group, but that can be added later if it turns out it's a useful generalization.
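A sketch of the proposed deprecations, assuming the broadcasted pair syntax from DataAPI.jl#10 (the select call is written for a plain DataFrame, which seems to be the intent of the DataFrame method mentioned above):

```julia
# For a GroupedDataFrame:
combine(gdf, All() .=> f)   # would replace aggregate(gdf, f)

# For a plain DataFrame, where `combine` reads oddly:
select(df, All() .=> f)     # would replace aggregate(df, f)
```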

bkamins commented 5 years ago

Two comments:

  1. All() in combine should "drop" grouping columns, but if All is given some arguments it should work normally (this means that All() for DataFrame and GroupedDataFrame works consistently if we use the definition that All() captures all non-grouping columns, as in a DataFrame there are no grouping columns)
  2. With select we should coordinate with JuliaDB.jl (as it works elementwise there, AFAICT).

@piever - regarding the second point. We need two kinds of transformation operations:

  1. working elementwise (rowwise)
  2. working on whole columns

How do you handle this distinction in JuliaDB.jl (it seems that select works rowwise, then the question is how do you handle transformations that take a whole vector and transform it?)
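The distinction between the two kinds of transformation can be illustrated on a plain vector, independently of any table API:

```julia
col = [1, 2, 3, 4]

# 1. elementwise (rowwise): each output element depends on one input element
rowwise = abs2.(col)      # [1, 4, 9, 16]

# 2. whole-column: the function needs the entire vector at once
colwise = cumsum(col)     # [1, 3, 6, 10]
scalar  = sum(col)        # 10, a vector-to-scalar reduction
```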

bkamins commented 5 years ago

@piever This is my current conclusion from the discussion with @nalimilan about what would be best for DataFrames.jl. Please comment on what you think of it (point 1 is probably the most relevant to JuliaDB.jl):

  1. select would work on columns of a DataFrame, deprecating aggregate in this case
  2. I would add a map for DataFrame that would work rowwise (also note that if a function works on vectors, you can always apply it to rows using broadcasting)
  3. I would make combine deprecate aggregate for GroupedDataFrame

In this way map and select would perform two different functions (and be naturally complementary). If they both worked rowwise, their functionality would overlap.
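The proposed division of labor, sketched (hypothetical API; only the eachrow spelling worked at the time):

```julia
# columnwise: select takes whole-column transformations
select(df, :a => cumsum)

# rowwise: a hypothetical map over rows...
map(row -> row.a + row.b, df)

# ...which today has to be spelled via eachrow, or, for a function
# that works on vectors, via broadcasting over the columns:
map(row -> row.a + row.b, eachrow(df))
df.a .+ df.b
```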

I think it would be great to settle this before DataFrames.jl 1.0 release (so hopefully soon).

piever commented 5 years ago

select would work on columns of a DataFrame, deprecating aggregate in this case

select in IndexedTables is a very complicated beast, in that select(t, v::AbstractVector) returns v, select(t, i::Union{Int, Symbol, SpecialSelector}) selects columns, passing a column name and function maps the function row-wise and select(t, tup::Tuple) returns a table with the various selected vectors for each element in the tuple (see here).
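The overloads described above, sketched (illustrative; see the linked IndexedTables docs for the authoritative semantics):

```julia
using IndexedTables

t = table((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))

select(t, [10, 20, 30])      # an AbstractVector is returned as-is
select(t, :a)                # a column, by name, index, or special selector
select(t, :a => x -> 10x)    # a function, mapped row-wise over column :a
select(t, (:a, :b))          # a tuple: a table of the selected vectors
```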

select is often used in combination with other parts of the API, for example, if I do groupby(f, t, by, select = :a => f) it first computes select(t, :a => f) and then uses that as data for groupby. Because of this, changing select would be very breaking (a lot of API functions have a select keyword that always calls the select machinery).

Among others, transform inherits the same API: one can do transform(t, :a => v) (replace a column by a vector) or transform(t, :a => :b => f) (replace a column by a function applied element-wise). When adding transform to DataFrames, do you plan on having it work column-wise like select, or row-wise? In my mind, it's important that the two functions be consistent.

In JuliaDB, working on the columns directly is IMO not as common as in DataFrames, as it doesn't work well in the distributed case. The main functions to do this are groupby and summarize (which is similar to DataFrames.aggregate and is just syntactic sugar over groupby). So actually, the use case you have in mind for colwise select is simply groupby with by = () in JuliaDB.

I suspect it'd be nicer, just like there is reduce and groupreduce, to have apply and groupapply (where apply is DataFrames' colwise select) or, even better, reduce(f, group(df, key)) and apply(f, group(df, key)).

Would apply and groupapply (or apply(f, group(df)), for some name of apply; maybe this is not the best one) be a sensible choice instead of the proposed select and by for DataFrames, or is it too late in the game for this kind of change? It'd be great if we could also fix the inconsistencies in grouping terminology between DataFrames and JuliaDB.

I would add a map for DataFrame that would work rowwise (also note that if a function works on vectors, you can always apply it to rows using broadcasting)

I'm not sure I'm particularly qualified to comment on this one. On the one hand, if DataFrames don't iterate rows (you need eachrow) and their size is 2D, then I'm not sure why map would work on rows, for consistency one should do map(f, eachrow(df)) I guess, esp. since otherwise map would change the shape of the container. On the other hand, this feels useful in practice and it's odd to have filter but not map.

I would make combine deprecate aggregate for GroupedDataFrame

Even though JuliaDB has the syntactic sugar summarize, I think it's not a matter of having a new function but an appropriate "special selector". The All() .=> f idea is brilliant, especially since one can have Number .=> mean and things like that (to choose only numeric columns). As discussed above though, I'd rather call this apply/groupapply.

bkamins commented 5 years ago

Thank you for the extensive and insightful comment. I will add two short notes (probably @nalimilan will have more to say on this):

piever commented 5 years ago

(we have not considered reduce as I think it is mostly relevant in distributed context so there is no high pressure to add it now to DataFrames.jl)

One more tiny comment: I think that DataFrames actually uses something like groupreduce to optimize groupby for a set of commonly used operations, see for example here. Instead, IndexedTables accepts that groupby(:a => sum, t, :b) is slow and recommends instead groupreduce(:a => +, t, :b) (using OnlineStats for things that are a bit more complicated, like mean or higher moments). Still, I agree that this can be a post 1.0 thing.

nalimilan commented 5 years ago

Thanks. That's so hard! I think there are two different but related issues:

Actually the former issue has much deeper implications than the latter, so I've filed a separate issue to continue the discussion: https://github.com/JuliaData/DataFrames.jl/issues/1952.

The discussion about aggregate is much less problematic: at worst, we can keep aggregate, it doesn't harm much even though it is largely redundant with the more general API we want to introduce.

I must say I'm not really convinced by the proposal to use apply instead of combine: the name is too general and doesn't even imply that grouping is dropped; and it's the name of an operation in Pandas that is equivalent to our combine(f, df) (see OP), which increases confusion. summarize is a good term in most cases, but since we also allow returning multiple rows it's not perfect either. combine is actually quite accurate; the only problem is that it doesn't make a lot of sense when there's a single DataFrame: it's only OK if you think of it as a special case of a single group. So basically I'd say it's OK if we don't deprecate aggregate. I don't have a great solution right now... :-)

Regarding the standardization of the API between DataFrames and JuliaDB, there are several issues:

I think it would be fine to add reduce over GroupedDataFrame in the future, and maybe even groupreduce as a shorthand. But in exchange for this and for not defining select and transform with conflicting meanings, I would kindly ask JuliaDB to rename groupby to something else (why not by?), as this is an obvious clash in meanings (and our name is consistent with precedent in dplyr).

bkamins commented 5 years ago

Sticking to DataFrames.jl issues 😄, then I guess we stay with aggregate for now. The discussion in #1952 can probably be concluded post 1.0.

So can we close this issue for now, or are there some outstanding things to be discussed that are not covered in #1952?

nalimilan commented 5 years ago

I think it would be wise to check that we have a rough idea of where we want to go before 1.0, just in case it requires deprecating something (apart from aggregate).

bkamins commented 5 years ago

I agree, but what is the list of things to decide apart from https://github.com/JuliaData/DataFrames.jl/issues/1952 and the newly opened https://github.com/JuliaData/DataFrames.jl/issues/1953 (the latter does not have to go in by 1.0)?

What I mean is that I would prefer to keep a list of atomic things to do for 1.0 with a milestone set for them, as then it is easier to track what is left to do.

piever commented 5 years ago

I think it would be fine to add reduce over GroupedDataFrame in the future, and maybe even groupreduce as a shorthand. But in exchange for this and for not defining select and transform with conflicting meanings, I would kindly ask JuliaDB to rename groupby to something else (why not by?), as this is an obvious clash in meanings (and our name is consistent with precedent in dplyr).

Of course there needs to be consensus with the other JuliaDB maintainers, but, as far as I'm concerned, I agree that the name groupby is unfortunate: it is inconsistent with DataFrames and with JuliaDB itself (groupreduce is reduce applied group-wise, but groupby is not by applied group-wise).

To me, the only big thing that needs deciding soonish to settle this is whether there is a good name for the equivalent of reduce for a function that requires the whole vector and may give a vector or a scalar ("colwise select"). Basically, the function that's used internally so that :a => mean applied to a DataFrame gives the mean of column :a. The confusion seems to be that DataFrames has a concept of grouped data (which JuliaDB does not), and this function should merge groups.

The name combine makes sense here, but less so for JuliaDB (in some sense grouping there is given by primary keys, so no need to explicitly combine, the grouping will just be kept in the result: maybe this is more similar to dplyr). It also reads a bit odd in the ungrouped case.

I had proposed apply but it seems inadequate in the grouped case. Maybe apply(f, df; combine = true) (not sure what is the best default for the keyword) could replace combine and also be used as a "colwise operation dropping columns" (as select is not ideal due to name clash with JuliaDB)? Or even apply(f, df; combine = true, keepcols = false) to encompass combine, colwise select and colwise transform? This would solve the colwise issue in #1952 and make combine no longer necessary as far as I understand.

Personally, I like that apply(f, groupby(df, :a), combine = true) is clearly the group-apply-combine strategy. On top of that, apply(f, groupby(df, :a), combine = true, keepcols = true) would be a nice way to add columns computed as groupwise window functions to a DataFrame (at the moment it's not super-easy as far as I understand).

by(df, cols, args...; kwargs...) would then be exactly apply(groupby(df, cols), args...; kwargs...). In JuliaDB land it would make more sense to call this operation groupapply (in analogy with groupreduce), but I guess by is also a fine name.
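Collecting the proposal in one place (entirely hypothetical; neither apply nor these keyword arguments exist in DataFrames):

```julia
apply(f, df)                               # colwise operation, no grouping
apply(f, groupby(df, :a); combine = true)  # the split-apply-combine strategy
apply(f, groupby(df, :a); combine = true,
      keepcols = true)                     # groupwise window functions

# by would then be sugar:
# by(f, df, cols) == apply(f, groupby(df, cols), ...)
```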

nalimilan commented 5 years ago

I'm afraid this is a very JuliaDB-esque proposal. :-)

DataFrame objects don't store any grouping information, and GroupedDataFrame objects are completely different from DataFrame. So having a combine keyword argument would be a bit weird IMHO (besides being type-unstable). Also, GroupedDataFrame is a collection of SubDataFrames representing groups, so we already have map to apply a function to each SubDataFrame. Actually, the canonical split-apply-combine in DataFrames (for which by is a shorthand) is groupby-map-combine; we support combine with functions/pairs because it's more efficient to avoid having map create an intermediate GroupedDataFrame before combining.
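The canonical pipeline described above, written out step by step (a sketch using the API of the time; by(f, df, :g) was the one-call shorthand for the same thing):

```julia
gd  = groupby(df, :g)   # split
gd2 = map(f, gd)        # apply: f is called on each SubDataFrame
res = combine(gd2)      # combine the per-group results into one DataFrame

# combine(gd, f) fuses the last two steps, avoiding the intermediate
# GroupedDataFrame that map creates.
```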

If we added a combine keyword argument, that would only be for interoperability with JuliaDB, but not as an idiomatic DataFrames API AFAICT. But I guess that conversely, JuliaDB could add combine for the same reason.

Also as I said I don't really like apply as it doesn't indicate what it does: how does it differ from map, which sounds like a synonym? We could as well extend mapcols for this, at least it would be explicit that it operates over columns.

On top of that, apply(f, groupby(df, :a), combine = true, keepcols = true) would be a nice way to add columns computed as groupwise window functions to a DataFrame (at the moment it's not super-easy as far as I understand).

What do you mean by "groupwise window functions"?

piever commented 5 years ago

Also as I said I don't really like apply as it doesn't indicate what it does: how does it differ from map, which sounds like a synonym? We could as well extend mapcols for this, at least it would be explicit that it operates over columns.

I'm not completely sold on mapcols, as it behaves differently in the case with just one function, but I see that maybe we just need a good concept of a grouped table in JuliaDB, and then one would call map on that, so by is actually more of a groupmap (at least in JuliaDB, as vectors are not flattened together by default). In which case I guess it would be fine to repurpose groupby in JuliaDB to do the grouping and then just call map on it, and have by(f, df, cols) be map(f, groupby(df, cols)).

What do you mean by "groupwise window functions"?

I simply meant things like by(iris, :Species, :SepalLengthRank => :SepalLength => ordinalrank), i.e. adding a new column of the same length as the old one in a by, while keeping the other columns. But maybe my point does not hold, in that keepcols = true might as well be added to by.
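The window-function example, sketched (ordinalrank comes from StatsBase; the first call uses the target => source => function pair order quoted here, and the second shows the shape this later took in DataFrames 1.x via transform on a GroupedDataFrame):

```julia
using StatsBase  # for ordinalrank

# the spelling discussed here:
by(iris, :Species, :SepalLengthRank => :SepalLength => ordinalrank)

# roughly equivalent in the later DataFrames 1.x API, which keeps all
# rows and columns:
transform(groupby(iris, :Species),
          :SepalLength => ordinalrank => :SepalLengthRank)
```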

nalimilan commented 5 years ago

I'm not really sure mapcols is the right operation for this. I just find apply not explicit enough. Need to think about other possibilities.

I simply meant things like by(iris, :Species, :SepalLengthRank => :SepalLength => ordinalrank), i.e. adding a new column of the same length as the old one in a by, while keeping the other columns. But maybe my point does not hold, in that keepcols = true might as well be added to by.

by and combine keep grouping columns by default. Actually there's no way to avoid them, but there's a PR to add an argument: https://github.com/JuliaLang/julia/issues/30901

tkf commented 4 years ago

Comparing with other software, our API is quite similar to Pandas, which provides three different functions (see also this summary):

  • groupby is similar to our function. However, the result allows accessing its "columns" via indexing, and calling a reduction function like mean on them computes it for each group.

Let me add that pandas' groupby is slightly different from DataFrames.jl's, as it returns an iterable of key-group pairs rather than an iterable of groups, as in DataFrames.groupby.

In [1]: import pandas
   ...:
   ...: df = pandas.DataFrame(
   ...:     {
   ...:         "a": [1, 1, 2, 2, 3, 3],
   ...:         "b": [1, 2, 3, 4, 5, 6],
   ...:     }
   ...: )
   ...:
   ...: for key, group in df.groupby("a"):
   ...:     print("key =", key)
   ...:     print("group =")
   ...:     print(group)
   ...:     print()
key = 1
group =
   a  b
0  1  1
1  1  2

key = 2
group =
   a  b
2  2  3
3  2  4

key = 3
group =
   a  b
4  3  5
5  3  6

I find this useful sometimes. Of course, with DataFrames.jl, I can pull the key out of a SubDataFrame very easily if I know the cols passed to groupby. But I feel it's worth considering having some API equivalent to the iterable returned by pandas.DataFrame.groupby, if there is nothing equivalent already.
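For reference, a sketch of the key-group iteration being asked for here, in the shape it later took (via #1908 and the related GroupedDataFrame indexing work):

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2, 3, 3], b = 1:6)
gdf = groupby(df, :a)

for (key, group) in pairs(gdf)
    @show key.a   # the grouping value, from the GroupKey
    @show group   # the SubDataFrame for this group
end
```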

(I don't know DataFrame.jl enough to say for sure there is no function that does this out-of-the-box. I hope my comment is not adding unnecessary noise.)

davidanthoff commented 4 years ago

The Query.jl way of doing this is to have a key function that returns the key of a given group. I assume that API would also work here, if key was defined for SubDataFrame?

tkf commented 4 years ago

if key was defined for SubDataFrame

Yes, that would be equivalent and nice to have.

bkamins commented 4 years ago

@tkf do you mean the functionality implemented in https://github.com/JuliaData/DataFrames.jl/pull/1908? (It should be merged in a few days and included in release v0.20.)

CC @jlumpe

tkf commented 4 years ago

Yes, exactly that. Thanks!

xiaodaigh commented 4 years ago

The Query.jl way of doing this is to have a key function that returns the key of a given group. I assume that API would also work here, if key was defined for SubDataFrame?

I found Query.jl's performance on large group-by problems to be slower. In fact, I would say much slower. As long as we can keep the performance at a good level, this would be fine with me.

tkf commented 4 years ago

@xiaodaigh Your work on extensive benchmarking of group-by is valuable to many people. But I believe this thread is about API, not implementation or performance. If you have a strong concern about Query.jl's key accessor function-based approach (which I don't get why), maybe #1908 is the right place to discuss?

xiaodaigh commented 4 years ago

about API, not implementation or performance

I am just not sure whether a certain API will lead to a slower implementation, e.g. maybe an iteration-based API will be slower?

davidanthoff commented 4 years ago

Having a key function (or something like it) is not tied to an iterator based implementation at all.

bkamins commented 4 years ago

Additional to-do / to-decide before 1.0 release:

bkamins commented 4 years ago

I am closing this as I do not see anything not covered elsewhere (or not fitting the current design). If you feel something is missing then please open a new issue (possibly giving a link to this one for past reference).