Closed Shekharrajak closed 7 years ago
Still working on improving this PR. Any kind of suggestions are welcome. Ping @zverok
I believe the entire approach is wrong. The whole idea of #152 is to get rid of the dedicated GroupBy class. So #summarize should be implemented as a DataFrame method, working on multi-index dataframes.
So, after the problem is fully resolved, df.group_by will just immediately return a multi-index dataframe, and summarize can be used on it. Or on any other multi-index dataframe.
We are trying to have less special cases and objects, not more.
> So, after the problem is fully resolved, df.group_by will just immediately return a multi-index dataframe, and summarize can be used on it. Or on any other multi-index dataframe.
@zverok, I am using the final dataframe @df (which is the result of group_by) in the apply_method_on_colmns and apply_method_on_df methods.
It can create a multi-index dataframe as well (since it uses @keys as the dataframe index).
Example
irb(main):007:0> df.group_by(:employee, :month).summarize(salary: :sum)
=> #<Daru::DataFrame(6x1)>
            salary
 Jane  July    600
       June    500
 John  July   1200
       June   1000
 Mark  July    600
       June    700
In this case @keys is [[Jane, July], [Jane, June], [John, July], [John, June], [Mark, July], [Mark, June]].
In this example:
irb(main):006:0> df.group_by(:employee).summarize(salary: :sum)
=> #<Daru::DataFrame(3x1)>
salary
Jane 1100
John 2200
Mark 1300
@keys is [[Jane], [John], [Mark]], so it creates a single-index dataframe.
> I am using final dataframe (which is the result of group_by) @df
Yeah, I can read the code, I swear :)
But the point is that summarize should be a method available for ANY dataframe, regardless of the grouping feature. This way "grouping" and "summarizing" will become two simple, clean features, not relying on each other.
That means summarize must use only the resultant dataframe @df. It must be independent of the group_by methods and attributes. The dataframe @df itself says everything about the group_by parameters, like @keys, @non_group_vectors, etc.
The main summarize method must be defined in dataframe.rb, and group_by#summarize should call dataframe#summarize for the summary. Right? @zverok
> Main summarize method must be defined in the dataframe.rb
Right.
> and group_by#summarize should call the dataframe#summarize for the summary.
Yes. At this point, the dedicated GroupBy class should be dismissed. group_by should immediately return a grouped DataFrame, on which any method could be called (including, of course, summarize).
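To make the proposed design concrete, here is a minimal plain-Ruby sketch. It does not use Daru at all: TinyFrame, its constructor, and its two methods are hypothetical stand-ins invented for illustration. The only point it shows is the shape being argued for: grouping returns an ordinary frame whose index holds the group keys, and aggregation is a frame method that combines rows sharing an index label, so it also works on frames that were never grouped.

```ruby
# Hypothetical stand-in for a dataframe (NOT Daru's classes):
# an array of index labels plus one hash per row.
class TinyFrame
  attr_reader :index, :rows

  def initialize(index, rows)
    @index = index
    @rows = rows
  end

  # Grouping returns another TinyFrame, not a separate GroupBy object.
  # The group keys become the index; one key column gives scalar labels,
  # several key columns give tuples.
  def group_by(*columns)
    keys = rows.map do |row|
      values = row.values_at(*columns)
      values.size == 1 ? values.first : values
    end
    # Sort by key, using the row position as a tie-breaker to keep row order.
    order = keys.each_index.sort_by { |i| [keys[i], i] }
    rest = rows.map { |row| row.reject { |name, _| columns.include?(name) } }
    TinyFrame.new(keys.values_at(*order), rest.values_at(*order))
  end

  # Aggregation works on any frame: rows that share an index label are
  # combined with the given Array method (for example :sum).
  def aggregate(spec)
    groups = index.zip(rows).group_by(&:first)
    new_rows = groups.values.map do |pairs|
      spec.to_h do |column, op|
        [column, pairs.map { |_label, row| row[column] }.public_send(op)]
      end
    end
    TinyFrame.new(groups.keys, new_rows)
  end
end

frame = TinyFrame.new(
  (0..5).to_a,
  [
    { employee: 'John', month: 'June', salary: 1000 },
    { employee: 'Jane', month: 'June', salary: 500 },
    { employee: 'Mark', month: 'June', salary: 700 },
    { employee: 'John', month: 'July', salary: 1200 },
    { employee: 'Jane', month: 'July', salary: 600 },
    { employee: 'Mark', month: 'July', salary: 600 }
  ]
)

by_employee = frame.group_by(:employee).aggregate(salary: :sum)
by_both = frame.group_by(:employee, :month).aggregate(salary: :sum)

# The same method on a frame that was never grouped but has repeated labels.
repeated = TinyFrame.new(%i[a b a a c], [52, 12, 7, 17, 1].map { |n| { num: n } })
by_label = repeated.aggregate(num: :sum)
```

In this sketch nothing in aggregate knows whether the frame came from group_by, which is the decoupling being asked for.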
@Shekharrajak I believe that your example is weird as :hankey:. There shouldn't be such a thing as a 1-level MultiIndex. There shouldn't be such a thing as a MultiIndex with repeating tuples. I've added https://github.com/SciRuby/daru/issues/342 to fix it.
I think 'Aggregate' is probably a better method name than summarize for what you're doing?
> I think 'Aggregate' is probably a better method name than summarize for what you're doing?
Actually, makes sense to me! Let's rename :)
I will rename it.
There is no blank line, but Travis is showing the error. Weird!
Now the errors are fixed. Here are some examples:
irb(main):002:0> dataframe = Daru::DataFrame.new({
irb(main):003:2* employee: %w[John Jane Mark John Jane Mark],
irb(main):004:2* month: %w[June June June July July July],
irb(main):005:2* salary: [1000, 500, 700, 1200, 600, 600]})
=> #<Daru::DataFrame(6x3)>
   employee    month   salary
 0     John     June     1000
 1     Jane     June      500
 2     Mark     June      700
 3     John     July     1200
 4     Jane     July      600
 5     Mark     July      600
irb(main):006:0> dataframe.group_by([:employee]).aggregate(salary: :sum)
=> #<Daru::DataFrame(3x1)>
salary
Jane 1100
John 2200
Mark 1300
irb(main):007:0> dataframe.group_by([:employee]).aggregate(salary: :sum).index
=> #<Daru::Index(3): {Jane, John, Mark}>
irb(main):008:0> dataframe.group_by([:employee, :month]).aggregate(salary: :sum)
=> #<Daru::DataFrame(6x1)>
            salary
 Jane  July    600
       June    500
 John  July   1200
       June   1000
 Mark  July    600
       June    700
irb(main):012:0> dataframe.group_by([:employee]).aggregate(
irb(main):013:1* salary: :sum,
irb(main):014:1* month: ->(vec) { vec.to_a.join('/') })
=> #<Daru::DataFrame(3x2)>
       salary      month
 Jane    1100  June/July
 John    2200  June/July
 Mark    1300  June/July
irb(main):025:0> dataframe.group_by([:employee]).aggregate(
irb(main):026:1* salary: :sum,
irb(main):027:1* month: ->(vec) { vec.to_a.join('/') },
irb(main):028:1* mean_salary: ->(df) { df.salary.mean },
irb(main):029:1* periods: ->(df) { df.size })
=> #<Daru::DataFrame(3x4)>
       salary      month  mean_salar  periods
 Jane    1100  June/July       550.0        2
 John    2200  June/July      1100.0        2
 Mark    1300  June/July       650.0        2
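The last example mixes three kinds of entries, so a small plain-Ruby sketch of the dispatch may help. The rule used here is only inferred from the irb output above and is an assumption, not confirmed Daru behaviour: when the key names an existing column, the operation receives that column's values for the group; otherwise the lambda receives the whole group. aggregate_group and the array-of-hashes layout are hypothetical names and structures made up for this illustration.

```ruby
# rows: the rows of ONE group, each row a hash of column => value.
# spec: result column name => Symbol (method name) or lambda.
def aggregate_group(rows, spec)
  spec.to_h do |name, op|
    value =
      if rows.first.key?(name)
        # Key matches an existing column: operate on that column's values.
        column = rows.map { |row| row[name] }
        op.is_a?(Proc) ? op.call(column) : column.public_send(op)
      else
        # New column name: hand the whole group to the lambda.
        op.call(rows)
      end
    [name, value]
  end
end

jane = [
  { month: 'June', salary: 500 },
  { month: 'July', salary: 600 }
]

summary = aggregate_group(
  jane,
  salary: :sum,
  month: ->(values) { values.join('/') },
  mean_salary: ->(group) { group.sum { |row| row[:salary] } / group.size.to_f },
  periods: ->(group) { group.size }
)
```

For Jane's two rows this gives the same values as the Jane line of the table above.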
I'll review the entire solution in detail later today, but in the meantime: don't you think it's time to finally drop the GroupBy class completely? With the new methods, it has become just a useless proxy.
A few more examples (aggregate on a dataframe):
irb(main):001:0> idx = Daru::CategoricalIndex.new [:a, :b, :a, :a, :c]
=> #<Daru::CategoricalIndex(5): {a, b, a, a, c}>
irb(main):002:0> df = Daru::DataFrame.new(num: [52,12,07,17,01], index: idx)
=> #<Daru::DataFrame(5x2)>
   index  num
 0     a   52
 1     b   12
 2     a    7
 3     a   17
 4     c    1
irb(main):003:0> df.aggregate(num_100_times: ->(df) { df.num*100 })
=> #<Daru::DataFrame(5x1)>
num_100_ti
0 5200
1 1200
2 700
3 1700
4 100
irb(main):008:0> df.aggregate(num_100_times: ->(df) { df.num*100 }).class
=> Daru::DataFrame
irb(main):009:0> df.aggregate(num_100_times: ->(df) { df.num*100 }).index
=> #<Daru::Index(5): {0, 1, 2, 3, 4}>
irb(main):011:0> df.aggregate(num_100_times: ->(df) { df.num*100 }).index.to_a
=> [0, 1, 2, 3, 4]
irb(main):002:0> cat_idx = Daru::CategoricalIndex.new [:a, :b, :a, :a, :c]
=> #<Daru::CategoricalIndex(5): {a, b, a, a, c}>
irb(main):003:0> df_cat_idx = Daru::DataFrame.new({num: [52,12,07,17,01]}, index: cat_idx)
=> #<Daru::DataFrame(5x1)>
num
a 52
b 12
a 7
a 17
c 1
irb(main):004:0> df_cat_idx.aggregate(num: :sum)
=> #<Daru::DataFrame(3x1)>
num
a 76
b 12
c 1
@zverok, using the GroupBy class we are able to make a dataframe with a new index (the group_by keys are now the index). Now we are using the aggregate method on this dataframe. We can use aggregate on a DataFrame as well (see the comment above).
So I think group_by is useful for grouping; aggregate can't do those things.
> Using GroupBy class we are able to make dataframe with new index (group_by keys are now index).
Sorry, I am not sure I get it. Why can't DataFrame#group_by return just a DataFrame, instead of a GroupBy instance?..
@zverok, if we want to return the df from group_by, then we have one extra level in the MultiIndex (the original index), i.e.
irb(main):008:0> df = Daru::DataFrame.new(
irb(main):009:1* {employee: %w[John John Jane Jane Mark Mark],
irb(main):010:2* month: %w[June June June July July July],
irb(main):011:2* salary: [1000, 500, 700, 1200, 600, 600]}, index: d2
irb(main):012:1> )
=> #<Daru::DataFrame(6x3)>
     employee    month   salary
 100     John     June     1000
  99     John     June      500
 101     Jane     June      700
   1     Jane     July     1200
   2     Mark     July      600
   3     Mark     July      600
irb(main):013:0> df.group_by(:employee)
=> #<Daru::DataFrame(6x2)>
            month  salary
 Jane  101   June     700
         1   July    1200
 John  100   June    1000
        99   June     500
 Mark    2   July     600
         3   July     600
See, there are 2 levels: one is employee_name and the other is the original index [101, 1, 100, 99, 2, 3].
I am removing this extra level using remove_layer when GroupBy#aggregate is called, and then sending the result to DataFrame#aggregate.
So if we want to return GroupBy#df from DataFrame#group_by, then we must not include the 2nd level (the index level). The result would then need a CategoricalIndex when duplicate labels are present and only 1 level is left, a plain single Index when there are no duplicates, or a MultiIndex when levels.size > 1.
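The rule described in this comment can be sketched in plain Ruby. This is only an illustration of the comment's wording, not Daru's remove_layer or its index classes: drop_last_level, index_kind, and the returned symbols are hypothetical names chosen for the example.

```ruby
# Drop the last level (here: the original row index) from every index tuple.
def drop_last_level(tuples)
  tuples.map { |tuple| tuple[0...-1] }
end

# Decide which kind of index the remaining tuples would need,
# following the rule stated in the comment above.
def index_kind(tuples)
  return :multi_index if tuples.first.size > 1

  labels = tuples.flatten
  labels.uniq.size == labels.size ? :index : :categorical_index
end

# Index tuples of the grouped frame shown above: [employee, original index].
grouped = [['Jane', 101], ['Jane', 1], ['John', 100],
           ['John', 99], ['Mark', 2], ['Mark', 3]]

stripped = drop_last_level(grouped)
kind = index_kind(stripped)
```

With the six rows above, one level remains and its labels repeat, so the sketch picks the categorical case; unique labels would fall through to the plain index case.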
@zverok merge this?
OK, let's merge. I still disagree on the GroupBy object, but I'll try my hand at it myself during v1.0 preparation.
Fixes https://github.com/SciRuby/daru/issues/152 (part 2). Extension of https://github.com/SciRuby/daru/pull/330.