SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Defining group_by#aggregate #340

Closed Shekharrajak closed 7 years ago

Shekharrajak commented 7 years ago

Fixes https://github.com/SciRuby/daru/issues/152 (part 2). Extension of https://github.com/SciRuby/daru/pull/330.

Examples:

irb(main):001:0> df = Daru::DataFrame.new(
irb(main):002:1*   {employee: %w[John Jane Mark John Jane Mark],
irb(main):003:2*   month: %w[June June June July July July],
irb(main):004:2*   salary: [1000, 500, 700, 1200, 600, 600]}
irb(main):005:1> )
=> #<Daru::DataFrame(6x3)>
          employee    month   salary
        0     John     June     1000
        1     Jane     June      500
        2     Mark     June      700
        3     John     July     1200
        4     Jane     July      600
        5     Mark     July      600
irb(main):006:0> df.group_by(:employee).summarize(salary: :sum)
=> #<Daru::DataFrame(3x1)>
        salary
   Jane   1100
   John   2200
   Mark   1300
irb(main):007:0> df.group_by(:employee, :month).summarize(salary: :sum)
=> #<Daru::DataFrame(6x1)>
               salary
   Jane   July    600
          June    500
   John   July   1200
          June   1000
   Mark   July    600
          June    700
irb(main):008:0> df.group_by(:employee).summarize(
irb(main):009:1*     month: ->(vec) { vec.to_a.join('/') },
irb(main):010:1*     salary: :sum
irb(main):011:1> )
=> #<Daru::DataFrame(3x2)>
               month    salary
      Jane June/July      1100
      John June/July      2200
      Mark June/July      1300
irb(main):012:0> df.group_by(:employee).summarize(
irb(main):013:1*     salary: :sum,
irb(main):014:1*     month: ->(vec) { vec.to_a.join('/') },
irb(main):015:1*     mean_salary: ->(df) { df.salary.mean },
irb(main):016:1*     periods: ->(df) { df.size }
irb(main):017:1> )
=> #<Daru::DataFrame(3x4)>
                salary      month mean_salar    periods
       Jane       1100  June/July      550.0          2
       John       2200  June/July     1100.0          2
       Mark       1300  June/July      650.0          2

TODO :

Shekharrajak commented 7 years ago

Still working on improving this PR. Suggestions of any kind are welcome. Ping @zverok

zverok commented 7 years ago

I believe the entire approach is wrong. The whole idea of #152 is to get rid of the dedicated GroupBy class. So #summarize should be implemented as a DataFrame method that works on multi-index dataframes.

So, once the problem is fully resolved, df.group_by will just immediately return a multi-index dataframe, and summarize can be used on it. Or on any other multi-index dataframe.

We are trying to have fewer special cases and objects, not more.
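
To make the proposal concrete, here is a minimal plain-Ruby sketch of an aggregation step that looks only at index labels and knows nothing about how they were produced. It is not daru's API or implementation: the `aggregate_by_index` helper and the `[index, row]` pair representation are assumptions made purely for illustration.

```ruby
# Hypothetical stand-in for an indexed dataframe: an array of [index, row]
# pairs, where index is a single label or a tuple (array) of labels.
def aggregate_by_index(indexed_rows, reducers)
  # Rows that share an index label (or tuple) form one group.
  groups = indexed_rows.group_by { |index, _row| index }

  groups.sort_by { |index, _pairs| index }.map do |index, pairs|
    rows = pairs.map { |_index, row| row }
    # Apply each reducer (a method name such as :sum) to its column's values.
    summary = reducers.to_h do |column, reducer|
      [column, rows.map { |row| row[column] }.public_send(reducer)]
    end
    [index, summary]
  end
end

by_employee = [
  ['John', { salary: 1000 }], ['Jane', { salary: 500 }], ['Mark', { salary: 700 }],
  ['John', { salary: 1200 }], ['Jane', { salary: 600 }], ['Mark', { salary: 600 }]
]
aggregate_by_index(by_employee, { salary: :sum })
# Jane sums to 1100, John to 2200, Mark to 1300
```

Because the helper never asks where the index came from, the same call works on tuple indexes such as `['John', 'June']`, which is the decoupling argued for above.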

Shekharrajak commented 7 years ago

So, once the problem is fully resolved, df.group_by will just immediately return a multi-index dataframe, and summarize can be used on it. Or on any other multi-index dataframe.

@zverok, I am using the final dataframe (which is the result of group_by), @df, in the apply_method_on_colmns and apply_method_on_df methods.

It can create a multi-index dataframe as well (since it uses @keys as the dataframe index).

Example

irb(main):007:0> df.group_by(:employee, :month).summarize(salary: :sum)
=> #<Daru::DataFrame(6x1)>
               salary
   Jane   July    600
          June    500
   John   July   1200
          June   1000
   Mark   July    600
          June    700

In this case @keys is [[Jane, July], [Jane, June], [John, July], [John, June], [Mark, July], [Mark, June]]

In this example:

irb(main):006:0> df.group_by(:employee).summarize(salary: :sum)
=> #<Daru::DataFrame(3x1)>
        salary
   Jane   1100
   John   2200
   Mark   1300

@keys is [[Jane], [John], [Mark]], so it creates a single-index dataframe.
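
The relationship between key tuples and the kind of index can be sketched in plain Ruby. This is an illustration only, not daru's code: `group_keys` and `index_labels` are hypothetical helpers, and the rows are plain hashes rather than a Daru::DataFrame.

```ruby
# Collect the distinct, sorted key tuples for the given grouping columns.
def group_keys(rows, columns)
  rows.map { |row| columns.map { |column| row[column] } }.uniq.sort
end

# One grouping column: unwrap the 1-tuples into plain labels (single index).
# Several grouping columns: keep the tuples as they are (multi-index).
def index_labels(keys)
  keys.all? { |key| key.size == 1 } ? keys.map(&:first) : keys
end

rows = [
  { employee: 'John', month: 'June' }, { employee: 'Jane', month: 'June' },
  { employee: 'Mark', month: 'June' }, { employee: 'John', month: 'July' },
  { employee: 'Jane', month: 'July' }, { employee: 'Mark', month: 'July' }
]

index_labels(group_keys(rows, [:employee]))
# => ["Jane", "John", "Mark"]
index_labels(group_keys(rows, %i[employee month])).first
# => ["Jane", "July"]
```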

zverok commented 7 years ago

I am using the final dataframe (which is the result of group_by), @df

Yeah, I can read the code, I swear :)

But the point is that summarize should be a method available on ANY dataframe, regardless of the grouping feature. This way "grouping" and "summarizing" will become two simple, clean features that do not rely on each other.

Shekharrajak commented 7 years ago

So that means summarize must use only the resulting dataframe @df. It must be independent of the group_by methods and attributes. The dataframe @df itself says everything about the group_by parameters, such as @keys, @non_group_vectors, etc.

Right? @zverok

zverok commented 7 years ago

Main summarize method must be defined in the dataframe.rb

Right.

and group_by#summarize should call the dataframe#summarize for the summary.

Yes. At this point, the dedicated GroupBy class should be dropped. group_by should immediately return a grouped DataFrame, on which any method can be called (including, of course, summarize).

zverok commented 7 years ago

@Shekharrajak I believe that your example is weird as :hankey:. There shouldn't be such a thing as a 1-level MultiIndex. There shouldn't be such a thing as a MultiIndex with repeating tuples. I've added https://github.com/SciRuby/daru/issues/342 to fix it.

baarkerlounger commented 7 years ago

I think 'Aggregate' is probably a better method name than summarize for what you're doing?

zverok commented 7 years ago

I think 'Aggregate' is probably a better method name than summarize for what you're doing?

Actually, makes sense to me! Let's rename :)

Shekharrajak commented 7 years ago

I will rename it.

Shekharrajak commented 7 years ago

There is no blank line, but Travis is showing the error. Weird!

Shekharrajak commented 7 years ago

The errors are fixed now. Here are some examples:

irb(main):002:0> dataframe = Daru::DataFrame.new({
irb(main):003:2*       employee: %w[John Jane Mark John Jane Mark],
irb(main):004:2*       month: %w[June June June July July July],
irb(main):005:2*       salary: [1000, 500, 700, 1200, 600, 600]})
=> #<Daru::DataFrame(6x3)>
          employee    month   salary
        0     John     June     1000
        1     Jane     June      500
        2     Mark     June      700
        3     John     July     1200
        4     Jane     July      600
        5     Mark     July      600
irb(main):006:0> dataframe.group_by([:employee]).aggregate(salary: :sum) 
=> #<Daru::DataFrame(3x1)>
        salary
   Jane   1100
   John   2200
   Mark   1300
irb(main):007:0> dataframe.group_by([:employee]).aggregate(salary: :sum).index
=> #<Daru::Index(3): {Jane, John, Mark}>
irb(main):008:0> dataframe.group_by([:employee, :month]).aggregate(salary: :sum) 
=> #<Daru::DataFrame(6x1)>
               salary
   Jane   July    600
          June    500
   John   July   1200
          June   1000
   Mark   July    600
          June    700

irb(main):012:0> dataframe.group_by([:employee]).aggregate(
irb(main):013:1*         salary: :sum,
irb(main):014:1*         month: ->(vec) { vec.to_a.join('/') }) 
=> #<Daru::DataFrame(3x2)>
              salary     month
      Jane      1100 June/July
      John      2200 June/July
      Mark      1300 June/July

irb(main):025:0> dataframe.group_by([:employee]).aggregate(
irb(main):026:1*         salary: :sum,
irb(main):027:1*         month: ->(vec) { vec.to_a.join('/') },
irb(main):028:1*         mean_salary: ->(df) { df.salary.mean },
irb(main):029:1*         periods: ->(df) { df.size })
=> #<Daru::DataFrame(3x4)>
                salary      month mean_salar    periods
       Jane       1100  June/July      550.0          2
       John       2200  June/July     1100.0          2
       Mark       1300  June/July      650.0          2
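
Judging only by the examples above, a lambda keyed by an existing column name (`month`) receives that column's values, while a lambda keyed by a new name (`mean_salary`, `periods`) receives the whole group. That dispatch rule is an inference from the output, not something confirmed in this thread. A minimal plain-Ruby sketch of it, with a hypothetical `summarize_group` helper and plain hashes instead of daru objects:

```ruby
# Hypothetical helper: summarise one group of rows (an array of hashes).
def summarize_group(rows, spec)
  columns = rows.first.keys
  spec.to_h do |name, reducer|
    # Assumption: a name matching an existing column gets that column's
    # values; any other name gets the whole group of rows.
    input = columns.include?(name) ? rows.map { |row| row[name] } : rows
    # Symbols name a method to call on the input; lambdas are called with it.
    value = reducer.is_a?(Symbol) ? input.public_send(reducer) : reducer.call(input)
    [name, value]
  end
end

jane = [
  { employee: 'Jane', month: 'June', salary: 500 },
  { employee: 'Jane', month: 'July', salary: 600 }
]

summarize_group(
  jane,
  {
    salary: :sum,
    month: ->(values) { values.join('/') },
    mean_salary: ->(group) { group.sum { |row| row[:salary] } / group.size.to_f },
    periods: ->(group) { group.size }
  }
)
# => { salary: 1100, month: "June/July", mean_salary: 550.0, periods: 2 }
```

The result matches the Jane row in the last example above.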

zverok commented 7 years ago

I'll review the entire solution in detail later today, but in the meantime: don't you think it's time to finally drop the GroupBy class completely? With the new methods, it has become just a useless proxy.

Shekharrajak commented 7 years ago

A few more examples (aggregate on a dataframe):

irb(main):001:0> idx = Daru::CategoricalIndex.new [:a, :b, :a, :a, :c] 
=> #<Daru::CategoricalIndex(5): {a, b, a, a, c}>
irb(main):002:0> df = Daru::DataFrame.new(num: [52,12,07,17,01], index: idx)
=> #<Daru::DataFrame(5x2)>
       index   num
     0     a    52
     1     b    12
     2     a     7
     3     a    17
     4     c     1
irb(main):003:0> df.aggregate(num_100_times: ->(df) { df.num*100 })
=> #<Daru::DataFrame(5x1)>
            num_100_ti
          0       5200
          1       1200
          2        700
          3       1700
          4        100

irb(main):008:0>  df.aggregate(num_100_times: ->(df) { df.num*100 }).class
=> Daru::DataFrame
irb(main):009:0>  df.aggregate(num_100_times: ->(df) { df.num*100 }).index
=> #<Daru::Index(5): {0, 1, 2, 3, 4}>
irb(main):011:0>  df.aggregate(num_100_times: ->(df) { df.num*100 }).index.to_a
=> [0, 1, 2, 3, 4]

irb(main):002:0> cat_idx = Daru::CategoricalIndex.new [:a, :b, :a, :a, :c] 
=> #<Daru::CategoricalIndex(5): {a, b, a, a, c}>
irb(main):003:0> df_cat_idx = Daru::DataFrame.new({num: [52,12,07,17,01]}, index: cat_idx)
=> #<Daru::DataFrame(5x1)>
     num
   a  52
   b  12
   a   7
   a  17
   c   1
irb(main):004:0> df_cat_idx.aggregate(num: :sum)
=> #<Daru::DataFrame(3x1)>
     num
   a  76
   b  12
   c   1

Shekharrajak commented 7 years ago

@zverok, Using the GroupBy class, we are able to make a dataframe with a new index (the group_by keys are now the index). We then use the aggregate method on this dataframe. We can use aggregate on a plain DataFrame as well (see the comment above).

So I think group_by is useful for grouping; aggregate can't do that by itself.

zverok commented 7 years ago

Using the GroupBy class, we are able to make a dataframe with a new index (the group_by keys are now the index).

Sorry, I am not sure I get it. Why can't DataFrame#group_by just return a DataFrame instead of a GroupBy instance?

Shekharrajak commented 7 years ago

@zverok, if we want to return the df from group_by, then we get one extra level in the MultiIndex (the original index),

i.e.

irb(main):008:0> df = Daru::DataFrame.new(
irb(main):009:1*   {employee: %w[John John Jane Jane Mark Mark],
irb(main):010:2*   month: %w[June June June July July July],
irb(main):011:2*   salary: [1000, 500, 700, 1200, 600, 600]}, index: d2
irb(main):012:1> )
=> #<Daru::DataFrame(6x3)>
          employee    month   salary
      100     John     June     1000
       99     John     June      500
      101     Jane     June      700
        1     Jane     July     1200
        2     Mark     July      600
        3     Mark     July      600
irb(main):013:0> df.group_by(:employee)
=> #<Daru::DataFrame(6x2)>
                month salary
   Jane    101   June    700
             1   July   1200
   John    100   June   1000
            99   June    500
   Mark      2   July    600
             3   July    600

See, there are 2 levels: one is the employee name and the other is the original index [101, 1, 100, 99, 2, 3]. I remove this extra level using remove_layer when GroupBy#aggregate is called, and then pass the result to DataFrame#aggregate.

So if we want DataFrame#group_by to return GroupBy#df, we must leave out the 2nd level (the original index level) and use a CategoricalIndex when duplicate index values are present and there is only 1 level; otherwise the index would automatically be a single Index (when there are no duplicates) or a MultiIndex (when levels.size > 1).
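
The effect of dropping the original-index level can be shown with plain arrays. This is only a sketch under stated assumptions: `drop_level` and `duplicates?` are hypothetical helpers that mimic the idea, not daru's remove_layer, and the tuples are copied from the grouped output above.

```ruby
# Hypothetical helper: remove one level from every index tuple. A tuple that
# is left with a single label is unwrapped into that plain label.
def drop_level(tuples, level)
  tuples.map do |tuple|
    rest = tuple.each_with_index.reject { |_label, i| i == level }.map(&:first)
    rest.size == 1 ? rest.first : rest
  end
end

# True when at least one label occurs more than once.
def duplicates?(labels)
  labels.uniq.size != labels.size
end

# (employee, original index) tuples after grouping by :employee.
tuples = [['Jane', 101], ['Jane', 1], ['John', 100], ['John', 99], ['Mark', 2], ['Mark', 3]]

labels = drop_level(tuples, 1)
# => ["Jane", "Jane", "John", "John", "Mark", "Mark"]
duplicates?(labels)
# => true
```

With the second level gone, the remaining labels repeat, which is why an index type that tolerates duplicates is needed before aggregation collapses them.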

v0dro commented 7 years ago

@zverok merge this?

zverok commented 7 years ago

OK, let's merge. I still disagree about the GroupBy object, but I'll try my hand at it myself during the v1.0 preparation.