SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

df.group_by returns dataframe with multi index #330

Closed Shekharrajak closed 7 years ago

Shekharrajak commented 7 years ago

Fixes #152 Part 1

Example:

irb(main):001:0> df = Daru::DataFrame.new(
irb(main):002:1*   {employee: %w[John Jane Mark John Jane Mark],
irb(main):003:2*   month: %w[June June June July July July],
irb(main):004:2*   salary: [1000, 500, 700, 1200, 600, 600]}
irb(main):005:1> )
=> #<Daru::DataFrame(6x3)>
          employee    month   salary
        0     John     June     1000
        1     Jane     June      500
        2     Mark     June      700
        3     John     July     1200
        4     Jane     July      600
        5     Mark     July      600
irb(main):006:0> df.group_by(:employee)
=> #<Daru::DataFrame(6x2)>
                month salary
   Jane      1   June    500
             4   July    600
   John      0   June   1000
             3   July   1200
   Mark      2   June    700
             5   July    600
irb(main):007:0> 

irb(main):007:0> d2 = Daru::Index.new [100, 99, 101, 1, 2,3]
=> #<Daru::Index(6): {100, 99, 101, 1, 2, 3}>
irb(main):008:0> df = Daru::DataFrame.new(
irb(main):009:1*   {employee: %w[John John Jane Jane Mark Mark],
irb(main):010:2*   month: %w[June June June July July July],
irb(main):011:2*   salary: [1000, 500, 700, 1200, 600, 600]}, index: d2
irb(main):012:1> )
=> #<Daru::DataFrame(6x3)>
          employee    month   salary
      100     John     June     1000
       99     John     June      500
      101     Jane     June      700
        1     Jane     July     1200
        2     Mark     July      600
        3     Mark     July      600
irb(main):013:0> df.group_by(:employee)
=> #<Daru::DataFrame(6x2)>
                month salary
   Jane    101   June    700
             1   July   1200
   John    100   June   1000
            99   June    500
   Mark      2   July    600
             3   July    600
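
The two-level index in the output above pairs each group key with the row's original index. A minimal plain-Ruby sketch of that pairing follows. It does not use daru and is not the PR's implementation; `rows` and `multi_index_tuples` are hypothetical names chosen for illustration.

```ruby
# Plain-Ruby illustration only; this is NOT daru's implementation.
# Each row carries its original index so it can become the second index level.
rows = [
  { index: 100, employee: 'John', month: 'June', salary: 1000 },
  { index: 99,  employee: 'John', month: 'June', salary: 500 },
  { index: 101, employee: 'Jane', month: 'June', salary: 700 },
  { index: 1,   employee: 'Jane', month: 'July', salary: 1200 },
  { index: 2,   employee: 'Mark', month: 'July', salary: 600 },
  { index: 3,   employee: 'Mark', month: 'July', salary: 600 }
]

# Hypothetical helper: build [group_key, original_index] tuples, with groups
# sorted by key and the original row order preserved inside each group.
def multi_index_tuples(rows, key)
  rows.group_by { |row| row[key] }
      .sort_by { |group_key, _members| group_key }
      .flat_map { |group_key, members| members.map { |row| [group_key, row[:index]] } }
end

tuples = multi_index_tuples(rows, :employee)
tuples.each { |tuple| p tuple }
# ["Jane", 101]
# ["Jane", 1]
# ["John", 100]
# ["John", 99]
# ["Mark", 2]
# ["Mark", 3]
```

The tuples come out in the same order as the index column of the grouped output shown above.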
Shekharrajak commented 7 years ago

There is still a lot of work to do in this PR, but it would be good if someone could review the approach.

Shekharrajak commented 7 years ago

Behaviour on multiple grouping after this commit https://github.com/SciRuby/daru/pull/330/commits/5dcca795868d6dde4936bfb2b0b6846a75c00b22

# using the first example above

irb(main):007:0> df.group_by([:employee, :month])
=> #<Daru::DataFrame(6x1)>
                      salary
   Jane   July      4    600
          June      1    500
   John   July      3   1200
          June      0   1000
   Mark   July      5    600
          June      2    700

#------

irb(main):009:0> df.group_by([:employee, :month, :salary])
=> #<Daru::DataFrame(6x0)>
 Jane July  600    4
      June  500    1
 John July 1200    3
      June 1000    0
 Mark July  600    5
      June  700    2
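
The multi-column outputs above can be read as follows: every grouping column becomes an index level, the original row index becomes the last level, and only the ungrouped columns remain as data. A plain-Ruby sketch under that reading (not daru code; `group_tuples` and `remaining_columns` are hypothetical helpers):

```ruby
# Plain-Ruby illustration only; this is NOT daru's implementation.
columns = %i[employee month salary]
rows = [
  { employee: 'John', month: 'June', salary: 1000 },
  { employee: 'Jane', month: 'June', salary: 500 },
  { employee: 'Mark', month: 'June', salary: 700 },
  { employee: 'John', month: 'July', salary: 1200 },
  { employee: 'Jane', month: 'July', salary: 600 },
  { employee: 'Mark', month: 'July', salary: 600 }
]

# Hypothetical helper: one tuple per row, made of the grouping values followed
# by the row's position, sorted by the grouping values.
def group_tuples(rows, keys)
  rows.each_with_index
      .map { |row, i| [*row.values_at(*keys), i] }
      .sort_by { |tuple| tuple[0...-1] }
end

# Hypothetical helper: the columns left over once the grouping columns
# have moved into the index.
def remaining_columns(columns, keys)
  columns - keys
end

two_keys = group_tuples(rows, %i[employee month])
p two_keys.first                                   # ["Jane", "July", 4]
p remaining_columns(columns, %i[employee month])   # [:salary]

three_keys = group_tuples(rows, %i[employee month salary])
p three_keys.first                                 # ["Jane", "July", 600, 4]
p remaining_columns(columns, columns)              # []
```

Grouping by all three columns leaves no data columns, which is consistent with the `6x0` shape reported above.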
zverok commented 7 years ago

Sorry for the delay, busy times. Reviewing it now.

Shekharrajak commented 7 years ago

I think the PR is complete now. @zverok, please have a look. Thanks.

zverok commented 7 years ago

Now reviewing it FOR REALZ :)

Shekharrajak commented 7 years ago

Okay, no problem :)

zverok commented 7 years ago

Well, first of all I'd like to know what is your further plan for following the GitHub issue? Because dataframe.group_by(...).df looks weird to me, I'd like to know whether it is some "temporary" solution, or whether you believe it is the best API possible for the task. Our further actions depend on the answer to this question.

zverok commented 7 years ago

@Shekharrajak I've asked:

Well, first of all I'd like to know what is your further plan for following the GitHub issue?

I really need an answer for further discussion, not for "examining" you or something. We need some common plan, the issue is really important and your work could have a great impact!

Shekharrajak commented 7 years ago

@zverok, I haven't thought about it too much, but when I opened this PR I tried to keep in mind that this is just part 1; there is another part (DataFrame#summarize) still to fix.

I understand that dataframe.group_by(...).df looks weird, but here I am only displaying the resulting dataframe for a better user experience. The user can still perform other group_by operations such as count, mean, etc., since group_by still returns a GroupBy object (it is just displayed as a dataframe). So the old test cases still pass.

irb(main):006:0> df.group_by(:employee)
=> #<Daru::DataFrame(6x2)>
                month salary
   Jane      1   June    500
             4   July    600
   John      0   June   1000
             3   July   1200
   Mark      2   June    700
             5   July    600
irb(main):007:0> df.group_by(:employee).class
=> Daru::Core::GroupBy
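
The idea described here (an object that prints like a table in the console but still behaves as a grouping object) can be sketched in plain Ruby. `GroupedView` is a hypothetical stand-in written for illustration; it is not `Daru::Core::GroupBy` and makes no claim about how daru implements this.

```ruby
# Hypothetical stand-in; this is NOT Daru::Core::GroupBy.
class GroupedView
  def initialize(rows, key)
    # Groups sorted by key; row order preserved inside each group.
    @groups = rows.group_by { |row| row[key] }
                  .sort_by { |group_key, _members| group_key }
                  .to_h
  end

  # Aggregations stay available on the grouping object itself.
  def count
    @groups.transform_values(&:size)
  end

  def mean(column)
    @groups.transform_values do |members|
      members.sum { |row| row[column] }.fdiv(members.size)
    end
  end

  # irb calls #inspect to display a result, so overriding it changes what the
  # console shows without changing the object's class.
  def inspect
    lines = @groups.flat_map do |group_key, members|
      members.map { |row| "#{group_key} #{row[:index]} #{row[:month]} #{row[:salary]}" }
    end
    lines.join("\n")
  end
end

rows = [
  { index: 0, employee: 'John', month: 'June', salary: 1000 },
  { index: 1, employee: 'Jane', month: 'June', salary: 500 },
  { index: 2, employee: 'Mark', month: 'June', salary: 700 },
  { index: 3, employee: 'John', month: 'July', salary: 1200 },
  { index: 4, employee: 'Jane', month: 'July', salary: 600 },
  { index: 5, employee: 'Mark', month: 'July', salary: 600 }
]

view = GroupedView.new(rows, :employee)
puts view.inspect        # table-like text, first line: "Jane 1 June 500"
p view.class             # GroupedView
p view.count             # {"Jane"=>2, "John"=>2, "Mark"=>2}
p view.mean(:salary)     # {"Jane"=>550.0, "John"=>1100.0, "Mark"=>650.0}
```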

For Part 2: Add DataFrame#summarize for multi-indexed DFs. I plan to use this resulting dataframe for the summarize method.

I also need to understand why many test cases are sorted (https://github.com/SciRuby/daru/issues/324). When I remove the sorting I get errors, mostly because only the ordering has changed.
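
As a general Ruby illustration of how ordering-only failures can arise (an assumption about the likely cause, not something confirmed in this thread or in issue #324): `Enumerable#group_by` keeps keys in first-appearance order, so an expectation written in sorted order only matches after sorting.

```ruby
# Plain-Ruby illustration; the link to daru's specs is an assumption.
employees = %w[John Jane Mark John Jane Mark]

# Keys appear in the order they are first seen in the data.
by_appearance = employees.each_index.group_by { |i| employees[i] }
p by_appearance.keys   # ["John", "Jane", "Mark"]

# Sorting by key gives the order shown in the grouped output in this thread.
sorted = by_appearance.sort_by { |name, _indexes| name }.to_h
p sorted.keys          # ["Jane", "John", "Mark"]

# Hash equality ignores key order, but anything compared as an ordered
# sequence (arrays of keys, index tuples) does not.
p by_appearance == sorted             # true
p by_appearance.to_a == sorted.to_a   # false
```

A spec that compares ordered output would therefore fail on the unsorted result even though the groups themselves are identical.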

zverok commented 7 years ago

OK, let's do it this way: I'll merge the current work into master in its current state, with the expectation that you'll do "part 2" too, and we'll resolve all the doubtful points at that stage, OK?

Shekharrajak commented 7 years ago

@zverok Okay, I will make this PR better. I will push a few more commits soon. Thanks.

zverok commented 7 years ago

@Shekharrajak no, you've missed my idea :) Let's merge this PR right now, and then do a separate PR with "part 2" of the issue -- and spend some time on cleanup in that second one. WDYT?

Shekharrajak commented 7 years ago

@zverok, fine. I will open a new PR in a few days with the proper code.

zverok commented 7 years ago

:+1: Merged this one, then. Thanks.