SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.03k stars 139 forks source link

[Bug] Daru::Dataframe#aggregate doesn't work as expected #453

Closed paisible-wanderer closed 6 years ago

paisible-wanderer commented 6 years ago

Hello,

example: (fictional) spending of a startup

data = [
  [2010, 'dev', 50],
  [2010, 'dev', 150],
  [2010, 'dev', 200],

  [2011, 'dev', 50],
  [2012, 'dev', 150],

  [2011, 'office', 300],

  [2010, 'market',  50],
  [2011, 'market', 500],
  [2012, 'market', 500],
  [2012, 'market', 300],

  [2012, 'R&D', 10],
]
spending_df = Daru::DataFrame.rows(data, order: [:year, :category, :spending])
spending_df[:nb_spending] = spending_df[:spending].recode { 1 }

spending_df
 # => #<Daru::DataFrame(11x4)>
 #                  year   category   spending nb_spendin
 #          0       2010        dev         50          1
 #          1       2010        dev        150          1
 #          2       2010        dev        200          1
 #          3       2011        dev         50          1
 #          4       2012        dev        150          1
 #          5       2011     office        300          1
 #          6       2010     market         50          1
 #          7       2011     market        500          1
 #          8       2012     market        500          1
 #          9       2012     market        300          1
 #         10       2012        R&D         10          1 

ISSUE

# NOT working:
spending_df.group_by([:year, :category]).aggregate(spending: :sum)

# working:
spending_df.group_by([:year, :category]).sum

# Expected result -- we should have
spending_df.group_by([:year, :category]).aggregate(spending: :sum) == spending_df.group_by([:year, :category]).sum

 Origin?

PR #340 was likely merged too soon

This test doesn't really check multi-indexed aggregation (cf same number of rows before/after the group_by+aggregation)

Suggestion

It is possible to optimize groupping and aggregation but getting away with the intermediary dataframe generated by Daru::Dataframe#group_by

I propose to add Daru::Dataframe#group_by_and_aggregate.

FIX

I have a PR ready : #454

It builds on and refactors PR #330 and PR #340.

I think it is also a continuation/improvement along #152.

zverok commented 6 years ago

Fixed with #454, will be released soon.