SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.03k stars 139 forks source link

DataFrame#vector_sum should ignore nils #391

Closed parthm closed 6 years ago

parthm commented 6 years ago

Daru::DataFrame#vector_sum makes the sum nil if the row contains a nil. It may be worth while ignoring the nil entries for sum.

Note that the last row below shows the sum as nil instead of 5.

irb(main):001:0> df
=> limited output
 #<Daru::DataFrame(11x4)>
      v0  v1  v2  v3
   0  -5  11  21  31
   1  -4  12  22  32
   2  -3  13  23  33
   3  -2  14  24  34
   4  -1  15  25  35
   5   0  16  26  36
   6   1  17  27  37
   7   2  18  28  38
   8   3  19  29  39
   9   4  20  30  40
  10   5 nil nil nil
irb(main):002:0> df[:vx] = df.vector_sum
=> limited output
irb(main):003:0> df
=> limited output
 #<Daru::DataFrame(11x5)>
      v0  v1  v2  v3  vx
   0  -5  11  21  31  58
   1  -4  12  22  32  62
   2  -3  13  23  33  66
   3  -2  14  24  34  70
   4  -1  15  25  35  74
   5   0  16  26  36  78
   6   1  17  27  37  82
   7   2  18  28  38  86
   8   3  19  29  39  90
   9   4  20  30  40  94
  10   5 nil nil nil nil
irb(main):004:0>

Vector#sum ignores nil and presents the sum correctly. DataFrame#vector_sum ignoring nil would make the behavior more consistent. Vector#sum below ignores the 10th entry which is nil.

irb(main):010:0> v
=> limited output
 #<Daru::Vector(11)>
      v1
   0  11
   1  12
   2  13
   3  14
   4  15
   5  16
   6  17
   7  18
   8  19
   9  20
  10 nil
irb(main):011:0> v.sum
=> limited output
 155
parthm commented 6 years ago

Looking into the tests, there seems to be a specific test to ensure that the sum is nil if the row contains any nils. So, I assume this is intended behavior at least for some use cases.

It may be worth going the pandas.DataFrame.sum route and having a skipnil flag that defaults to false. This would avoid breaking compatibility and also allow for the new behavior.

That said, if there aren't many folks out there counting on the old vector_sum behavior, if may be good to make it more consistent with Vector#sum and make skipnil: true the default. It would be nice if Vector#sum and vector_sum behaved the same way by default. Eventually it may make sense to have a skipnil flag in Vector#sum as well.

v0dro commented 6 years ago

Works. I like approach with flags.

baarkerlounger commented 6 years ago

Think this can be closed now the above PR has been merged?