SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

get_group: unpredictable behaviour when sorting is applied #535

Open vanitu opened 4 years ago

vanitu commented 4 years ago

When group_by is applied to a sorted DataFrame, get_group returns the wrong rows.

df = Daru::DataFrame.new(
  [
    10.times.to_a,                              # :a holds 0..9
    Array.new(10) { "b" },                      # :b is the same value in every row
    Array.new(10) { |i| i.even? ? "c" : "d" }   # :c alternates "c" and "d"
  ],
  order: [:a, :b, :c]
)

# Works properly
grouped=df.group_by([:b,:c])
grouped.get_group(["b","c"])

=> #<Daru::DataFrame(5x3)>
       a   b   c
   0   0   b   c
   2   2   b   c
   4   4   b   c
   6   6   b   c
   8   8   b   c 

# Wrong rows after a sort is applied to the DataFrame
df.sort!([:c])
grouped=df.group_by([:b,:c])
grouped.get_group(["b","c"])

=> #<Daru::DataFrame(5x3)>
       a   b   c
   0   0   b   c
   2   4   b   c
   4   8   b   c
   6   3   b   d
   8   7   b   d

vanitu commented 4 years ago

As I understand it, reindexing after sorting may help:

df.index = Daru::Index.new(Array.new(df.size) { |i| i })
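To illustrate why resetting the index might help, here is a small plain-Ruby model. It is not Daru code: `Rec`, `reindexed` and `picked` are made-up names, and it assumes (as the output above suggests) that the group lookup ends up treating stored index labels as row positions. Once the labels are reset to 0..n-1 after the sort, labels and positions coincide, so the lookup lands on the right rows.

```ruby
# Plain-Ruby model of the workaround; Rec is a hypothetical stand-in for a DataFrame row.
Rec = Struct.new(:label, :a, :c)

# Ten rows labelled 0..9; :c alternates "c" and "d" as in the report.
recs = (0...10).map { |i| Rec.new(i, i, i.even? ? "c" : "d") }

# Sorting by :c moves rows around but each row keeps its old label.
sorted = recs.sort_by { |r| [r.c, r.label] }

# The workaround: give every row a fresh label equal to its position.
reindexed = sorted.each_with_index.map { |r, pos| Rec.new(pos, r.a, r.c) }

# Grouping records the labels of the "c" rows ...
labels = reindexed.select { |r| r.c == "c" }.map(&:label)

# ... and a position-based lookup now finds exactly those rows.
picked = labels.map { |label| reindexed[label] }

p picked.map(&:a) # => [0, 2, 4, 6, 8]
```

The cost is that the original row labels are discarded, so this only suits cases where the labels carry no meaning of their own.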

bradleybuda commented 3 years ago

I'm running into a similar issue when I remove rows from a dataset with filter before calling group_by. It looks like get_group does not respect non-default row indices, so grouping only works if the rows carry the default index (zero-based, consecutive integers). I don't know the Daru internals well, but the issue appears to be here: https://github.com/SciRuby/daru/blob/v0.2.2/lib/daru/core/group_by.rb#L258-L267

The conversion of @context to elements throws away @context's original indices, and the lookups into elements.transpose assume that the indices are the defaults (i.e. 0, 1, 2, 3, ...).
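A rough plain-Ruby sketch of that failure mode, for anyone trying to follow along. This is a model under assumptions, not the code in group_by.rb: `Row`, `get_group_by_position` and `get_group_by_label` are hypothetical names. It stores index labels per group and then contrasts using them as positional offsets with resolving them back to the rows that carry them.

```ruby
# Plain-Ruby model of the suspected bug; none of this is Daru's actual implementation.
Row = Struct.new(:label, :a, :b, :c)

# Ten rows labelled 0..9, mirroring the DataFrame from the original report.
rows = (0...10).map { |i| Row.new(i, i, "b", i.even? ? "c" : "d") }

# Sorting by :c reorders the rows; labels become 0,2,4,6,8,1,3,5,7,9.
sorted = rows.sort_by { |r| [r.c, r.label] }

# Grouping records the index *labels* of each group's members.
labels_by_group = sorted.group_by { |r| [r.b, r.c] }
                        .transform_values { |members| members.map(&:label) }

# Suspected behaviour: each stored label is used as a positional offset.
def get_group_by_position(rows, labels)
  labels.map { |label| rows[label] }
end

# Label-aware lookup: each stored label is resolved to the row that carries it.
def get_group_by_label(rows, labels)
  by_label = rows.to_h { |r| [r.label, r] }
  labels.map { |label| by_label.fetch(label) }
end

labels = labels_by_group[["b", "c"]]
buggy  = get_group_by_position(sorted, labels)
fixed  = get_group_by_label(sorted, labels)

p buggy.map(&:a) # => [0, 4, 8, 3, 7]
p buggy.map(&:c) # => ["c", "c", "c", "d", "d"]
p fixed.map(&:a) # => [0, 2, 4, 6, 8]
```

The positional variant yields the same a and c values as the corrupted output in the first comment, which is at least consistent with the explanation above. The same mix-up would apply after filter, since filtered rows also keep non-consecutive labels.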