SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Aggregate DataFrame via Summarise Function [HELP Wanted] #458

Open vanitu opened 6 years ago

vanitu commented 6 years ago

Hi, Daru community.

I was looking for a simple way to summarise a DataFrame using a customisable aggregation function for each new Vector, but couldn't find a flexible solution.

Sometimes you need to apply different aggregations to different columns. The idea comes from R's dplyr, where you can run summarise on grouped data.

Here is a short example, which I think is mostly self-explanatory. It lets you quickly run several different aggregations:

df => #<Daru::DataFrame(8x4)>
           a     b     c     d
     0   foo   one     1    11
     1   bar   one     2    22
     2   foo   two     3    33
     3   bar three     1    44
     4   foo   two     3    55
     5   bar   two     6    66
     6   foo   one     3    77
     7   foo three     8    88

#proposed notation
summary = df.group_by(:a).summarise_with(
  avg_d: [:mean, :d],
  sum_c: [:sum, :c],
  avg_of_c: [:mean, :c],
  size_b_with_lambda: ->(grouped) { grouped[:b].size },
  uniq_b_with_proc: proc { |grouped| grouped[:b].uniq.size }
)

#Result
=> #<Daru::DataFrame(2x5)>
                 avg_d      sum_c   avg_of_c size_b_wit uniq_b_wit
        bar       44.0          9        3.0          3          3
        foo       52.8         18        3.6          5          3

I also implemented this in the piece of code below, but I'm not sure whether such a function already exists somewhere in Daru.


class Daru::Core::GroupBy
  def summarise_with(**aggregations)
    super_hash = groups.map {|n, _| [n, {}]}.to_h
    groups.keys.each do |group_name|
      group_data = get_group(group_name)
      aggregations.each do |new_vector, opts|
        aggregation, vector = Array(opts)
        to_aggregate = group_data.has_vector?(vector) ? group_data[vector] : group_data
        super_hash[group_name][new_vector] = if aggregation.is_a?(Proc)
                                               aggregation.call(to_aggregate)
                                             else
                                               to_aggregate.send(aggregation)
                                             end
      end
    end
    Daru::DataFrame.new(super_hash.values, index: super_hash.keys)
  end
end
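
To make the dispatch in the proposal easier to follow without Daru installed, here is a minimal plain-Ruby sketch of the same idea. It is not Daru's API: `summarise_rows` and `REDUCERS` are hypothetical names, the data is an array of row hashes rather than a DataFrame, and each aggregation is either a `[reducer, column]` pair or a callable that receives the whole group.

```ruby
# Plain-Ruby stand-in for the DataFrame in the example: one hash per row.
ROWS = [
  { a: 'foo', b: 'one',   c: 1, d: 11 },
  { a: 'bar', b: 'one',   c: 2, d: 22 },
  { a: 'foo', b: 'two',   c: 3, d: 33 },
  { a: 'bar', b: 'three', c: 1, d: 44 },
  { a: 'foo', b: 'two',   c: 3, d: 55 },
  { a: 'bar', b: 'two',   c: 6, d: 66 },
  { a: 'foo', b: 'one',   c: 3, d: 77 },
  { a: 'foo', b: 'three', c: 8, d: 88 }
].freeze

# Hypothetical named reducers (assumption: only :sum and :mean are needed here).
REDUCERS = {
  sum:  ->(values) { values.sum },
  mean: ->(values) { values.sum.to_f / values.size }
}.freeze

# Groups rows by `key`, then builds one summary hash per group.
# Each aggregation is either [reducer_name, column] or a Proc taking the group.
def summarise_rows(rows, key, **aggregations)
  rows.group_by { |row| row[key] }.sort_by { |name, _| name }.to_h do |name, group|
    summary = aggregations.to_h do |new_column, spec|
      value = if spec.is_a?(Proc)
                spec.call(group)
              else
                reducer, column = spec
                REDUCERS.fetch(reducer).call(group.map { |row| row[column] })
              end
      [new_column, value]
    end
    [name, summary]
  end
end

summary = summarise_rows(
  ROWS, :a,
  avg_d: [:mean, :d],
  sum_c: [:sum, :c],
  uniq_b: ->(group) { group.map { |row| row[:b] }.uniq.size }
)
```

Here `summary` maps each group name (`'bar'`, `'foo'`) to a hash of the new columns, which is the same shape of data that `summarise_with` hands to `Daru::DataFrame.new` above.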

info-rchitect commented 6 years ago

@vanitu agreed

info-rchitect commented 6 years ago

Hi,

I have a Dataset class that wraps Daru and delegates to it much of the time. This allows me to create business-specific transforms that would not be suitable for open source. I wrote the following summarize transform. If folks like the API, I could try to turn it into an enhancement.

wafer_median_isats = dset.summarize(/isat/, group_by: [:lot_id, :waf_num], stats: :median)

The first argument selects the columns to summarize, the group_by argument names the grouping columns, and the stats argument can be a single stat or an array of stats (e.g. [:median, :mean]). The summarize method is as follows:

summarized_hash = Hash.new { |h, k| h[k] = [] }.tap do |sum_dset|
  options[:stats].each do |statistic|
    data_frame.group_by(group_columns).each_group do |dframe|
      unless dframe[0].respond_to? statistic
        puts "Cannot summarize by stat '#{statistic}'!"
        fail
      end
      dframe.each_vector_with_index do |vec, col_name|
        if group_columns.include? col_name
          sum_dset[col_name] << vec[0]
        elsif summarize_columns.include? col_name
          sum_dset["#{col_name}_#{statistic}".to_sym] << vec.send(statistic).to_f.round(4)
        end
      end
    end
  end
end

I then just instantiate a new DataFrame from the hash of arrays, summarized_hash. Does this look like the most efficient way to create a statistical summary?
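
On the efficiency question, one possible restructuring is to group once and compute every requested statistic per group, instead of regrouping once per statistic. As posted, the loop also appears to append the group-key values once per statistic, so the key columns would end up longer than the stat columns when several stats are requested. The plain-Ruby sketch below illustrates the single-pass shape. It does not use Daru; `summarize_rows`, `STATS`, the explicit `columns:` list (in place of the regex) and the sample values are all hypothetical.

```ruby
# Hypothetical statistic table (assumption: only :median and :mean are needed).
STATS = {
  median: lambda do |values|
    sorted = values.sort
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end,
  mean: ->(values) { values.sum.to_f / values.size }
}.freeze

# Builds a hash of arrays: one entry per group for every key and stat column.
def summarize_rows(rows, columns:, group_by:, stats:)
  stats = Array(stats) # accept a single stat or an array of stats
  unknown = stats - STATS.keys
  raise ArgumentError, "Cannot summarize by stat(s) #{unknown.inspect}" unless unknown.empty?

  summary = Hash.new { |hash, key| hash[key] = [] }
  rows.group_by { |row| row.values_at(*group_by) }.each do |group_key, group|
    # Record the group keys exactly once per group.
    group_by.zip(group_key) { |column, value| summary[column] << value }
    columns.each do |column|
      values = group.map { |row| row[column] }
      stats.each do |stat|
        summary[:"#{column}_#{stat}"] << STATS.fetch(stat).call(values).to_f.round(4)
      end
    end
  end
  summary
end

# Made-up sample rows, for illustration only.
rows = [
  { lot_id: 'A', waf_num: 1, isat: 1.0 },
  { lot_id: 'A', waf_num: 1, isat: 3.0 },
  { lot_id: 'A', waf_num: 1, isat: 2.0 },
  { lot_id: 'A', waf_num: 2, isat: 4.0 },
  { lot_id: 'A', waf_num: 2, isat: 6.0 }
]

summarized_hash = summarize_rows(
  rows, columns: [:isat], group_by: %i[lot_id waf_num], stats: %i[median mean]
)
```

The resulting hash of arrays has equal-length columns, so it could be passed to a DataFrame constructor in the same way as the original `summarized_hash`.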

paisible-wanderer commented 6 years ago

@vanitu I believe you should use DataFrame#aggregate

df.group_by(:a).aggregate(
  avg_d:    ->(df) { df[:d].mean },
  sum_c:    ->(df) { df[:c].sum },
  avg_of_c: ->(df) { df[:c].mean },
  size_b_with_lambda: ->(grouped) { grouped[:b].size },
  uniq_b_with_proc: proc { |grouped| grouped[:b].uniq.size }
)
=> #<Daru::DataFrame(2x5)>
                 avg_d      sum_c   avg_of_c size_b_wit uniq_b_wit
        bar       44.0          9        3.0          3          3
        foo       52.8         18        3.6          5          3