Open vanitu opened 6 years ago
@vanitu agreed
Hi,
I have a Dataset class that wraps Daru and delegates to it most of the time. This lets me create business-specific transforms that would not be suitable for open source. I wrote the following summarize transform. If folks like the API, I could try to turn it into an enhancement.
wafer_median_isats = dset.summarize(/isat/, group_by: [:lot_id, :waf_num], stats: :median)
The first argument selects the columns to summarize, the group_by argument names the grouping columns, and the stats argument can be a single stat or an array of stats (e.g. [:median, :mean]). The summarize method is as follows:
summarized_hash = Hash.new { |h, k| h[k] = [] }.tap do |sum_dset|
  options[:stats].each do |statistic|
    data_frame.group_by(group_columns).each_group do |dframe|
      unless dframe[0].respond_to? statistic
        puts "Cannot summarize by stat '#{statistic}'!"
        fail
      end
      dframe.each_vector_with_index do |vec, col_name|
        if group_columns.include? col_name
          sum_dset[col_name] << vec[0]
        elsif summarize_columns.include? col_name
          sum_dset["#{col_name}_#{statistic}".to_sym] << vec.send(statistic).to_f.round(4)
        end
      end
    end
  end
end
I then just instantiate a new DataFrame using the hash of arrays 'summarized_hash'. Does this look like the most efficient way to create a statistical summary?
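To make the intended semantics of the call above concrete, here is a minimal plain-Ruby sketch that does not use Daru at all. It assumes the data is an array of row hashes. The names `summarize_rows` and `STATS` are hypothetical and only mirror the `summarize(/isat/, group_by: ..., stats: ...)` call shown earlier; they are not part of Daru.

```ruby
# Plain-Ruby sketch of a grouped summary (no Daru required).
# STATS and summarize_rows are hypothetical names used for illustration.
STATS = {
  mean: ->(xs) { xs.sum.to_f / xs.size },
  median: ->(xs) {
    sorted = xs.sort
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
  }
}.freeze

def summarize_rows(rows, pattern, group_by:, stats:)
  stat_names = Array(stats) # accept a single stat or an array of stats
  unknown = stat_names - STATS.keys
  raise ArgumentError, "Cannot summarize by stat(s) #{unknown.inspect}" unless unknown.empty?

  rows.group_by { |row| row.values_at(*group_by) }.map do |key, members|
    summary = group_by.zip(key).to_h
    # Summarize only non-grouping columns whose name matches the pattern.
    columns = (members.first.keys - group_by).select { |name| name.to_s.match?(pattern) }
    columns.each do |col|
      values = members.map { |row| row[col] }
      stat_names.each do |stat|
        summary[:"#{col}_#{stat}"] = STATS.fetch(stat).call(values).round(4)
      end
    end
    summary
  end
end

rows = [
  { lot_id: 'A', waf_num: 1, isat_n: 1.0, isat_p: 4.0 },
  { lot_id: 'A', waf_num: 1, isat_n: 3.0, isat_p: 6.0 },
  { lot_id: 'A', waf_num: 2, isat_n: 5.0, isat_p: 8.0 }
]

summary = summarize_rows(rows, /isat/, group_by: %i[lot_id waf_num], stats: %i[median mean])
```

Each element of `summary` is one row of the result: the grouping values plus one `<column>_<stat>` entry per summarized column and statistic. Unlike the loop above, this version groups once and applies every statistic to each group, rather than regrouping for each statistic.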
@vanitu I believe you should use DataFrame#aggregate
df.group_by(:a).aggregate(
  avg_d: ->(df) { df[:d].mean },
  sum_c: ->(df) { df[:c].sum },
  avg_of_c: ->(df) { df[:c].mean },
  size_b_with_lambda: ->(grouped) { grouped[:b].size },
  uniq_b_with_proc: proc { |grouped| grouped[:b].uniq.size }
)
=> #<Daru::DataFrame(2x5)>
          avg_d      sum_c   avg_of_c size_b_wit uniq_b_wit
  bar      44.0          9        3.0          3          3
  foo      52.8         18        3.6          5          3
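If the same statistics are wanted for several columns, the hash of lambdas could be generated rather than typed out. The sketch below is plain Ruby; `build_aggregations` is a hypothetical helper, not a Daru method, and it assumes each grouped column responds to the statistic name (for example `sum` or `mean`). Passing the result to `aggregate` as in the example above is an assumption and has not been checked against a particular Daru version.

```ruby
# Hypothetical helper: builds { "<column>_<stat>": lambda } pairs in the
# shape used by the aggregate example above.
def build_aggregations(columns, stats)
  columns.product(Array(stats)).to_h do |col, stat|
    # Each lambda receives one group and calls the statistic on its column.
    [:"#{col}_#{stat}", ->(group) { group[col].public_send(stat) }]
  end
end

aggs = build_aggregations(%i[c d], %i[sum size])

# With Daru loaded, the hash would presumably be passed as:
#   df.group_by(:a).aggregate(aggs)
```

The keys follow the same `<column>_<stat>` naming as the summarize transform earlier in the thread, so the resulting column names stay predictable.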
Hi, Daru community.
I was trying to find a simple way to summarize a DataFrame using a customizable aggregation function for each new Vector, but I couldn't find a flexible solution.
Sometimes you need to apply different aggregations. The idea comes from R's dplyr, where you can run summarise on grouped data.
Here is a short example, which I think is mostly self-explanatory. It lets you quickly run different aggregations.
I have also implemented this in a piece of code, but I'm not sure whether such a function already exists somewhere in Daru.