SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Optimize aggregation #464

Closed: paisible-wanderer closed this pull request 5 years ago

paisible-wanderer commented 5 years ago

Hello,

In continuation of #454, I have made some improvements that likely affect all of the aggregation-related code (by making things "lazy").

Below are some aggregation benchmarks for different ratios of row count to group count. (Summary: some operations are ~10 times faster.)

def gen_df(n_grs, nrows=3000)
  Daru::DataFrame.new({
    group_nb: (1..nrows).map { (1..n_grs).to_a.sample },
    value:    (1..nrows).map { rand },
  })
end

df_s = gen_df   20 # dataframe with a few groups (each having about 150 elements)
df_m = gen_df 2000
df_l = gen_df 4000 # almost every group has a single row: lots of sub-vector creation

# l for lambda
lsize = ->(sub_df) { sub_df[:value].size }
lsum  = ->(sub_df) { sub_df[:value].sum }

require 'benchmark'

# 1/ Benchmark comparing symbol-based aggregation (over vectors) with lambda-based aggregation (over sub-dataframes)
# r is for "reference"
Benchmark.bm(8) do |b|
  b.report("small r")  { 100.times { df_s.group_by(:group_nb).aggregate({group_nb: :size, value: :sum}) } }
  b.report("medium r") { 100.times { df_m.group_by(:group_nb).aggregate({group_nb: :size, value: :sum}) } }
  b.report("large r")  { 100.times { df_l.group_by(:group_nb).aggregate({group_nb: :size, value: :sum}) } }

  b.report("small")    { 100.times { df_s.group_by(:group_nb).aggregate({a: lsize, b: lsum}) } }
  b.report("medium")   { 100.times { df_m.group_by(:group_nb).aggregate({a: lsize, b: lsum}) } }
  b.report("large")    { 100.times { df_l.group_by(:group_nb).aggregate({a: lsize, b: lsum}) } }
end

# 2/ Benchmark to see whether aggregate could replace the built-in functions in group_by
# r is for "reference"
Benchmark.bm(8) do |b|
  b.report("small r")  { 300.times { df_s.group_by(:group_nb).sum } }
  b.report("small")    { 300.times { df_s.group_by(:group_nb).aggregate({value: :sum}) } }

  b.report("medium r") { 300.times { df_m.group_by(:group_nb).sum } }
  b.report("medium")   { 300.times { df_m.group_by(:group_nb).aggregate({value: :sum}) } }

  b.report("large r")  { 300.times { df_l.group_by(:group_nb).sum } }
  b.report("large")    { 300.times { df_l.group_by(:group_nb).aggregate({value: :sum}) } }
end
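
As the comments above note, aggregate accepts either a symbol (applied to a group's vector) or a lambda (called with each group's sub-dataframe). A minimal pure-Ruby analogue of that dispatch, with a hypothetical name (apply_agg is not a daru method, just an illustration):

```ruby
# Hypothetical dispatch mirroring the symbol-vs-lambda split above:
# a Symbol is sent to the group's values, a Proc is called with the group.
def apply_agg(spec, group_values)
  spec.is_a?(Symbol) ? group_values.public_send(spec) : spec.call(group_values)
end

vals = [1.0, 2.0, 3.0]
apply_agg(:sum, vals)             # => 6.0
apply_agg(->(g) { g.size }, vals) # => 3
```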

## before
# 1/
#                user     system      total        real
# small r   34.150000   0.040000  34.190000 ( 34.201153)
# medium r  67.450000   0.010000  67.460000 ( 67.408840)
# large r   64.300000   0.000000  64.300000 ( 64.294158)
# small     36.600000   0.090000  36.690000 ( 36.677260)
# medium    94.140000   0.000000  94.140000 ( 94.129418)
# large     95.100000   0.000000  95.100000 ( 95.096091)

# 2/
#                user     system      total        real
# small r   19.460000   0.000000  19.460000 ( 19.464180)
# small     38.450000   0.150000  38.600000 ( 38.599517)
# medium r  30.440000   0.000000  30.440000 ( 30.432033)
# medium    84.610000   0.000000  84.610000 ( 84.609521)
# large r   33.190000   0.000000  33.190000 ( 33.188716)
# large     96.660000   0.000000  96.660000 ( 96.658534)

## intermediary stats (at commit "Avoids intermediary dataframe creation in Daru::Core::GroupBy#aggregate")
# 1/
#                user     system      total        real
# small r    3.710000   0.000000   3.710000 (  3.710371)
# medium r  12.250000   0.000000  12.250000 ( 12.256285)
# large r   16.440000   0.000000  16.440000 ( 16.435032)
# small      5.280000   0.000000   5.280000 (  5.279285)
# medium    37.520000   0.010000  37.530000 ( 37.522483)
# large     48.070000   0.000000  48.070000 ( 48.067751)

# 2/
#                user     system      total        real
# small r   10.630000   0.000000  10.630000 ( 10.625822)
# small      9.980000   0.000000   9.980000 (  9.984697)
# medium r  19.790000   0.000000  19.790000 ( 19.789653)
# medium    31.110000   0.000000  31.110000 ( 31.102751)
# large r   20.050000   0.000000  20.050000 ( 20.049377)
# large     41.930000   0.000000  41.930000 ( 41.924821)

## stats at final commit
# 1/
#                user     system      total        real
# small r    2.200000   0.000000   2.200000 (  2.198592)
# medium r   6.540000   0.000000   6.540000 (  6.549165)
# large r    8.020000   0.000000   8.020000 (  8.017494)
# small      2.720000   0.000000   2.720000 (  2.717410)
# medium    13.550000   0.000000  13.550000 ( 13.548042)
# large     14.470000   0.000000  14.470000 ( 14.472992)

# 2/
#                user     system      total        real
# small r    6.150000   0.000000   6.150000 (  6.148050)
# small      5.440000   0.000000   5.440000 (  5.433489)
# medium r  12.580000   0.000000  12.580000 ( 12.578441)
# medium    11.770000   0.000000  11.770000 ( 11.766527)
# large r   13.430000   0.000000  13.430000 ( 13.435111)
# large     13.970000   0.010000  13.980000 ( 13.973275)
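
For scale, the speedups implied by the quoted real times for benchmark 1/ (lambda cases, "before" vs "final commit") can be computed directly:

```ruby
# Real times (seconds) copied from the "before" and "final commit"
# tables above, benchmark 1/, lambda variants.
before = { "small" => 36.677260, "medium" => 94.129418, "large" => 95.096091 }
final  = { "small" => 2.717410,  "medium" => 13.548042, "large" => 14.472992 }

speedup = before.to_h { |k, t| [k, (t / final[k]).round(1)] }
# small ~13.5x, medium ~6.9x, large ~6.6x
```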

=========================
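
For readers curious where the gains come from, here is a minimal pure-Ruby sketch (not daru's actual code; rows and index_groups are illustrative names) of the "avoid intermediary dataframe creation" idea: keep only row indices per group and fold over them, instead of materializing a sub-structure for every group.

```ruby
rows = [
  { group_nb: 1, value: 0.5 },
  { group_nb: 2, value: 0.25 },
  { group_nb: 1, value: 1.0 },
]

# Eager approach: build one sub-array per group, then reduce it.
eager = rows.group_by { |r| r[:group_nb] }
            .transform_values { |grp| grp.sum { |r| r[:value] } }

# "Lazy" approach: record row indices per group once, then aggregate by
# reading values straight from the original storage -- no per-group copies.
index_groups = Hash.new { |h, k| h[k] = [] }
rows.each_with_index { |r, i| index_groups[r[:group_nb]] << i }
lazy = index_groups.transform_values do |idxs|
  idxs.sum { |i| rows[i][:value] }
end

eager == lazy # => true ({1 => 1.5, 2 => 0.25})
```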

Please, let me know if there is anything I can improve.

paisible-wanderer commented 5 years ago

@zverok & @v0dro: ping!

v0dro commented 5 years ago

@paisible-wanderer I'm extremely sorry for this huge delay. I'll look into this PR ASAP. Just give me one more day, since there are many changes.

v0dro commented 5 years ago

OK, I just went through your changes and they're pretty amazing. I'm merging. Thank you very much for your changes, and I hope to see more like these again soon :)

And sorry again for the delay. The PR just slipped past my checks for some reason.

paisible-wanderer commented 5 years ago

@v0dro thanks!