fast vectorized summaries

To program plyrmr efficiently in the face of small groups one has the option to go with the vectorized reduce option. Then each call to the reducer will get in input multiple groups. In plyrmr this takes the form of a data frame with multiple groups. The problem is to process such data frame. If one goes for a split or some such, it is as slow as the non-vectorized mode because the split creates many small data frames and is normally followed by a lapply which calls an interpreted R function. So no point going down that path.

The idea is to provide a few predefined fast vectorized reducers. The specific form of this idea is to leverage explored in this issue is to use dplyr to help with that. dplyr has a system for handling summaries that sidesteps the interpreter for simple functions (vaguely described as handlers by the dplyr crew)

The general transformation of a non vectorized reduce to a vectorized one could look like:

plyrmr::do(plyrmr::group(input(path), vars), f) => 
plyrmr::do(plyrmr::group(input(path), vars), function(x) dplyr::do(dplyr::group_by(x, vars), f))

Unfortunately, the semantics of dplyr::do is different from plyrmr:do (it returns lists, weird) and described by @romain as work in progress. So the only dplyr function we can use is summarize AFAICT. So the above transformation would become

plyrmr::summarize(plyrmr::group(input(path), vars), exp) =>
plyrmr::summarize(plyrmr::group(input(path), vars), function(x) dplyr::summarize(dplyr::group_by(x, vars), f))

I am not totally clear what the generality of the handler mechanism is. These are some experiments:

> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), mysum(x)))
   user  system elapsed 
  0.201   0.012   0.211 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), mysum(x)))
   user  system elapsed 
  0.007   0.001   0.008 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), sum(x)))
   user  system elapsed 
  0.090   0.005   0.094 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), sum(x)))
   user  system elapsed 
  0.006   0.000   0.006 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x), median(x)))
   user  system elapsed 
  2.589   0.009   2.597 
> system.time(dplyr::summarize(dplyr::group_by(data.frame(x = 1:10^5), x%%2), median(x)))
   user  system elapsed 
  0.008   0.001   0.008

We are contrasting the performance on two vs 10^5 groups, while keeping the amount of data fixed. mysum is just function(x) sum(x)

Quick design document https://gist.github.com/piccolbo/e94e8633304401291ae6 Work ongoing in vectorized-groups branch

12X speed up

system.time(as.data.frame(summarise(group(transmute(input(text), words = unlist(strsplit(lines, " ")), count =1 ), words), count = sum(count)))) user system elapsed 0.458 0.017 0.477 system.time(as.data.frame(transmute(group(transmute(input(text), words = unlist(strsplit(lines, " ")), count =1 ), words), count = sum(count), .mergeable = T))) user system elapsed 6.153 0.029 6.184

summarise was defined with

magic.wand(summarise, add.envir.arg = T, non.standard.args = T, mergeable = T, vectorized = T)

Performance is still not good 20kw/second on simulated zipf distributed text, randomly permuted. I am sure one can do better in rmr. From profiling data the big offender is gc. It could be that there's many more layers of function calls in plyrmr.

RevolutionAnalytics / plyrmr

fast vectorized summaries #22