HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

runtime performance comparison: future vs. pbmclapply vs. mclapply vs. foreach #146

Closed · pat-s closed this issue 7 years ago

pat-s commented 7 years ago

Hi Henrik,

I am sure you did a performance evaluation somewhere but I cannot find it here.

I did one within parsperrorest(), which features 4 parallel modes (now also including future()).

Test case

devtools::install_github("pat-s/sperrorest@performance")

pacman::p_load(sperrorest)
data(ecuador) # Muenchow et al. (2012), see ?ecuador
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope

nspres <- parsperrorest(data = ecuador, formula = fo,
                        model.fun = glm, model.args = list(family = "binomial"),
                        pred.args = list(type = "response"),
                        smp.fun = partition.cv,
                        smp.args = list(repetition = 1:50, nfold = 2),
                        par.args = list(par.mode = 1, par.units = 2),
                        benchmark = TRUE,
                        importance = TRUE, imp.permutations = 2)
nspres$benchmarks$runtime.performance

Using two cores for every run (par.units = 2).

Results

future:     3.317604 mins
pbmcapply:  2.977363 mins
foreach:    2.74343 mins
mclapply:   2.736444 mins

I did this on my local machine (MBP 2015, 2.7 GHz, macOS 10.12.4), so other processes may have influenced the timings. Hence, the runtime values here should not be taken too seriously. However, any idea why future_lapply() performs so much worse than all the others?

HenrikBengtsson commented 7 years ago

Thanks for this nice feedback / report - it's very useful.

Reason

To answer your question: what's causing future to be slower is that it ends up chunking the *apply work into W+1 chunks while there are only W workers processing the chunks. This means that there will always be a stray left-over chunk waiting for one of the workers to complete before it can be processed. You can see this if you use options(future.debug = TRUE) - you'll see that the last chunk will "poll" for an available worker for quite some time. Since each chunk processes 1/(W+1) of the work, that's basically the extra processing-time penalty you observe. With W = 4, we get ~20% longer processing time because of this, which agrees with your benchmark results. The more cores you have, the smaller the effect, e.g. W = 2 adds 33%, W = 4 adds 20%, W = 8 adds 11%, W = 16 adds 6%, and W = 64 adds 1.5%. With more cores it is also more likely that one of the chunks finishes a bit earlier so that the last stray chunk can start a bit earlier, meaning the effect is probably even smaller the more cores you use.
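
As a quick sanity check, the percentages above are just the 1/(W+1) fraction of the total work that the stray chunk represents; a two-liner (not part of the original comment) reproduces them:

W <- c(2, 4, 8, 16, 64)      # number of workers
round(100 / (W + 1), 1)      # extra processing time in percent
#> [1] 33.3 20.0 11.1  5.9  1.5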

So, in summary, the workload balancing ("chunking") of future_lapply() is suboptimal, and it happens when you use multiprocess (multicore and multisession) futures. This was done for conservative reasons. You can read about the details in Issue #7.

Solution

Anyway, it's my intention to fix this, and I honestly forgot that it hits future_lapply() this badly (= 1/(W+1) extra time) when parallelizing on the current machine with only a few cores. Until I've fixed that in the future package, I don't think there's an easy workaround for you on your end. But since your example brought this back to my attention, I've added it to an issue that should be resolved in the next release.

PS. I like that your function automatically collects benchmark statistics. This is something I'd like to provide automatically for futures (Issue #59), so that users of futures (as well as I, as the maintainer of the future package) can optimize performance more easily.

HenrikBengtsson commented 7 years ago

This has now been fixed in the develop branch (= 1.4.0-9000, to become the next release) of the package:

SIGNIFICANT CHANGES:

 o Multicore and multisession futures no longer reserve one core for the
   main R process, which was done to lower the risk for producing a higher
   CPU load than the number of cores available for the R session.

[...]

BUG FIXES:

 o future_lapply() with multicore / multisession futures, would use a
   suboptimal workload balancing where it split up the data in one chunk too
   many.  This is no longer a problem because of how argument 'workers' is
   now defined for those type of futures (see note on top).

You can install this version using:

remotes::install_github("HenrikBengtsson/future@develop")

Results

Using a tweaked version of your benchmark script:

# devtools::install_github("pat-s/sperrorest@performance")
library("sperrorest")
library("pbmcapply")

data("ecuador") # Muenchow et al. (2012), see ?ecuador
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope

par.modes <- c(pbmclapply = 1, foreach = 2, future = 3, mclapply = 4)
for (name in names(par.modes)) {
  par.mode <- par.modes[name]
  par.units <- 2
  ## Due to SIGNIFICANT CHANGES, undo parsperrorest()'s +1 workers
  if (name == "future") par.units <- par.units - 1L
  nspres <- parsperrorest(data = ecuador, formula = fo,
                          model.fun = glm, model.args = list(family = "binomial"),
                          pred.args = list(type = "response"),
                          smp.fun = partition.cv,
                          smp.args = list(repetition = 1:50, nfold = 2),
                          par.args = list(par.mode = par.mode, par.units = par.units),
                          benchmark = TRUE,
                          importance = TRUE, imp.permutations = 2)
  dt <- nspres$benchmarks$runtime.performance
  message(sprintf("%-10s (par.mode = %d): %s", name, par.mode, format(dt)))
}

With the develop version:

pbmclapply (par.mode = 1): 2.297365 mins
foreach    (par.mode = 2): 2.130055 mins
future     (par.mode = 3): 2.20559 mins
mclapply   (par.mode = 4): 2.477031 mins

I'm closing, but please try / confirm on your end and if you're seeing something else, please feel free to re-open.

PS. If you can avoid using plan(multiprocess) in your code and instead leave it to the user, then the user will have full control over where to run the analysis. For instance, s/he may use multiple machines, e.g. plan(cluster, workers = c("machine1", "machine2.remote.org")), or a compute cluster, e.g. plan(future.batchtools::batchtools_sge). Alternatively, you could make it an option via par.args.
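
As a minimal sketch of what that looks like from the user's side (the machine names are the placeholders from the examples above; parsperrorest() itself would not call plan() at all):

library(future)
library(sperrorest)

## The user chooses the backend before running the analysis, e.g. two
## remote machines reached via SSH ...
plan(cluster, workers = c("machine1", "machine2.remote.org"))

## ... or an SGE compute cluster (requires the future.batchtools package):
# plan(future.batchtools::batchtools_sge)

## parsperrorest() then just creates futures, which are resolved wherever
## the user-selected plan says.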

pat-s commented 7 years ago

Hi Henrik,

I'm always impressed by your detailed and informative responses! A big thanks for fixing this, and I'll definitely add future as one of the parallel modes. I'm not sure yet how many options and par.modes I want to provide to the user, since the code should stay somewhat tidy.

Right now I'm thinking about replacing the numbered values of par.mode with character ones, so that the user knows directly what is being executed in the background. It would then be possible to just use par.mode = "multiprocess" or par.mode = "cluster". I think this is more informative than a numbered par.mode. Of course, I would then need an additional par.args argument which lets the user specify the plan() arguments - or just use ....
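
A rough sketch of what that interface could look like (the helper name here is hypothetical, and "multiprocess" was a valid plan at the time of this thread; current versions of future use multisession / multicore instead):

## Hypothetical helper: map a character par.mode to a future plan and
## forward any extra plan() arguments via '...'.
set_par_mode <- function(par.mode = c("multiprocess", "cluster"), ...) {
  par.mode <- match.arg(par.mode)
  future::plan(par.mode, ...)
}

## e.g.
## set_par_mode("cluster", workers = c("machine1", "machine2.remote.org"))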

I also very much like the cluster idea - although it is a niche mode, it can have a huge impact if one has the infrastructure to use it. I will dig deeper into future and consider writing a vignette for parsperrorest() explaining all the parallel modes. I'd like to ask you for feedback then, especially on the future part of course.

Regarding the performance results of parsperrorest() for the glm() example, I'm still surprised that foreach() takes the lead here - in my past performance evaluations mclapply()/pbapply() has always been significantly faster. I'll retry with different nfold values etc.

HenrikBengtsson commented 7 years ago

They all end up using forked processes internally, so if you benchmark multiple times and look at the distribution, I'm pretty sure you'll see very similar behavior across the board.
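
For example, a sketch of such a repeated benchmark, reusing the call from the original post (the number of repetitions is an arbitrary choice):

library(sperrorest)

data("ecuador")
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope

n_rep <- 5  # arbitrary; more repetitions give a clearer picture of the spread
runtimes <- replicate(n_rep, {
  res <- parsperrorest(data = ecuador, formula = fo,
                       model.fun = glm, model.args = list(family = "binomial"),
                       pred.args = list(type = "response"),
                       smp.fun = partition.cv,
                       smp.args = list(repetition = 1:50, nfold = 2),
                       par.args = list(par.mode = 1, par.units = 2),
                       benchmark = TRUE,
                       importance = TRUE, imp.permutations = 2)
  as.numeric(res$benchmarks$runtime.performance, units = "mins")
})
summary(runtimes)  # compare distributions rather than single measurements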

pat-s commented 7 years ago

FYI: https://github.com/pat-s/sperrorest/blob/dev/vignettes/parallel-modes.Rmd

future_lapply() takes the lead 👍

If you have any comments on the vignette or its wording, they are always welcome :)

HenrikBengtsson commented 7 years ago

Thanks for the feedback. That looks great.

One comment though: I'm personally very hesitant (read: conservative) about having R packages run in parallel by default. The main problem is that more and more packages come with built-in parallel processing these days, and if they all enable it by default, using two or more cores, there is an increasing risk that the number of cores used by the main R session will explode because of recursive multicore parallelization. I designed the Future API / future package to protect against this as far as I can, so you and your users should be fine since you're running everything through that. Unfortunately, we can't prevent a user of another package from running, say, mclapply(x, foo, mc.cores = parallel::detectCores()) where foo() calls your code. That would effectively use up parallel::detectCores() * future::availableCores() cores. For a user running on their own machine, this won't be a catastrophe, but when you have multiple users sharing one or more compute nodes, where users are not even aware that things run in parallel, it can become pretty bad.
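
As a back-of-the-envelope illustration of that worst case (purely illustrative, not from the original comment):

## The outer mclapply() claims all detected cores, and each nested foo()
## call in turn uses all cores that future sees:
outer <- parallel::detectCores()
inner <- future::availableCores()
outer * inner   # number of processes that could end up competing for CPUs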

I don't see an easy solution for this unless the R / CRAN community can agree on a common convention. At this time, my preference is that all packages should run sequentially ("same expectation for everyone") and the user has to actively request parallel processing. If R provided built-in protection against recursive parallelism (similar to what I try to do in the Future API), I would be less concerned. Then the only remaining question would be what the default number of cores should be.

About performance comparisons: I wouldn't draw too big conclusions about which backend is much faster than the others, because at the end of the day they're basically running the backends the same way under the hood. They differ somewhat in how they prepare / set up jobs (e.g. future does static code analysis to find global variables), so they may have somewhat different overhead in that sense, but overall they should all play in the same league when it comes to processing performance.
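
To illustrate the kind of setup work mentioned for future, here is a small standalone example using the globals package that future builds on (the expression is made up):

## future statically analyses the expression to figure out which globals
## must be exported to the worker; the 'globals' package does that analysis.
library(globals)
a <- 3.14
expr <- quote(x * a)
findGlobals(expr)   # reports the global symbols the expression relies on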

pat-s commented 7 years ago

Thanks for your comment. My current experience is that most users in the field where this package is most likely to be used (ecological modelling) are more on the applied side of modelling and do not have the knowledge to quickly set up mclapply(x, foo, mc.cores = parallel::detectCores()). Also, quite a few people asked me how to set up parallelism, not knowing that it was already built into v1.0.0. This somewhat frustrated me and brought me to the point of enabling parallel processing by default.

Before attempting anything like mclapply(x, foo, mc.cores = parallel::detectCores()), I hope that every user has read the help file / package description / vignette at least once, which should prevent 99% of these cases. In contrast to other parallel packages, which use parallelism under the hood without prominently saying so, or which mix sequential and parallel functions, I think (hope) sperrorest says enough about its parallel nature. Furthermore, there is only one workhorse function and no mix of multiple sequential/parallel workhorse functions. Nevertheless, there is no guarantee against such cases, of course.

For the moment I see more advantages than disadvantages here. Let's see how often such cases become public.

Thanks for the note on performance. Yes, foreach takes quite some time to start processing since it starts the workers sequentially, but it also provides the best progress-tracking ability. Well, every mode has its advantages and disadvantages, so the user is free to choose ;)

HenrikBengtsson commented 7 years ago

I understand - and since we all (users and developers) have different objectives, it'll be really hard to reach a consensus on what the default design strategy should be. I'm still concerned that we'll end up with a "wild wild west", because it is really hard to predict where and how a package will be used in the future. Maintainers may move on, but the package may still be actively used.

In the aroma framework, I prepare everything for parallel processing, but leave it to the user to enable it with a single line of code (http://www.aroma-project.org/howtos/parallel_processing/), e.g.

future::plan("multiprocess")

Not trying to change your and other developers' design decisions, but I'd like to bring awareness to this problem (which is already a real problem with some R packages running on multi-tenant compute environments).

pat-s commented 7 years ago

> Not trying to change your and other developers' design decisions, but I'd like to bring awareness to this problem (which is already a real problem with some R packages running on multi-tenant compute environments).

Definitely an important point and you are in the position to highlight it because you are the creator of future, which will (hopefully) be the future of parallelism in R.

> I understand - and since we all (users and developers) have different objectives, it'll be really hard to reach a consensus on what the default design strategy should be. I'm still concerned that we'll end up with a "wild wild west", because it is really hard to predict where and how a package will be used in the future. Maintainers may move on, but the package may still be actively used.

ATM I would also assume that the best possible fix here would be to catch recursive parallelism both in R core's parallel package and in future, to prevent such cases and stop execution.

> In the aroma framework, I prepare everything for parallel processing, but leave it to the user to enable it with a single line of code

This is great, and I did something similar in sperrorest v1.0.0 using

par_args = list(par_mode = 2)

but it still seemed to be overlooked in practice. Of course, one could then just say "this is your fault" - but I preferred to force people to save time rather than overlook things and have to wait ;-)