HenrikBengtsson / doFuture

:rocket: R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org

loadbalancing, `plan(multisession)` and persistent workers. #63

Closed hans-ekbrand closed 2 years ago

hans-ekbrand commented 3 years ago

Thanks for future and doFuture!

I use these in a package, and on Linux `plan(multicore)` with `.future.options = list(prescheduling = FALSE)` works perfectly. But I have to support MS Windows too.

My use case is one where computation times are very uneven and, for most jobs, relatively short compared to the time it takes to start R. The program uses many foreach loops, so having a way to keep the workers alive between these loops would be great, but as far as I can tell, each worker is discarded once it is done with its job(s); is that correct?

For MS Windows, I use `plan(multisession)`, and if I use `prescheduling = FALSE`, then it seems a completely new R instance is started for each job, which is very bad when the average computation time for a job is of the same magnitude as the time to start R and load the required libraries. So for now I use `prescheduling = TRUE` on Windows, and while it is not optimal, it works pretty OK. Is there a better way for me to do it?

The real problem, though, is that I cannot figure out how to make the workers persistent, which is very frustrating since I have a computer with 80 logical cores, but starting 80 new instances of R for every one of my foreach loops is very slow.

This is my boilerplate code, and I have about 15 of these in the whole program. Is there a way to make the workers persistent through the whole program?

```r
doFuture::registerDoFuture()
if (.Platform$OS.type == "unix") {
  future::plan(multicore)
  my.scheduling <- FALSE
} else {
  future::plan(multisession, gc = TRUE, workers = n.cores)
  my.scheduling <- TRUE
}
foreach::foreach(...,
                 .options.future = list(scheduling = my.scheduling)) %dopar% { ... }
```

Kind regards,

Hans Ekbrand

HenrikBengtsson commented 3 years ago

Hi.

> ... each worker is discarded when it is done with its job(s), is that correct?

Nah, `plan(multisession)` launches R workers in the background and keeps them around until you shut them down, which you can do by switching plan, e.g. `plan(sequential)` ...
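A quick way to see this for yourself (a minimal sketch, not from the thread; worker count is arbitrary): the same worker process IDs show up across consecutive `foreach()` loops as long as the multisession plan stays in place.

```r
library(doFuture)   # also attaches foreach
registerDoFuture()
plan(multisession, workers = 2)  # launches two background R sessions once

# Two separate foreach() loops reuse the same background workers:
pids1 <- unlist(foreach(i = 1:2) %dopar% Sys.getpid())
pids2 <- unlist(foreach(i = 1:2) %dopar% Sys.getpid())
identical(sort(pids1), sort(pids2))  # same worker processes in both loops

plan(sequential)  # this is what actually shuts the workers down
```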

> I use plan(multisession) and if I use prescheduling=FALSE, then it seems a completely new R instance is started for each job,

... so, that's not a correct conclusion.

> which is very bad if the average computation time for a job is in the same magnitude as the computation time to start R and load the required libraries. So for now I use prescheduling=TRUE for Windows, and while it is not optimal it works pretty OK.

Note that there's no option/argument called `prescheduling`, but I assume you meant `scheduling`, as in your code snippet.

Have a look at Section 'Load balancing ("chunking")' in `?doFuture::doFuture`. Specifically, note that you're using one of the two extremes right now: `scheduling = FALSE` (one future per iteration) or `scheduling = TRUE` (roughly one chunk of iterations per worker).

Try, for instance, `.options.future = list(scheduling = 5.0)`. That way each worker will process approximately 5 chunks, which helps deal with non-uniform runtimes across iterations.
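For illustration, a middle-ground setting could look like this (a hypothetical loop; the worker count and iteration range are made up):

```r
library(doFuture)
registerDoFuture()
plan(multisession, workers = 4)

# scheduling = 5.0: iterations are split into ~5 chunks per worker
# (~20 chunks total here), so slow chunks can be rebalanced across
# workers without the overhead of one future per iteration
res <- foreach(i = 1:1000,
               .options.future = list(scheduling = 5.0)) %dopar% {
  sqrt(i)
}
```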

> This is my boilerplate code, and I have about 15 of these in the whole program. ...

Are you saying you're calling `registerDoFuture()` and `plan(...)` in each function call? If so, then yes, you're most likely paying a large overhead from `plan(multisession)` setting up new workers in each call. It's much better to leave it to the end user to configure `plan()` once before calling your functions.
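A sketch of that separation (function and variable names here are made up for illustration): the package code only uses `foreach()`, and the end user picks the backend once per session.

```r
## In the package: no registerDoFuture()/plan() calls inside functions
my_analysis <- function(xs) {
  foreach::foreach(x = xs) %dopar% {
    some_expensive_step(x)  # hypothetical per-element computation
  }
}

## In the user's script, once per session:
doFuture::registerDoFuture()
future::plan(future::multisession, workers = parallelly::availableCores())
results <- my_analysis(1:100)
```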

HenrikBengtsson commented 2 years ago

Closing because no follow-up after more than a year.