future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

plan for OpenMP? #388

Open mllg opened 4 years ago

mllg commented 4 years ago

In mlr3, we are looping over the folds of a cross-validation with future.apply. In each iteration, we call a function from an external library to fit a model. Many of these external libraries come with support for OpenMP, e.g. ranger or xgboost.

I was wondering if we could exploit the future API to set the number of OpenMP threads, e.g. via the environment variable OMP_NUM_THREADS.

As an alternative, it would probably be sufficient for us if we could create a custom plan that stores the number of workers to be used by the learners. We could then query that worker count and map it to the respective argument of the learner function. Is this already possible?
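Regarding the last part of the question: future already exposes the worker count of the active plan via nbrOfWorkers(), so a wrapper could query it and map it to a learner's threading argument. A minimal sketch (the mapping itself is hypothetical, not existing mlr3 code):

```r
library(future)

plan(multisession, workers = 2)

# Query the worker count of the active plan; a learner wrapper could
# map this to a threading argument such as ranger's num.threads.
threads <- nbrOfWorkers()
threads
```

This covers "query the number of workers again", though it does not by itself distinguish process workers from OpenMP threads.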

HenrikBengtsson commented 4 years ago

Can you pls clarify? I'm not sure I follow. Is this a long-term complicated design decision where you don't know what it would look like when implemented, or do you have a pretty clear view of what you want/need?

renkun-ken commented 4 years ago

you might want to try RhpcBLASctl::omp_set_num_threads?
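A caveat with this suggestion: RhpcBLASctl changes the thread count of the current process, so inside a future it has to run on the worker itself, not in the main session. A minimal sketch, assuming RhpcBLASctl is installed:

```r
library(future)

plan(multisession, workers = 2)

f <- future({
  # Runs in the worker process, so the setting applies there,
  # not in the main R session.
  RhpcBLASctl::omp_set_num_threads(2)
  RhpcBLASctl::omp_get_max_threads()
})
value(f)
```

The value reported by omp_get_max_threads() is platform-dependent (it depends on whether an OpenMP runtime is available), so treat it as informational.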

mllg commented 4 years ago

Can you pls clarify?

I'll try to make my point with some pseudo code. In our package, we are running something very similar to the following code:

learner_ranger = function(data, target, train, test) {
  model = ranger::ranger(reformulate(".", target), data = data[train, ])
  pred = predict(model, data = data[test, ])
  mean(data[test, target] == pred$predictions)
}

resample = function(learner, data, target) {
  future.apply::future_sapply(1:10, function(i, data, target) {
    n = nrow(data)
    train = sample(n, floor(0.67 * n))
    test = setdiff(1:n, train)
    learner(data, target, train, test)
  }, data = data, target = target, future.seed = TRUE)  # parallel-safe RNG
}

resample(learner_ranger, iris, "Species")

The user can now decide how to execute the resample() function by setting a plan(), e.g. to plan(multisession).

The ranger() function which fits a random forest has the argument:

num.threads: Number of threads. Default is number of CPUs available.

To control the number of threads, the user currently has to set an environment variable. Alternatively, we could make num.threads an argument of resample() and pass it down to the learner. But introducing an additional argument here is confusing, as not every model supports OpenMP. Additionally, it is hard to explain why there would then be two ways to control the parallelization.
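For completeness, the environment-variable route has to happen inside the future expression so that it is set in the worker process (a sketch; note that OpenMP runtimes typically read OMP_NUM_THREADS when threading is first initialized, so this must run before the model library spins up its threads):

```r
library(future.apply)

plan(multisession, workers = 2)

res <- future_sapply(1:4, function(i) {
  Sys.setenv(OMP_NUM_THREADS = "2")  # takes effect in the worker process
  as.integer(Sys.getenv("OMP_NUM_THREADS"))
})
res
```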

I'm now suggesting to introduce a plan to control threading, e.g.

plan(openmp, workers = 4)

This plan could behave like a sequential plan, but set the respective environment variable OMP_NUM_THREADS. This would also allow doing things like starting a cluster of (remote) workers, each spawning some OMP threads via

plan(list(cluster, openmp))
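An openmp strategy does not exist today, but the nested topology can be approximated with the current API by combining tweak() with a sequential inner level and setting the thread count by hand (a sketch under that assumption):

```r
library(future)

# Outer level: two parallel workers; inner level: sequential, with
# OpenMP threading controlled manually inside each task.
plan(list(tweak(multisession, workers = 2), sequential))

f <- future({
  Sys.setenv(OMP_NUM_THREADS = "4")  # e.g. picked up by ranger/xgboost
  Sys.getenv("OMP_NUM_THREADS")
})
value(f)
```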

HenrikBengtsson commented 4 years ago

I think I understand what your needs are. One problem with a new openmp strategy is that OpenMP is not something that future()/value() can parallelize over. My gut feeling is that the number of multithreading "workers" to use on each worker needs to be controlled via an additional argument when setting up the plan(), e.g.

plan(multisession, workers = 3, threads_per_worker = 4)

There have been discussions elsewhere about being able to set options and environment variables per plan. Maybe this falls under that need.

I need to think about this one.