Closed traversc closed 10 months ago
Thanks for reporting. I can reproduce this with:
```r
library(future.apply)

benchmark <- function(title) {
  n <- nbrOfWorkers()
  ## Burn-in (to rule out potential startup costs)
  void <- future_lapply(1:n, FUN = identity)
  gc()
  dt <- system.time({
    replicate(n = 10L, { void <- future_lapply(1:n, FUN = identity) })
  })
  cat(sprintf("%s:\n", deparse(attr(plan(), "call"))))
  print(dt)
  gc()
}

nworkers <- 8L

plan(multisession, workers = nworkers)
benchmark()
plan(sequential)

plan(cluster, workers = nworkers)
benchmark()
plan(sequential)

cl <- parallelly::makeClusterPSOCK(workers = nworkers)
plan(cluster, workers = cl)
benchmark()
plan(sequential)
parallel::stopCluster(cl)
```
which gives:
```
plan(multisession, workers = nworkers):
   user  system elapsed 
  2.777   0.031   3.283 

plan(cluster, workers = nworkers):
   user  system elapsed 
  2.871   0.024   3.365 

plan(cluster, workers = cl):
   user  system elapsed 
  5.083   0.052   5.587 
```
This shows that `plan(cluster, workers = n)` and `plan(multisession, workers = n)` have the same overhead, as expected. The only difference between these two should be that multisession also passes `rscript_libs = .libPaths()` to `makeClusterPSOCK()`.
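To illustrate that relationship, here is a rough sketch (simplified and hedged; the actual multisession setup handles additional details beyond the library paths):

```r
library(future)

## Sketch: multisession is roughly equivalent to a 'cluster' plan backed
## by a PSOCK cluster whose workers inherit the current library paths.
## (Simplified; not the exact internal implementation.)
plan(multisession, workers = 2L)

## ... roughly corresponds to:
cl <- parallelly::makeClusterPSOCK(2L, rscript_libs = .libPaths())
plan(cluster, workers = cl)
```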
Now, what's interesting is that `plan(cluster, workers = cl)` comes with a greater overhead than `plan(cluster, workers = n)`. It's not clear to me why this would be. I'm sure the reason will be obvious once it's been tracked down, but from a quick code inspection I couldn't spot anything. The two approaches create the same type of cluster with the same options/parameters, so I think it's something else.
The "journal" plots created by future.tools confirm these observations.
I'll try to investigate this further.
Thanks for detailing your thought process when debugging, Henrik. I know it must take extra time to write up things like this for us, but it's both educational (I keep notes on debugging issues with the future framework) and enjoyable from a curiosity perspective to see how you think through things.
Here's a reproducible example that does not depend on future.apply:
```r
library(future)

benchmark <- function(title) {
  dt <- system.time({
    replicate(n = 50L, { void <- value(future(NULL)) })
  })
  cat(sprintf("%s:\n", deparse(attr(plan(), "call"))))
  print(dt)
}

plan(cluster, workers = 1L)
benchmark()
plan(sequential)

cl <- parallelly::makeClusterPSOCK(workers = 1L)
plan(cluster, workers = cl)
benchmark()
plan(sequential)
parallel::stopCluster(cl)
```
which gives:
```
plan(cluster, workers = 1L):
   user  system elapsed 
  0.491   0.004   1.322 

plan(cluster, workers = cl):
   user  system elapsed 
  1.086   0.022   3.373 
```
I think I've narrowed it down to:
Basically, `expr2` is bigger than `expr1` below:
```r
plan(cluster, workers = 1)
void <- value(f <- future(NULL))
expr1 <- getExpression(f)

cl <- parallelly::makeClusterPSOCK(1)
plan(cluster, workers = cl)
void <- value(f <- future(NULL))
expr2 <- getExpression(f)
```
causing `workers = cl` to be passed along to each parallel worker in the latter case. So, the performance difference for every future launched is basically:
```r
library(parallelly)
cl <- parallelly::makeClusterPSOCK(workers = 1L)

## plan(cluster, workers = 1L)
args <- list(workers = 1L)
dt <- system.time(void <- parallel::clusterExport(cl, "args"))
print(dt)

## plan(cluster, workers = cl)
args <- list(workers = cl)
dt <- system.time(void <- parallel::clusterExport(cl, "args"))
print(dt)
```
e.g.
```
   user  system elapsed 
  0.000   0.000   0.001 
   user  system elapsed 
  0.004   0.000   0.015 
```
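The extra cost comes from having to serialize the cluster object itself on every export. As a rough illustration (a sketch added here, not part of the original report), one can compare the serialized payload sizes of the two `args` lists directly:

```r
library(parallelly)

cl <- parallelly::makeClusterPSOCK(workers = 1L)

## A plain integer serializes to a handful of bytes ...
size_int <- length(serialize(list(workers = 1L), connection = NULL))

## ... whereas the cluster object drags along per-node metadata
## (host names, session information, connection details), so its
## serialized payload is considerably larger.
size_cl <- length(serialize(list(workers = cl), connection = NULL))

cat(sprintf("workers = 1L: %d bytes; workers = cl: %d bytes\n",
            size_int, size_cl))

parallel::stopCluster(cl)
```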
I now have a prototype in the `feature/getExpression-performance` branch that avoids this extra overhead. It can be installed as:
```r
remotes::install_github("HenrikBengtsson/future", ref = "feature/getExpression-performance")
```
With the above benchmark example:
```r
library(future)

benchmark <- function(title) {
  dt <- system.time({
    replicate(n = 50L, { void <- value(future(NULL)) })
  })
  cat(sprintf("%s:\n", deparse(attr(plan(), "call"))))
  print(dt)
}

plan(cluster, workers = 1L)
benchmark()
plan(sequential)

cl <- parallelly::makeClusterPSOCK(workers = 1L)
plan(cluster, workers = cl)
benchmark()
plan(sequential)
parallel::stopCluster(cl)
```
we see that the difference is now gone:
```
plan(cluster, workers = 1L):
   user  system elapsed 
  0.325   0.001   0.399 

plan(cluster, workers = cl):
   user  system elapsed 
  0.325   0.000   0.395 
```
This new implementation passes all checks for:
I'll merge into the `develop` branch when I feel 100% confident this won't have any unknown side effects.
I ran all the tests I can think of and the patch seems good. I've merged it into the `develop` branch, so it'll be part of the next release.
Thanks again for reporting.
Forgot to say, the overhead would grow with the number of parallel workers in the `cluster` object.
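This follows from the serialization cost above: each node in the cluster object adds to the payload exported with every future. A hedged sketch (added here, not from the original report) of how the serialized size scales with worker count:

```r
library(parallelly)

## Sketch: the serialized size of a cluster object grows with the number
## of workers, so the per-future export overhead grows with it as well.
for (n in c(1L, 2L, 4L)) {
  cl <- parallelly::makeClusterPSOCK(workers = n)
  sz <- length(serialize(list(workers = cl), connection = NULL))
  cat(sprintf("%d worker(s): %d bytes\n", n, sz))
  parallel::stopCluster(cl)
}
```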
Describe the bug

I am trying to use `plan(cluster)` in order to export globals and set options (https://github.com/HenrikBengtsson/future/issues/273), but it is a lot slower than `plan(multisession)`.

Reproduce example
Running with `plan(multisession)`:

Running with `plan(cluster)`:
Expected behavior
I expect roughly the same performance for both plans.
Session information (Tested on various machines)