futureverse / future.apply

:rocket: R package: future.apply - Apply Function to Elements in Parallel using Futures
https://future.apply.futureverse.org

Strip environment from exported function #99

Closed odelmarcelle closed 2 years ago

odelmarcelle commented 2 years ago

Related to #98. This reduces the amount of data transferred to cluster workers by stripping the enclosing environment from the exported function. It substantially speeds up the export to PSOCK clusters and does not seem to cause any issues.
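
To illustrate why the enclosing environment matters, here is a minimal base-R sketch. It is not the code from this pull request. It assumes that the cost of exporting a function to a PSOCK worker is roughly proportional to its serialized size; `serialized_size()` and `make_closure()` are hypothetical helpers introduced only for this illustration.

```r
# Hypothetical helper: size in bytes of an object once serialized,
# used here as a rough proxy for what has to be sent to a worker.
serialized_size <- function(obj) length(serialize(obj, connection = NULL))

make_closure <- function() {
  big <- runif(1e6)          # about 8 MB stored in this call's frame
  function(y) identity(y)    # closure whose enclosing environment holds `big`
}

f <- make_closure()
size_with_env <- serialized_size(f)

# "Strip" the enclosing environment by pointing it at the global
# environment, which is serialized by reference rather than by value.
g <- f
environment(g) <- globalenv()
size_stripped <- serialized_size(g)

print(c(with_env = size_with_env, stripped = size_stripped))
```

The closure returned inside `make_closure()` drags `big` along when serialized, even though it never uses it. This resembles benchmark case 3 below, where the anonymous function is created inside `some_function2()`, whose frame contains the large argument `x`.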

Benchmark before/after

Setup

set.seed(123)
long_characters <- setNames(replicate(
  10000,
  paste0(c(letters, rep(" ", 10))[
    sample.int(length(letters) + 1, 5000, replace = TRUE)],
    collapse = ""),
  simplify = FALSE
), paste0("element", 1:10000))
object.size(long_characters) |> format("Mb")
#> [1] "49.5 Mb"

some_function1 <- function(x) {
  identity(x)
}
some_function2 <- function(x) {
  future_lapply(x, function(y) {
    identity(y)
  })
}

Before changes

library(future)
library(future.apply)
library(foreach); library(doFuture); registerDoFuture()
plan(multisession, workers = 4)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 10
)
benchmark[, c(1:5, 7)]
#> # A tibble: 4 x 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 1          234.85ms 557.59ms     1.48     5.04MB
#> 2 2          243.29ms 555.57ms     1.80     2.35MB
#> 3 3          992.29ms    1.66s     0.600    2.35MB
#> 4 4             1.23s    1.33s     0.692   53.33MB

Created on 2022-03-12 by the reprex package (v2.0.1)

After changes

library(future)
library(future.apply)
library(foreach); library(doFuture); registerDoFuture()
plan(multisession, workers = 4)
benchmark <- bench::mark(
  `1` = future_lapply(long_characters, identity),
  `2` = future_lapply(long_characters, some_function1),
  `3` = some_function2(long_characters),
  `4` = foreach(x = long_characters, .final = function(x) setNames(x, names(long_characters))
  ) %dopar% {some_function1(x)},
  filter_gc = FALSE,
  min_iterations = 10
)
benchmark[, c(1:5, 7)]
#> # A tibble: 4 x 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 1          248.66ms 401.04ms     2.14     5.91MB
#> 2 2          243.32ms  249.3ms     2.91     2.81MB
#> 3 3          244.78ms 399.29ms     2.32     2.79MB
#> 4 4             1.24s    1.29s     0.708   53.33MB

odelmarcelle commented 2 years ago

After some testing, this does not work as expected because ...future.FUN is still evaluated with its own environment, so it does not seem to make use of the exported globals. I will look for a way to solve that.
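
One way to read this problem, sketched in base R without any future machinery: a function resolves its free variables through its own enclosing environment, so replacing that environment changes what the function can see. The names `make_scaler()` and `multiplier` are hypothetical, and the use of a fresh environment below is an assumption for illustration, not what the pull request does.

```r
make_scaler <- function(multiplier) {
  force(multiplier)
  # `multiplier` is a free variable, looked up in the enclosing environment
  function(y) y * multiplier
}

scale_by_3 <- make_scaler(3)
ok <- scale_by_3(2)   # works: `multiplier` is found in the closure's environment

# Replace the enclosing environment with one that does not contain
# `multiplier`. The function body is unchanged, but the lookup now fails.
stripped <- scale_by_3
environment(stripped) <- new.env(parent = baseenv())

result <- tryCatch(stripped(2), error = function(e) e)
print(inherits(result, "error"))
```

Any safe stripping strategy therefore has to make sure that every variable the function actually needs is still reachable from the environment in which it ends up being evaluated.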

odelmarcelle commented 2 years ago

I don't see how to solve the issue safely for all possible uses of future.apply. A satisfactory solution would probably require numerous changes to future.apply and to the cluster class of future.

In the absence of an easy-to-implement solution, I'm closing this pull request. If future.apply is not capable of efficiently stripping the unwanted function environment, it is up to the user to be careful about the environment of the provided function.
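
As a sketch of what "being careful" could look like on the user side, the function can be given a small environment that contains only the variables it needs before it is handed to an apply call. `with_minimal_env()` and `process_all()` are hypothetical helpers, and `lapply()` stands in for `future_lapply()` so that the example runs with base R alone; whether this interacts well with the globals detection in future has not been verified here.

```r
# Hypothetical helper: re-home `fun` in a fresh environment holding only
# the named values passed in `...`, with the global environment as parent.
with_minimal_env <- function(fun, ...) {
  environment(fun) <- list2env(list(...), parent = globalenv())
  fun
}

process_all <- function(x, offset) {
  # Without the helper, the anonymous function would capture this frame,
  # including the potentially large `x`.
  worker <- with_minimal_env(function(y) y + offset, offset = offset)
  lapply(x, worker)   # stand-in for future_lapply(x, worker)
}

res <- process_all(list(1, 2, 3), offset = 10)
str(res)
```

A simpler alternative with the same effect is to define the worker function at top level, as `some_function1` is in the benchmark setup, so that its enclosing environment is the global environment.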