HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org
951 stars 83 forks source link

Error when using envir & globals parameters in futures that are not using the multisession plan #280

Open avsdev-cw opened 5 years ago

avsdev-cw commented 5 years ago

See snippet below to produce the error:

library(future)
e <- new.env(parent = emptyenv())
e$test <- 123
f <- future({ paste("hello", test) }, envir = e, globals = ls(e), lazy = TRUE)
value(f)
# RESULT: Error in { : could not find function "{"
plan(multisession)
f2 <- future({ paste("hello", test) }, envir = e, globals = ls(e), lazy = TRUE)
value(f2)
# RESULT: [1] "hello 123"

My guess is there is some difference in the way the multisession and sequential/multicore etc is set up in terms of pre-loading libraries. Without digging into the source, I would guess that the multi-session sessions are set up with certain libraries loaded whereas the other plans are not. Perhaps when using the globals and envir parameters, the future library (or is it another library?) is injected.

NOTE: lazy = TRUE is not necessary, error still occurs, just on line 4 instead

avsdev-cw commented 5 years ago

NOTE: It also appears that using baseenv() exposes a more useful error (which shows future is indeed trying to be attached):

Error in packageVersion("future") : 
  could not find function "packageVersion"

Using the package wahani/modules there is a fix to get a minimal required environment based off emptyenv()/baseenv() by injecting the base & utils packages into the environment (base::library has no equivalent call)

library(future)
e <- new.env(parent = emptyenv())
e$test <- 123
modules::import(base, where = e) # exclude if using baseenv() as parent
modules::import(utils, where = e)
f <- future({ paste("hello", test) }, envir = e, globals = ls(e), lazy = TRUE)
value(f)
# RESULT: [1] "hello 123"
HenrikBengtsson commented 5 years ago

Thanks for reporting. So, I have this one on a (private) backburner issue tracker. The simple explanation is that the environment that local/sequential futures are evaluated in becomes the same as the environment specified to identify globals (here argument envir). This is an unfortunate side effect of the current sequential (and forked/multicore) evaluation implementation. Here's a base R snippet illustrating what's going on:

> e <- new.env(parent = emptyenv())
> eval(quote({ 42 }), envir = e)
Error in { : could not find function "{"

In order to fix this pothole, which is on the todo list, is to always copy over globals to the evaluation environment. This is done for all external workers but not the sequential ones. This will come with some extra overhead but will help avoid discrepancies like this one. The upside is that it will solve some other corn-case scenarios.

Q. I'm curious, but it'll also help me understand in what cases it causes a problem, what the underlying use case where you bumped into this issue?

avsdev-cw commented 5 years ago

To answer your question, we're working with some HUGE (1-5GB) shapefiles, which are being sectioned and parallel processed. In order to avoid the parallel processes either a) loading the shapefiles themselves and consuming the memory in no time at all or b) trying to copy the large origin environment across the futures (default future behaviour - also consumes memory) we are creating new environments each containing a section of the whole and forcing the futures to use these new environments.

One of the ways we can work around this (sort of...) is to create the data slices and save them to file, close the splitting R session, then have a new R session which creates and runs the 5000 futures (each future gets a filename in their environment to load & process), using resolve() (there is a bug in resolve() related to the progress function which I will open in another ticket when I have a chance). Memory management in R being so terrible that free'd memory isn't actually fully released is the real nub of our problem.

avsdev-cw commented 5 years ago

In order to fix this pothole, which is on the todo list, is to always copy over globals to the evaluation environment. This is done for all external workers but not the sequential ones. This will come with some extra overhead but will help avoid discrepancies like this one. The upside is that it will solve some other corn-case scenarios.

Rather than copy all globals across, would it not be better to use a minimal environment and assume either envir or the content of the future contains necessary library() calls to run (with the exception of base and utils::packageVersion or other methods needed to load the future library)? That way, by default envir = parent.frame() will still facilitate the future with currently loaded libraries and envir = custom_env will not rely the user/programmer having to load base/utils/future manually and thus be a true R --vanilla style environment? (The modules package for example only loads base).

Additionally, this could mean that a future could be as simple as future({ source('some_script.R') }, envir = emptyenv()) which would run the source'd script without any dependency on the current environment (ie: the same as Rscript --vanilla some_script.R)

EDIT: This would also help in debugging what you want the future to do. If you can call Rscript --vanilla some_script.R or given data(environment) stored in a file envir.RData R -e 'load("envir.RData"); source("some_script.R");' you can develop/test the future prior to your main parallel run.

HenrikBengtsson commented 5 years ago

Rather than copy all globals across, would it not be better to use a minimal environment and assume either envir or the content of the future contains necessary library() calls to run (with the exception of base and utils::packageVersion or other methods needed to load the future library)?

It's possible I misunderstand you but note that the default behavior is to automatically identify the objects ("globals") and packages needed to evaluate the future expression. It does not export all of the calling environment - only those globals needed. The packages required are assumed to be available on the worker, so those are loaded on the worker prior to evaluation.

Additionally, this could mean that a future could be as simple as future({ source('some_script.R') }, envir = emptyenv()) which would run the source'd script without any dependency on the current environment (ie: the same as Rscript --vanilla some_script.R)

Calling future({ source('some_script.R') }) will identify base::{ and base::source as globals, but since they're part of the base package they don't need to be exported ... and since base is always attached neither is base itself explicitly loaded.

It might be that you're misinterpreting what argument envir is meant to do. It is "The environment from where global objects should be identified." and it is not the environment where the expression is evaluated(). () The exception is what's said in https://github.com/HenrikBengtsson/future/issues/280#issuecomment-462451568.

To assert that a future can be evaluated "anywhere", I often use:

plan(cluster, workers = "localhost")

That will set up a single background worker process and any future evaluated will be have any globals required exported to the worker. If that fails, then you'll get an error.

From your description of your use case, can't you do something like;

fs <- list()
for (ii in seq_along(shapefiles)) {
  data <- my_read(shapefiles[ii])
  fs[[ii]] <- future({ do_something(data, ...) })
}
vs <- values(fs)

I use that model in a lot of my large-scale genomics analysis processing 100-1000's of files each 50+ GiB.

avsdev-cw commented 5 years ago

It's possible I misunderstand you but note that the default behavior is to automatically identify the objects ("globals") and packages needed to evaluate the future expression. It does not export all of the calling environment - only those globals needed.

Ah, this may be where I was falling foul at one point. Through bad coincidence, the large unsplit object outside the future had the same name as a variable inside the future, and thus I'm guessing the algorithm detecting the requirements picked up the large outer object and tried to pass it into ALL of the futures, (size error kicked off), which I didn't want and ended up trying to solve with a custom envir/globals). Controlling the evaluation environment is essentially what I am kind of doing by doing envir = e, globals = ls(e) and loading packages manually inside the future.

It doesn't help that throughout all this I have one package that loads it's own global "hidden" environment (ie starts with ".") for some of its functions which are then called as default parameters to a public function... causing mayhem.

EDIT1: In addition, I wonder if you have also noticed that running a list of futures with resolve() does not seem to always call the progress function? EDIT2: out of interest I ran with options(future.debug = TRUE) & options(mc.cores = 6) (6 of 48 cores) and here is a snippet (& my progress function):

progressPrint <- function(done, total) {
  time <- paste0("[", rfc3339(anytime(Sys.time())), "]")
  counts <- paste0("(", done, " of ", total, ")")
  percent <- paste0(round(done / total * 100, 1), "%")
  cat(time, "Resolving", counts, percent, "\n")
}
requestCore(): workers = 6
Poll #1 (0): usedCores() = 6, workers = 6
Poll #2 (1.4 secs): usedCores() = 6, workers = 6
Poll #3 (2.81 secs): usedCores() = 6, workers = 6
Poll #4 (4.22 secs): usedCores() = 6, workers = 6
Poll #5 (5.62 secs): usedCores() = 6, workers = 6
Poll #6 (7.04 secs): usedCores() = 6, workers = 6
Poll #7 (8.45 secs): usedCores() = 6, workers = 6
Poll #8 (9.86 secs): usedCores() = 6, workers = 6
Poll #9 (11.29 secs): usedCores() = 6, workers = 6
Poll #10 (12.71 secs): usedCores() = 6, workers = 6
Poll #11 (14.13 secs): usedCores() = 6, workers = 6
Poll #12 (15.56 secs): usedCores() = 6, workers = 6
Poll #13 (16.98 secs): usedCores() = 6, workers = 6
Poll #14 (18.41 secs): usedCores() = 6, workers = 6
Poll #15 (19.84 secs): usedCores() = 6, workers = 6
Poll #16 (21.28 secs): usedCores() = 6, workers = 6
Poll #17 (22.71 secs): usedCores() = 6, workers = 6
Poll #18 (24.15 secs): usedCores() = 6, workers = 6
Poll #19 (25.59 secs): usedCores() = 6, workers = 6
Poll #20 (27.03 secs): usedCores() = 6, workers = 6
Poll #21 (28.48 secs): usedCores() = 6, workers = 6
Poll #22 (29.92 secs): usedCores() = 6, workers = 6
Poll #23 (31.37 secs): usedCores() = 6, workers = 6
Poll #24 (32.83 secs): usedCores() = 6, workers = 6
Poll #25 (34.28 secs): usedCores() = 6, workers = 6
Poll #26 (35.74 secs): usedCores() = 6, workers = 6
Poll #27 (37.2 secs): usedCores() = 6, workers = 6
Poll #28 (38.66 secs): usedCores() = 6, workers = 6
Poll #29 (40.13 secs): usedCores() = 6, workers = 6
Poll #30 (41.59 secs): usedCores() = 6, workers = 6
Poll #31 (43.06 secs): usedCores() = 6, workers = 6
Poll #32 (44.54 secs): usedCores() = 6, workers = 6
Poll #33 (46.02 secs): usedCores() = 6, workers = 6
Poll #34 (47.49 secs): usedCores() = 6, workers = 6
Poll #35 (48.98 secs): usedCores() = 6, workers = 6
Poll #36 (50.46 secs): usedCores() = 6, workers = 6
Poll #37 (51.95 secs): usedCores() = 6, workers = 6
Poll #38 (53.44 secs): usedCores() = 6, workers = 6
Poll #39 (54.93 secs): usedCores() = 6, workers = 6
Poll #40 (56.42 secs): usedCores() = 6, workers = 6
Poll #41 (57.92 secs): usedCores() = 6, workers = 6
Poll #42 (59.42 secs): usedCores() = 6, workers = 6
Poll #43 (1.02 mins): usedCores() = 6, workers = 6
Poll #44 (1.04 mins): usedCores() = 6, workers = 6
Poll #45 (1.07 mins): usedCores() = 6, workers = 6
Poll #46 (1.09 mins): usedCores() = 6, workers = 6
Poll #47 (1.12 mins): usedCores() = 6, workers = 6
Poll #48 (1.14 mins): usedCores() = 6, workers = 6
Poll #49 (1.17 mins): usedCores() = 6, workers = 6
Poll #50 (1.19 mins): usedCores() = 6, workers = 6
plan(): Setting new future strategy stack:
List of future strategies:
1. multicore:
   - args: function (expr, envir = parent.frame(), substitute = TRUE, lazy = FALSE, seed = NULL, globals = TRUE, workers = availableCores(constraints = "multicore"), earlySignal = FALSE, label = NULL, ...)
   - tweaked: FALSE
   - call: future::plan(future::multicore)

plan(): nbrOfWorkers() = 6
MulticoreFuture started
plan(): Setting new future strategy stack:
List of future strategies:
1. sequential:
   - args: function (expr, envir = parent.frame(), substitute = TRUE, lazy = FALSE, seed = NULL, globals = TRUE, local = TRUE, earlySignal = FALSE, label = NULL, ...)
   - tweaked: FALSE
   - call: NULL

plan(): nbrOfWorkers() = 1
 length: 45 (resolved future 1)
[2019-02-12T18:20:02+0000] Resolving (19 of 64) 29.7% 
 length: 44 (resolved future 2)
[2019-02-12T18:20:02+0000] Resolving (20 of 64) 31.2% 
 length: 43 (resolved future 3)
[2019-02-12T18:20:02+0000] Resolving (21 of 64) 32.8% 
 length: 42 (resolved future 4)
[2019-02-12T18:20:02+0000] Resolving (22 of 64) 34.4% 
 length: 41 (resolved future 5)
[2019-02-12T18:20:02+0000] Resolving (23 of 64) 35.9% 
 length: 40 (resolved future 9)
[2019-02-12T18:20:02+0000] Resolving (24 of 64) 37.5% 
 length: 39 (resolved future 10)
[2019-02-12T18:20:02+0000] Resolving (25 of 64) 39.1% 
 length: 38 (resolved future 11)
[2019-02-12T18:20:02+0000] Resolving (26 of 64) 40.6% 
 length: 37 (resolved future 12)
[2019-02-12T18:20:02+0000] Resolving (27 of 64) 42.2% 
 length: 36 (resolved future 13)
[2019-02-12T18:20:02+0000] Resolving (28 of 64) 43.8% 
 length: 35 (resolved future 14)
[2019-02-12T18:20:02+0000] Resolving (29 of 64) 45.3% 
 length: 34 (resolved future 15)
[2019-02-12T18:20:02+0000] Resolving (30 of 64) 46.9% 
 length: 33 (resolved future 17)
[2019-02-12T18:20:02+0000] Resolving (31 of 64) 48.4% 
 length: 32 (resolved future 18)
[2019-02-12T18:20:02+0000] Resolving (32 of 64) 50% 
 length: 31 (resolved future 19)
[2019-02-12T18:20:02+0000] Resolving (33 of 64) 51.6% 
 length: 30 (resolved future 20)
[2019-02-12T18:20:02+0000] Resolving (34 of 64) 53.1% 
 length: 29 (resolved future 21)
[2019-02-12T18:20:02+0000] Resolving (35 of 64) 54.7% 
 length: 28 (resolved future 22)
[2019-02-12T18:20:02+0000] Resolving (36 of 64) 56.2% 
 length: 27 (resolved future 23)
[2019-02-12T18:20:02+0000] Resolving (37 of 64) 57.8% 
 length: 26 (resolved future 24)
[2019-02-12T18:20:02+0000] Resolving (38 of 64) 59.4% 
 length: 25 (resolved future 25)
[2019-02-12T18:20:02+0000] Resolving (39 of 64) 60.9% 
 length: 24 (resolved future 26)
[2019-02-12T18:20:02+0000] Resolving (40 of 64) 62.5% 
 length: 23 (resolved future 27)
[2019-02-12T18:20:02+0000] Resolving (41 of 64) 64.1% 
 length: 22 (resolved future 28)
[2019-02-12T18:20:02+0000] Resolving (42 of 64) 65.6% 
 length: 21 (resolved future 29)
[2019-02-12T18:20:02+0000] Resolving (43 of 64) 67.2% 
 length: 20 (resolved future 30)
[2019-02-12T18:20:02+0000] Resolving (44 of 64) 68.8% 
 length: 19 (resolved future 31)
[2019-02-12T18:20:02+0000] Resolving (45 of 64) 70.3% 
 length: 18 (resolved future 32)
[2019-02-12T18:20:02+0000] Resolving (46 of 64) 71.9% 
 length: 17 (resolved future 36)
[2019-02-12T18:20:02+0000] Resolving (47 of 64) 73.4% 
 length: 16 (resolved future 37)
[2019-02-12T18:20:02+0000] Resolving (48 of 64) 75% 
 length: 15 (resolved future 38)
[2019-02-12T18:20:02+0000] Resolving (49 of 64) 76.6% 
 length: 14 (resolved future 39)
[2019-02-12T18:20:02+0000] Resolving (50 of 64) 78.1% 
 length: 13 (resolved future 40)
[2019-02-12T18:20:02+0000] Resolving (51 of 64) 79.7% 
 length: 12 (resolved future 44)
[2019-02-12T18:20:02+0000] Resolving (52 of 64) 81.2% 
 length: 11 (resolved future 45)
[2019-02-12T18:20:02+0000] Resolving (53 of 64) 82.8% 
 length: 10 (resolved future 48)
[2019-02-12T18:20:02+0000] Resolving (54 of 64) 84.4% 
 length: 9 (resolved future 53)
[2019-02-12T18:20:02+0000] Resolving (55 of 64) 85.9% 
 length: 8 (resolved future 56)
[2019-02-12T18:20:02+0000] Resolving (56 of 64) 87.5% 
 length: 7 (resolved future 61)
[2019-02-12T18:20:02+0000] Resolving (57 of 64) 89.1% 
 length: 6 (resolved future 62)
[2019-02-12T18:20:02+0000] Resolving (58 of 64) 90.6% 
HenrikBengtsson commented 5 years ago

EDIT1: In addition, I wonder if you have also noticed that running a list of futures with resolve() does not seem to always call the progress function?

Thanks for this. I won't troubleshoot this one because that teeny progress bar feature is a remnant from early days that should not really be used and will be removed (hence Issue #282). The goal is to introduce a proper, generic mechanism for tracking the internal progress of futures, which will probably be based on some kind of hook functions (Roadmap Issue https://github.com/HenrikBengtsson/future/issues/172#issue-268516757)