HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org
955 stars 84 forks source link

DEPRECATION: local=FALSE and persistent=TRUE #382

Closed HenrikBengtsson closed 4 years ago

HenrikBengtsson commented 4 years ago

I'm planning to phase out support for local = FALSE and possibly also persistent = TRUE on cluster futures. I'm posting this issue to clarify that this is on the roadmap and to provide a location for discussion this move. If someone out there make use of local = FALSE or persistent = TRUE, I'd like to hear how and why. It might be that one solution is to provide workarounds with future backend solutions that support these but are not fully compliant to the Future API.

The main reason for deprecating local = FALSE is that it may affect the results and it is not supported by all backend. This is not safe because as a developer you don't know what backend the end-user will use and as an end-user you might break the developer's intentions if you set it for a backend. It was added to the Future API in the early days during its "experimental" phase. Also, keeping support for argument local prevents further development of the Future API.

The main reason for persistent = TRUE is it's use in remote futures where you want to control a remote R session as if it was your main one, but control it from your local one. An example of this can be seen in https://cran.r-project.org/web/packages/future/vignettes/future-3-topologies.html. I also think that some people use

cl <- future::makeClusterPSOCK(...)
plan(cluster, workers = cl, persistent =TRUE)
parallel::clusterExport(cl, c("some", "large", "global", "objects"))

to avoid large objects being exported/transferred each time some "heavy" future are launched. If this is a common use case, please note that I've got some ideas on adding support for sticky globals (https://github.com/HenrikBengtsson/future/issues/339#issuecomment-533730606). For this to work across the board and with all backends, this feature needs to be optional, i.e. globals are sticky and cached on workers if supported and up-to-date, otherwise they are re-exported. With support for the latter, the need of the above kind of persistent = TRUE is less needed. Support for persistent = TRUE is a bit of a hack and like local it blocks taking the Future API to the next level.

renkun-ken commented 4 years ago

Thanks for the idea of sticky globals.

In my use case in practice, I need both sticky globals exported to all cluster nodes in the beginning. They will not be changed any more.

And in each iteration every couple of minites, there are several less sticky variables that needs to be exported to all cluster nodes, such as the current iteration number, timestamp, etc, all nodes need to sync these variables before they start to process their tasks. To attach a list of objects in the search path looks nice when they are permenent and not subject to change. But what about these variables that actually needs update each time? Is it compatible with the roadmap or design of future to use clusterExport each time before starting tasks?

HenrikBengtsson commented 4 years ago

Is it compatible with the roadmap or design of future to use clusterExport each time before starting tasks?

Calling clusterExport() just before creating futures is not really part of the Future API and how futures should be used - simply because you shouldn't make assumptions about future/parallel backends when using futures. For instance, your code would break if plan(sequential) would be used.

Maybe you meant calling clusterExport() when setting up a parallel PSOCK cluster. Then, yes, in some sense it should theoretically be compatible with what I'm planning on doing. But, rather than exporting objects to the workers' global environment, it's probably safer to put those objects in a separate environment on the search() path as was prototyped in https://github.com/HenrikBengtsson/future/issues/339#issuecomment-533730606.

renkun-ken commented 4 years ago

Yes, I'm only talking about PSOCK cluster. Since PSOCK cluster is the only backend that works with my use case and I know exactly what I'm doing with it but some particular operations have to be done to make it more performant as needed. I'm curious that if the compatibility of different backend is top concern, and the real difference of operating on different backends are not accessible by user, it would be an obvious limitation for user to take advantage a certain backend.

HenrikBengtsson commented 4 years ago

My goal, without having to give up on the essential design objective that all future code should be able to run on any type of future backends, is to make it possible for these type of needs to be more efficient on certain backends than others. A caching framework for globals which may only apply to certain backends is one example of this.

In the ideal world, such caching could be fully automated. For example, instead of exporting all globals (as done now), the main R process could checksum the global objects and ask the worker whether it already has such an object or not. If it already exists on the worker, then the object does not have to be transferred (and thereby saving time and I/O). If it already exists, it will be transferred as now. What is complicated is how to specify which globals could/should be cached and which shouldn't. This would require some additional syntax. Caching all objects can quickly consume too much memory. Then we some type of syntax for communicating when the cache can be cleared.

I should also say, moving toward persistent = FALSE and making that the only option, does not preventing you from storing data persistently on the worker. persistent = FALSE/TRUE only controls whether the global environment should be wiped on the worker or not. Some rudimentary form of "sticky globals" will be in place before dropping support for persistent = TRUE. Maybe even by introducing helper functions such as clusterExportSticky() and clusterEraseSticky().

dwachsmuth commented 4 years ago

I'm currently using persistent = TRUE in order to set separate proxy servers for multithreaded web scraping functions I use. The use case is a little complex, but basically the Selenium headless browser looks for a list of arguments in the calling environment, and when I want to launch n browsers (one per future), I initialize the clusters first, put the proxy server information into each future with parallel::clusterApply, and then use foreach with %dopar% to do the scraping within the futures.

The relevant code looks like:

cl <- parallel::makeCluster(cores, setup_strategy = "sequential")
future::plan(future::cluster, workers = cl, persistent = TRUE)
proxies <- c("A", "B", "C", "D") # Actual proxy servers are specified here
parallel::clusterApply(cl, proxies, function(x) {proxy <<- x})
foreach(i = seq_along(URLS_to_scrape)) %dopar$ {
  # Do the scraping here
}

Would it be possible to supply the proxy servers as globals, but send separate globals to each future? (That's the key behaviour I need in this use case: each future should have a different proxy server specified.)

HenrikBengtsson commented 4 years ago

Hi. Thanks for reaching out with your usecase. Yes, you'll be able to achieve what you need also without using persistent=TRUE. Below is how you can, and actually should, do it already now.

I took the freedom to rewrite what you're doing right now, to help clarify what can be done:

## Set up cluster workers
cl <- parallel::makeCluster(4L, setup_strategy = "sequential")

## Customize each worker
proxies <- c("A", "B", "C", "D") # Actual proxy servers are specified here
stopifnot(length(proxies) == length(cl))  ## Implicit assumption
dummy <- parallel::clusterApply(cl, proxies, function(x) {
  config <- list(proxy=x)                                    
  env <- globalenv()                           
  env[["config"]] <- config
})

## Use these cluster workers for futures
future::plan(future::cluster, workers = cl, persistent = TRUE)

## The foreach to parallelize with futures
library(doFuture)
registerDoFuture()

URLS_to_scrape <- 1:8
res <- foreach(i = seq_along(URLS_to_scrape)) %dopar% {
  # Do the scraping here
  data.frame(i=i, proxy=config$proxy)
}
res <- do.call(rbind, res)
print(res)
###   i proxy
### 1 1     A
### 2 2     A
### 3 3     B
### 4 4     B
### 5 5     C
### 6 6     C
### 7 7     D
### 8 8     D

Now, the main reason for you using persistent=TRUE above is actually that futures do not automatically clean up the global environment of the workers. If that wouldn't take place, you would have to use persistent=TRUE.

Now, you can "attach" variables elsewhere on the search() path, where the easiest way to to create a new environment on position 2 immediately after the global environment. This way, if config is neither in the local environment nor the global environment, then it will be found in this custom environment. Here is how you do this:

## Set up cluster workers
cl <- parallel::makeCluster(4L, setup_strategy = "sequential")

## Cusomize each worker
proxies <- c("A", "B", "C", "D") # Actual proxy servers are specified here
stopifnot(length(proxies) == length(cl))  ## Implicit assumption
dummy <- parallel::clusterApply(cl, proxies, function(x) {
  config <- list(proxy=x)                                    
  attach(list(config=config), name="worker-config")
})

## Use these cluster workers for futures
future::plan(future::cluster, workers = cl)

## The foreach to parallelize with futures
library(doFuture)
registerDoFuture()

URLS_to_scrape <- 1:8
res <- foreach(i = seq_along(URLS_to_scrape)) %dopar% {
  # Do the scraping here
  data.frame(i=i, proxy=config$proxy)
}
res <- do.call(rbind, res)

Note how there is no persistent=TRUE above.

... (That's the key behaviour I need in this use case: each future should have a different proxy server specified.)

Just to be 100% clear here; Because of chunking, one future may process multiple foreach iterations, as the above res illustrates. I assume you meant to say: "each future worker should have a different proxy server specified."

dwachsmuth commented 4 years ago

Thanks so much for the response. I think you've nailed it here; you're right that I meant that each worker should have a different proxy server specified, and your suggested rewrite works perfectly! I will remove the persistent = TRUE argument in my next update.

HenrikBengtsson commented 4 years ago

HEADS UP: The next step on this issue will be to deprecate local = FALSE except when persistent = TRUE. This will help phase out the local argument without breaking the current use of persistent = TRUE of 'cluster' futures. The deprecation of argument local will probably start by:

if (!local) {
  .Deprecated(msg = "Using 'local = FALSE' is deprecated and will soon be defunct and produce an error")
}

for all cases where persistent is not used or it is FALSE.

Deprecation of persistent = TRUE will come at a later stage when I/we know all needs can be met.

HenrikBengtsson commented 4 years ago

I've deprecated local = FALSE in the next release (commit 549ac88). The only allowed exception for this is when using persistent=TRUE with cluster futures. (The latter will eventually be deprecated as well, but it'll take a bit longer)

HenrikBengtsson commented 4 years ago

I'm closing this one; local = FALSE is now deprecated as far as possible. The remaining part of deprecating persistent = TRUE is tracked by Issue #433.