Question: best way to prioritise cluster local vars if they exist

arunsrinivasan commented 5 years ago

I'm not sure whether my use case falls within the scope of future and future.apply, or whether I've simply missed how to go about it, but here goes.

Suppose I've a variable sitting in the global environment:

# Windows machine with 48 nodes running Windows Server 2012
set.seed(1L)
x <- runif(1e7) # > 50MB

This object is not too large, but it takes significant time when I refer to it in computations spread over, say, 30 nodes:

require(future)
require(future.apply)
plan(multisession, workers=30L)
system.time(future_lapply(1:30, function(i) x[1]))
#    user  system elapsed 
#    3.63    2.26   15.32

It takes considerable time because x is copied to every node, which makes sense since that is the default behaviour. But suppose I did this instead:

# restart session
require(future)
require(future.apply)
require(parallel)
set.seed(1L)
x <- runif(1e7)
cl <- makeClusterPSOCK(30L)
plan(cluster, workers=cl, persistent=TRUE)
clusterExport(cl, "x") # takes noticeable time, as expected

Now I've loaded 'x' onto all the cluster nodes, so that I can use the copies there instead:

system.time(future_lapply(1:30, function(i) x[1L], future.globals=FALSE))
#    user  system elapsed 
#    0.19    0.03    0.22

The point is that, if we are going to use objects of considerable size repeatedly, then it makes sense to load them once onto all the nodes and use those copies instead.

However, the above does not work if the same function also refers to another global:

a <- 2L
system.time(future_lapply(1:30, function(i) x[1L]*a, future.globals=FALSE))
# Error in ...future.FUN(...future.X_jj, ...) : object 'a' not found

Issue 1:

To get this to work, I'd then have to do:

system.time(future_lapply(1:30, function(i) x[1]*a, future.globals=structure(TRUE, ignore="x")))
#    user  system elapsed 
#    3.33    2.13   21.34

which I thought would be quick, as it should ignore 'x' in the global environment. But 'x' still seems to be exported. Maybe my understanding of what 'ignore' is supposed to do isn't correct?

Issue 2:

Assuming my understanding is correct, suppose I have 10 such objects that I use continuously; then I have to keep ignoring them in every function call.

I thought that setting options(future.globals.method="conservative") would solve my issue, but I think it requires the variable to be assigned within the function call for it not to be picked up from the global environment, as shown in the example in ?globalsOf. I'm not sure about this either, but in any case it doesn't bring the timings down.

So my question is: if you have a considerably sized object that needs to be used repeatedly every day, say almost every 5 minutes, what's the best approach to avoid exporting it as a global to the nodes every time? Would it make sense to have an option such as global.options.method="priority_local", where a variable that already exists on the nodes is used directly and the global var is ignored?
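One possible user-side workaround can be sketched as follows. It is not part of future or future.apply: it assumes a PSOCK cluster like the one above, uses only the base parallel package, and the helper names exists_in_global() and vars_on_all_workers() are made up for illustration. It asks every worker which candidate variables it already holds and builds the ignore attribute from the answer. Note that it only checks that a variable of that name exists on the workers, not that its value matches the master's copy.

```r
library(parallel)

# Hypothetical worker-side helper: for each name in `vars`, is there a
# variable of that name in the worker's global environment?
exists_in_global <- function(vars) {
  vapply(vars, exists, logical(1), envir = globalenv(), inherits = FALSE)
}

# Hypothetical helper: the subset of `vars` present on every worker in `cl`.
vars_on_all_workers <- function(cl, vars) {
  present <- clusterCall(cl, exists_in_global, vars)
  vars[Reduce(`&`, present)]
}

cl <- makeCluster(2L)
x <- runif(10)
clusterExport(cl, "x")   # 'x' now lives on the workers
a <- 2L                  # 'a' exists on the master only

# Only names found on all workers end up in the ignore attribute
ignore <- vars_on_all_workers(cl, c("x", "a"))
globals <- structure(TRUE, ignore = ignore)

stopCluster(cl)
```

The resulting value could then be passed as future.globals in future_lapply(), in place of writing structure(TRUE, ignore="x") by hand in every call.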

arunsrinivasan commented 5 years ago

Update: using lapply together with future handles 'ignore' correctly.

require(future)
require(parallel)
cl <- makeClusterPSOCK(30)
plan(cluster, workers=cl, persistent=TRUE)
set.seed(1L)
x <- runif(1e7)
clusterExport(cl, "x")

system.time(ans1 <- values(lapply(1:30, function(i) future(x[1L]))))
#   user  system elapsed 
#   3.19    3.53   15.21 
system.time(ans2 <- values(lapply(1:30, function(i) future(x[1L], globals=structure(TRUE, ignore="x")))))
#   user  system elapsed 
#   0.18    0.00    0.19 

identical(ans1, ans2)
# [1] TRUE

My guess therefore is that issue 1 seems to be a bug in future_lapply's handling of future.globals argument.

HenrikBengtsson commented 5 years ago

Thanks for this. You're correct: the add and ignore attributes of argument future.globals are lost. For TRUE it happens here:

https://github.com/HenrikBengtsson/future.apply/blob/e5a4fdc9bffc778ea1b4651cc2cf323b8cd37654/R/globals.R#L18

and for FALSE they're ignored here:

https://github.com/HenrikBengtsson/future.apply/blob/e5a4fdc9bffc778ea1b4651cc2cf323b8cd37654/R/globals.R#L30-L31

This looks easy to fix - I'll add some unit tests confirming the current bug(s) and then I'll fix this for next release.

HenrikBengtsson commented 5 years ago

Fixed in the develop branch. Install via:

remotes::install_github("HenrikBengtsson/future.apply@develop")

Example:

library(future.apply)

cl <- makeClusterPSOCK(1L)
plan(cluster, workers = cl, persistent = TRUE)

## Use this `x` on worker(s)
x <- 1
parallel::clusterExport(cl, "x")

## Use this as a global
a <- 42

y <- future_lapply(1:2, function(i) x*a, future.globals=structure(TRUE, ignore="x"))
str(y)
# List of 2
#  $ : num 42
#  $ : num 42

## But do NOT use this `x`; it is ignored, so the workers keep using their own x = 1
x <- 2
y <- future_lapply(1:2, function(i) x*a, future.globals=structure(TRUE, ignore="x"))
str(y)
# List of 2
#  $ : num 42
#  $ : num 42

arunsrinivasan commented 5 years ago

Thanks for the fix. I can confirm it runs in 0.2s.

Regarding the other issue, I still feel that having to write 'ignore' in every function call could be avoided by having an option that prioritises vars on the local nodes (look in the local node first and then the master node, or simply error if the var is not found in the local node).

HenrikBengtsson commented 5 years ago

Your use case touches on a much bigger design feature request (that I silently swept under the rug above). Support for worker-specific globals/constants falls under https://github.com/HenrikBengtsson/future/issues/172, e.g.:

Optional Future API

Persistent workers, i.e. a future can change the state of an underlying worker that a following future can utilize.

this can be for efficiency, e.g. futures that share the same global variables may resolve faster if they are resolved by the same worker (this can be optional, i.e. export global if not already available; think memoization)

a future preserves a value for a downstream future (not sure if this fits into the concept of futures, but I'll add it here in case someone has thoughts about this)

It wouldn't be "too hard" to implement something specific for cluster backends. However, the main issue is the design of an API that will allow your future code to also work on backends that do not support persistent states (*), e.g. if a worker does not already have a value locally (e.g. verified by checksum), then it will need to be exported. (These ideas also lead into performance optimization based on caching and memoization.)
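To make the "verified by checksum" idea concrete, here is a rough sketch for a plain parallel cluster. It is not how future does or will do it; the helpers object_md5(), worker_md5() and export_if_stale() are hypothetical names, and the MD5 is taken over the serialized bytes using base R only. Byte-level comparison is stricter than identical(), so it may occasionally report a mismatch for equal objects, which only costs an unnecessary export.

```r
library(parallel)

# Hypothetical helper: MD5 checksum of an object's serialized bytes.
object_md5 <- function(obj) {
  f <- tempfile()
  on.exit(unlink(f))
  writeBin(serialize(obj, NULL), f)
  unname(tools::md5sum(f))
}

# Hypothetical worker-side helper: checksum of global `name`, NA if absent.
worker_md5 <- function(name, md5_fun) {
  if (!exists(name, envir = globalenv(), inherits = FALSE)) {
    return(NA_character_)
  }
  md5_fun(get(name, envir = globalenv()))
}

# Export `name` only to workers whose copy is missing or differs.
# Returns the number of workers that received an export.
export_if_stale <- function(cl, name, envir = globalenv()) {
  local_md5 <- object_md5(get(name, envir = envir))
  remote_md5 <- unlist(clusterCall(cl, worker_md5, name, object_md5))
  stale <- is.na(remote_md5) | remote_md5 != local_md5
  if (any(stale)) clusterExport(cl[stale], name, envir = envir)
  sum(stale)
}

cl <- makeCluster(2L)
x <- runif(1e5)
n_first  <- export_if_stale(cl, "x")  # no worker has 'x' yet
n_second <- export_if_stale(cl, "x")  # workers already hold this 'x'
x[1] <- 0
n_third  <- export_if_stale(cl, "x")  # master copy changed
stopCluster(cl)
```

The checksum still has to be computed on the master and on every worker for each check, so this trades data transfer for hashing time; whether that pays off depends on object size and connection speed.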

To further clarify https://github.com/HenrikBengtsson/future/issues/172, the current (Core) Future API is designed to work the same everywhere regardless of the future backend. By introducing additional features - Optional Future API - we introduce the risk that some code will only work on certain backends, which we wish to avoid as far as possible. By receiving feedback and collecting feature requests like yours, while moving carefully, the hope is to slowly add support for new features without breaking existing code/backends.

(*) This actually already becomes a problem when using future.globals=structure(TRUE, ignore="x") as above. FYI, the main purpose of add and ignore is to work around corner cases where the automatic identification of globals produces false negatives or false positives - it was not really intended for persistent globals/constants.

HenrikBengtsson commented 5 years ago

FYI, I've added the following to https://github.com/HenrikBengtsson/future/issues/172:

Persistent workers, i.e. a future can change the state of an underlying worker that a following future can utilize.

efficiency: don't export globals that already exist on worker - requires method for asserting identical(local, remote).

This allows me to close this issue, where the main problem was a bug fixed in the next release.

futureverse / future.apply

Question: best way to prioritise cluster local vars if they exist #37
