Closed arunsrinivasan closed 5 years ago
Update: using `lapply` together with `future` handles `ignore` correctly.
require(future)
require(parallel)
cl <- makeClusterPSOCK(30)
plan(cluster, workers=cl, persistent=TRUE)
set.seed(1L)
x <- runif(1e7)
clusterExport(cl, "x")
system.time(ans1 <- values(lapply(1:30, function(i) future(x[1L]))))
# user system elapsed
# 3.19 3.53 15.21
system.time(ans2 <- values(lapply(1:30, function(i) future(x[1L], globals=structure(TRUE, ignore="x")))))
# user system elapsed
# 0.18 0.00 0.19
identical(ans1, ans2)
# [1] TRUE
My guess, therefore, is that Issue 1 is a bug in `future_lapply`'s handling of the `future.globals` argument.
Thanks for this. You're correct: the attributes `add` and `ignore` of argument `future.globals` are lost. For `TRUE` it happens here:
and for `FALSE` they're ignored here:
This looks easy to fix - I'll add some unit tests confirming the current bug(s) and then I'll fix this for the next release.
Fixed in the develop branch. Install via:
remotes::install_github("HenrikBengtsson/future.apply@develop")
Example:
library(future.apply)
cl <- makeClusterPSOCK(1L)
plan(cluster, workers = cl, persistent = TRUE)
## Use this `x` on worker(s)
x <- 1
parallel::clusterExport(cl, "x")
## Use this as a global
a <- 42
y <- future_lapply(1:2, function(i) x*a, future.globals=structure(TRUE, ignore="x"))
str(y)
# List of 2
# $ : num 42
# $ : num 42
## But do NOT use this `x`
x <- 2
y <- future_lapply(1:2, function(i) x*a, future.globals=structure(TRUE, ignore="x"))
str(y)
# List of 2
# $ : num 42
# $ : num 42
Thanks for the fix. I can confirm it runs in 0.2s.
Regarding the other issue, I still feel that having to write 'ignore' in every function call could be avoided with an option that prioritises variables on the local nodes (look on the local node first and then on the master node, or simply error if the variable is not found on the local node).
Your use case touches on a much bigger design feature request (that I silently swept under the rug above). Support for worker-specific globals/constants falls under https://github.com/HenrikBengtsson/future/issues/172, e.g.:
Optional Future API
- Persistent workers, i.e. a future can change the state of an underlying worker that a following future can utilize.
  - this can be for efficiency, e.g. futures that share the same global variables may resolve faster if they are resolved by the same worker (this can be optional, i.e. export global if not already available; think memoization)
  - a future preserves a value for a downstream future (not sure if this fits into the concept of futures, but I'll add it here in case someone has thoughts about this)
It wouldn't be "too hard" to implement something specific for cluster backends. However, the main issue is the design of an API that will allow your future code to also work on backends that do not support persistent states (*), e.g. if a worker does not already have a value locally (e.g. verified by checksum), then it will need to be exported. (These ideas also lead into performance optimization based on caching and memoization.)
To further clarify https://github.com/HenrikBengtsson/future/issues/172, the current (Core) Future API is designed to work the same everywhere regardless of future backend. By introducing additional features - Optional Future API - we introduce the risk that some code will only work on certain backends, which we wish to avoid as far as possible. By receiving feedback and collecting feature requests like yours, while moving carefully, the hope is to slowly add support for new features without breaking existing code/backends.
(*) This actually already becomes a problem when using `future.globals=structure(TRUE, ignore="x")` as used above. FYI, the main purpose of `add` and `ignore` is to work around corner cases where the automatic identification of globals produces a false negative or a false positive - it was not really intended for persistent globals/constants.
FYI, I've added the following to https://github.com/HenrikBengtsson/future/issues/172:
Persistent workers, i.e. a future can change the state of an underlying worker that a following future can utilize.
- efficiency: don't export globals that already exist on worker - requires method for asserting `identical(local, remote)`.
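A rough sketch of the "export only if not already available" idea, using just the base `parallel` package. The helper `export_if_missing` is hypothetical and is not part of `future` or `parallel`. It also simplifies heavily: it only checks whether a variable of that name exists on each worker, which is much weaker than asserting `identical(local, remote)`; a real implementation would presumably compare checksums.

```r
library(parallel)

# Hypothetical helper (not part of future or parallel): export `name` only to
# the workers that do not already hold a variable of that name.
# Assumption: presence by name is treated as "already available"; the value
# on the worker is NOT compared with the local value.
export_if_missing <- function(cl, name, envir = parent.frame()) {
  force(envir)
  # Ask every worker whether `name` exists in its own global environment
  has <- unlist(clusterCall(cl, exists, name,
                            envir = globalenv(), inherits = FALSE))
  missing_on <- which(!has)
  if (length(missing_on) > 0L) {
    clusterExport(cl[missing_on], name, envir = envir)
  }
  # Indices of the workers that received an export
  invisible(missing_on)
}

cl <- makePSOCKcluster(2L)

x <- 1
first <- export_if_missing(cl, "x")   # no worker has `x` yet, so both get it

x <- 2
second <- export_if_missing(cl, "x")  # `x` is already there, nothing is sent

# The workers still hold the value from the first export
vals <- unlist(clusterEvalQ(cl, x))

stopCluster(cl)
```

Note that the second call leaves a stale `x` on the workers, which is exactly why a check along the lines of `identical(local, remote)` would be needed before skipping an export.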
This allows me to close this issue, where the main problem was a bug fixed in the next release.
I'm not sure whether my use case falls within the scope of `future` and `future.apply`, or whether I have somehow missed how to go about it, but here goes.
Suppose I have a variable sitting in the global environment:
This object is not too large, but it takes significant time to refer to it when doing some computations on, say, 30 nodes:
It takes considerable time because `x` is copied onto every node. That makes sense, since it is the default behaviour. But suppose I did this instead:
Now I've loaded 'x' onto all the cluster nodes so that I can use it from there.
The point is that, if we are going to use objects of considerable size repetitively, then it makes sense to load them once on to all the nodes and use them instead.
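A minimal sketch of that load-once pattern, using only the base `parallel` package rather than `future`. The helper name `first_of_x`, the two-worker cluster and the smaller vector are illustrative assumptions, not the setup from this issue.

```r
library(parallel)

cl <- makePSOCKcluster(2L)

set.seed(1L)
x <- runif(1e5)

# One-time cost: copy `x` into the global environment of every worker
clusterExport(cl, "x", envir = environment())

# Hypothetical helper. Its enclosing environment is set to the global
# environment, so when it runs on a worker, `x` is looked up in that worker's
# global environment instead of being shipped along with each call.
first_of_x <- function(i) x[1L]
environment(first_of_x) <- globalenv()

res <- parLapply(cl, 1:4, first_of_x)

stopCluster(cl)
```

Each later call only sends the small function and its index, not `x` itself.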
However, the above would not work if I've a global to refer to in the same function as well...
Issue 1:
To get this to work, I'd then have to do:
which I thought should be quick, as it should ignore 'x' in the global environment. But 'x' still seems to be exported. Maybe my understanding of what 'ignore' is supposed to do isn't correct?
Issue 2:
Assuming my understanding is correct, suppose I have 10 such objects that I use continuously; then I have to keep ignoring them in every function call.
I thought that using `options(future.globals.method="conservative")` would solve my issue, but I think it needs the variable to be assigned within the function call for it not to be loaded from the global environment, as shown in the example of `?globalsOf`. I'm not sure about this either, and it doesn't bring the timings down.
So my question is: if you have a considerably sized object that needs to be used repeatedly every day, almost every 5 minutes, say, what's the best approach to avoid loading it as a global onto the nodes every time? Would it help to have something like `global.options.method="priority_local"`, an option where, if the same variable already exists on the nodes, it is used directly and the global variable is ignored?