Open dipterix opened 4 years ago
This might be a dumb question, ...
You just broke Rule 1: There are no dumb questions. period. ;)
It's tempting to do something like:
> library(future)
> parent_pid <- Sys.getpid()
> f <- future({ is_child <- (Sys.getpid() != parent_pid); is_child })
> value(f)
[1] FALSE
> plan(multisession, workers = 2)
> f <- future({ is_child <- (Sys.getpid() != parent_pid); is_child })
> value(f)
[1] TRUE
Unfortunately, that's not reliable across all future backends, e.g. if the worker is running on another machine you might end up with identical PIDs by chance. When running in Docker containers on the same machine, you might see that all PIDs == 1.
A safer approach is something like:
> library(future)
> parent_uuid <- future:::session_uuid()
> f <- future({ is_child <- (future:::session_uuid() != parent_uuid); is_child })
> value(f)
[1] FALSE
> plan(multisession, workers = 2)
> f <- future({ is_child <- (future:::session_uuid() != parent_uuid); is_child })
> value(f)
[1] TRUE
Now, future:::session_uuid() is not part of the public API, so use it with great care, if at all; I won't promise it will remain. I started a discussion at https://stat.ethz.ch/pipermail/r-devel/2019-May/077831.html in an attempt to get this into base R, because I believe it would be valuable outside of the future framework as well.
Having said all this, more importantly, you ideally want your future code to work regardless of what backend you use, i.e. it should work in sequential processing equally well as when parallelizing on machines on the other side of the world. So, the real answer to your question depends on what your real use case is. What are you trying to solve/avoid by knowing this?
I'm using Sys.getpid() as a temporary solution. As you have mentioned, this is not very stable/consistent.
Right now I'm developing a package that requires intensive file I/O. The data requires one or more dedicated sessions to process, clean, and reshape it. The main session is not blocked.
The catch is: if the function requiring intensive I/O is called in the main session, it just pushes the work to a queue and lets idle sessions handle it. If the function is called in a slave node, it runs normally.
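The dispatch rule described above could be sketched roughly as follows. This is only an illustration: run_or_enqueue, task_queue, and the in_worker flag are hypothetical names, and how in_worker gets determined (for example, by the session UUID comparison shown earlier in this thread) is left to the caller.

```r
# Hypothetical in-memory queue holding tasks submitted from the main session.
task_queue <- new.env()
task_queue$items <- list()

# Run the task right away when called on a worker; otherwise enqueue it so
# that an idle dedicated session can pick it up later.
# 'in_worker' is an assumption: the caller decides how to detect this.
run_or_enqueue <- function(task, in_worker) {
  if (in_worker) {
    return(task())
  }
  task_queue$items[[length(task_queue$items) + 1L]] <- task
  invisible(NULL)
}
```

Passing the flag explicitly keeps the function itself independent of the backend in use.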
I know that in the future package, all futures run in slave sessions are sequential, but some parameter settings differ, and the easiest way to tell the sessions apart is to have a UUID.
Another question: are there any suggestions for detaching some nodes from future for dedicated use? Let's say you plan a multisession with 8 workers; then the following two things might happen:
If future is called more than 8 times and none of them is resolved, the main session will get blocked.
My question is, can I detach the 8 futures in step 1 and manually control them? In this case, I will control these nodes, and future removes them from its list without sending shut-down signals and returns the connections?
I can solve the latter question using the parallel package. I can make clusters and manage them in private, without letting future know about them. Just not sure if future will accidentally shut these nodes down.
I see.
My question is, can I detach the 8 futures in step 1 and manually control them? In this case, I will control these nodes and future remove them from its list without sending shut-down signals, and return the connections? I can solve the latter question using parallel package. I can make clusters and manage them in private, without letting future know about them. Just not sure if future will accidentally shut these nodes down.
Correct, just set up a separate cluster, e.g.
cl <- parallel::makeCluster(8) ## ... or cl <- future::makeClusterPSOCK(8)
and use it however you want. As long as you do not tell future to use that cluster, or parts of it, e.g.
plan(cluster, workers = cl)
plan(cluster, workers = cl[1:3])
the future framework will not use or touch its workers. It will never shut them down. The only time the future framework shuts down workers automatically is when it creates the cluster itself, e.g.
plan(multisession, workers = 8)
So, not when you use:
cl <- future::makeClusterPSOCK(8)
plan(cluster, workers = cl)
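Since a cluster you create yourself is never shut down by the future framework, you are the one who has to stop it. A minimal sketch of that lifecycle, using only the parallel package and an arbitrary toy job as a stand-in for real work:

```r
library(parallel)

# A private cluster that is never handed to future::plan().
cl <- makeCluster(2)

# Run a job on the dedicated workers (toy example: square some numbers).
squares <- parLapply(cl, 1:4, function(x) x^2)

# The owner of the cluster is responsible for shutting the workers down.
stopCluster(cl)
```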
I see. So if I use your package like this, I'm actually managing the cluster cl by myself without registering cl with the future plan, right? I still want to use the future functions, they are so convenient. Just not sure if I'm doing something wrong.
cl <- future::makeClusterPSOCK(2)
f <- future::run(
  future::MultisessionFuture({
    Sys.getpid()
  }, substitute = TRUE, local = TRUE, workers = cl, lazy = FALSE,
  globals = TRUE, persistent = FALSE, gc = FALSE, earlySignal = FALSE,
  label = 'Dedicated')
)
future::resolved(f)
res <- future::result(f)
res$value
Thanks :)
This might be a dumb question, but is there a way my functions can detect whether they are running in the slave nodes, so that they can behave differently according to the situation?