HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

How to know if I'm in the child session #347

Open dipterix opened 4 years ago

dipterix commented 4 years ago

This might be a dumb question, but is there a way my functions can detect whether they are running on the slave nodes, so that they can behave differently depending on the situation?

HenrikBengtsson commented 4 years ago

This might be a dumb question, ...

You just broke Rule 1: There are no dumb questions. period. ;)

It's tempting to do something like:

> library(future)
> parent_pid <- Sys.getpid()
> f <- future({ is_child <- (Sys.getpid() != parent_pid); is_child })
> value(f)
[1] FALSE
> plan(multisession, workers = 2)
> f <- future({ is_child <- (Sys.getpid() != parent_pid); is_child })
> value(f)
[1] TRUE

Unfortunately, that's not reliable across all future backends, e.g. if the worker is running on another machine you might end up with identical PIDs by chance. When running in Docker containers on the same machine, you might see that all PIDs == 1.

A safer approach is something like:

> library(future)
> parent_uuid <- future:::session_uuid()
> f <- future({ is_child <- (future:::session_uuid() != parent_uuid); is_child })
> value(f)
[1] FALSE
> plan(multisession, workers = 2)
> f <- future({ is_child <- (future:::session_uuid() != parent_uuid); is_child })
> value(f)
[1] TRUE

Now, future:::session_uuid() is not part of the public API, so use it with great care, if at all; I won't promise it will remain. I started the discussion at https://stat.ethz.ch/pipermail/r-devel/2019-May/077831.html in an attempt to get this into base R, because I believe it would be valuable outside of the future framework as well.
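
As a rough sketch that stays within base R, one could combine the host name with the PID, which at least separates workers on different machines. The helpers `session_id()` and `is_child()` below are hypothetical names, not part of future, and this is still only a heuristic: two containers that share both host name and PID would be indistinguishable.

```r
## Hypothetical helper (not part of future): identify a session by
## host name plus process ID. This is a heuristic, not a true UUID.
session_id <- function() {
  paste(Sys.info()[["nodename"]], Sys.getpid(), sep = ":")
}

## Record the identity of the session that sets things up
parent_id <- session_id()

## TRUE when evaluated in a session whose identity differs from `parent`
is_child <- function(parent) {
  !identical(session_id(), parent)
}

## Evaluated in the same session that created parent_id
is_child(parent_id)
```

Inside a future, the same check would presumably be written as `value(future(is_child(parent_id)))`, relying on future's automatic export of globals.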

Having said all this, more importantly, you ideally want your future code to work regardless of what backend you use, i.e. it should work in sequential processing equally well as when parallelizing on machines on the other side of the world. So, the real answer to your question depends on what your real use case is. What are you trying to solve/avoid by knowing this?

dipterix commented 4 years ago

I'm using Sys.getpid() as a temporary solution. As you mentioned, this is not very stable or consistent. Right now I'm developing a package that requires intensive file I/O. It needs one or more dedicated sessions to process, clean, and reshape the data, while the main session stays unblocked.

The catch is: if the function requiring intensive I/O is called in the main session, it just pushes the task to a queue and lets idle sessions handle it. If the function is called on a slave node, it runs normally.

I know that in the future package, all futures run in slave sessions are sequential, but some parameter settings differ, and the easiest way to tell is to have a UUID.

Another question: are there any suggestions for detaching some nodes from future for dedicated use? Let's say you plan a multisession with 8 workers; then the following two things might happen:

  1. When I call future more than 8 times and none of them is resolved, the main session will get blocked
  2. When I schedule another plan, these non-resolved futures will become unavailable.

My question is, can I detach the 8 futures in step 1 and manually control them? In this case, I would control these nodes, and future would remove them from its list without sending shut-down signals and return the connections.

dipterix commented 4 years ago

I can solve the latter question using the parallel package. I can make clusters and manage them privately, without letting future know about them. I'm just not sure whether future will accidentally shut these nodes down.

HenrikBengtsson commented 4 years ago

I see.

My question is, can I detach the 8 futures in step 1 and manually control them? In this case, I would control these nodes, and future would remove them from its list without sending shut-down signals and return the connections.

I can solve the latter question using the parallel package. I can make clusters and manage them privately, without letting future know about them. I'm just not sure whether future will accidentally shut these nodes down.

Correct, just set up a separate cluster, e.g.

cl <- parallel::makeCluster(8)  ## ... or cl <- future::makeClusterPSOCK(8)

and use it however you want. As long as you do not tell future to use that cluster, or parts of it, e.g.

plan(cluster, workers = cl)
plan(cluster, workers = cl[1:3])

the future framework will not use or touch its workers. It will never shut them down. The only time the future framework shuts down workers automatically is when it creates the cluster itself, e.g.

plan(multisession, workers = 8)

So, not when you use:

cl <- future::makeClusterPSOCK(8)
plan(cluster, workers = cl)
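
For illustration, a minimal sketch of a privately managed cluster using only the parallel package could look like the following. The worker count of 2 and the variable names are arbitrary choices for the example; because future is never told about `cl`, shutting it down is up to you.

```r
library(parallel)

## A private PSOCK cluster of two workers; future is never told about it
cl <- makeCluster(2)

## Run a function once on every worker and collect the results
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

## The workers are separate R processes, so their PIDs differ from ours
main_pid <- Sys.getpid()
all_separate <- !(main_pid %in% worker_pids)

## Since we created the cluster ourselves, stopping it is our job
stopCluster(cl)
```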

dipterix commented 4 years ago

I see. So if I use your package like this, I'm actually managing the cluster cl myself without registering cl with a future plan, right? I still want to use future functions; they are so convenient. I'm just not sure if I'm doing something wrong.

cl <- future::makeClusterPSOCK(2)
f <- future::run(
  future::MultisessionFuture({
    Sys.getpid()
  }, substitute = TRUE, local = TRUE, workers = cl, lazy = FALSE,
  globals = TRUE, persistent = FALSE, gc = FALSE, earlySignal = FALSE,
  label = 'Dedicated')
)

future::resolved(f)
res <- future::result(f)
res$value

Thanks :)