HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

About single-worker equivalence to sequential #300

Closed: Enchufa2 closed this issue 3 years ago

Enchufa2 commented 5 years ago

I'm not sure whether this is a design decision or a limitation, but I find it quite odd. Basically, workers=1 does not set up any worker, while workers=2 sets up 2 workers. This is, at the very least, inconsistent and confusing.

If there is no underlying limitation, I would like to propose a change, because I think there are important use cases for having a single worker (apart from the main process).

Consider an online API with rate limits. Say that we are downloading information from that API using methods A and B, with different rate limits, where A provides some info to use with B. The most straightforward way to implement this would be to manage method A in the main process, taking into account the rate limits for A, and to send jobs to a single worker that would manage method B and its rate limits. Note that, with several workers, there may be race conditions, and things get quite a bit more complicated if we want to avoid crashes.
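A rough sketch of the pattern I have in mind, with placeholder get_a() and get_b() helpers standing in for the two API methods; the plan() line expresses what I would like workers = 1 to mean (today it silently falls back to sequential, so this sketch just blocks on each job):

library(future)
plan(multisession, workers = 1)   # what I would like: exactly one background worker

# Placeholder stand-ins for the two rate-limited API methods.
get_a <- function() paste0("item-", sample(1000, 1))   # fast; handled in the main process
get_b <- function(x) { Sys.sleep(0.5); toupper(x) }     # slow; handed off to the worker

results <- list()
fb <- NULL                       # the single outstanding future for method B
for (i in 1:5) {
  a <- get_a()                   # method A in the main process
  Sys.sleep(0.1)                 # stand-in for method A's rate limit
  if (!is.null(fb))
    results[[length(results) + 1]] <- value(fb)   # collect the previous B result
  fb <- future(get_b(a))         # hand the next B job to the (desired) single worker
}
results[[length(results) + 1]] <- value(fb)       # collect the last B result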

renkun-ken commented 5 years ago

I think this is odd too. My use case is that the master process is doing heavy computation and also generating future tasks to run in parallel. The tasks may take a lot of memory, so only a very limited number of parallel tasks can run at the same time. If the server load is high, the 3 workers apart from the master process may need to be reduced to 1 worker apart from the master. But under this design choice, it seems it's not possible to have only 1 worker running in parallel.
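Just to make it concrete (illustrative numbers only), I would like to be able to write something like this, where the number of background workers is derived from the current load and may well drop to 1:

library(future)
# Pick the number of background workers from the current load (illustrative heuristic).
n <- max(1L, availableCores() - 2L)
plan(multisession, workers = n)   # with the current behavior, n == 1 silently falls back to sequential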

HenrikBengtsson commented 5 years ago

The fallback to sequential processing for multicore and multisession futures when workers = 1 stems from how parallel::mclapply(..., mc.cores = 1) behaves. The rationale, which I think is also the rationale behind the mclapply() behavior, is that you will get the same result with less overhead if you evaluate an R expression in the main R process rather than spawning it off to a single background R process.

One could imagine an option that would allow a single background worker, e.g. plan(multicore, workers = 1L, fallback_to_sequential = FALSE) [I couldn't be bothered to come up with a better name right now].

Note that, for multisession futures, you can already achieve this by using:

plan(cluster, workers = rep("localhost", times = 1L))

because multisession is essentially a set of "localhost" 'cluster' workers.
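For instance, a quick sketch to check that a future is then evaluated asynchronously in that single background session, while the main R process stays free:

library(future)
plan(cluster, workers = rep("localhost", times = 1L))

f <- future({ Sys.sleep(2); "done" })   # evaluated in the single background session
resolved(f)   # typically FALSE right away - the main process was not blocked
value(f)      # blocks here until the worker is done, then returns "done"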

Now, to the complicated part. Your implementation/usage of futures is making assumptions about futures always being resolved asynchronously, e.g. would your code work if plan(sequential) is set? The objective of the Future API is that code should be able to run anywhere, regardless of which future backend fulfills the Future API. In other words, it's against the design philosophy to implement future code that only works on some backends. (I understand that this might happen anyway, because of features still lacking in the Future API or because it's just easier to use futures than other parallel backends, but I recommend really trying hard to make sure code works across all backends.)
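As a minimal illustration (with a made-up slow_square() function), the same future code should give identical results under any plan - only the timing differs:

library(future)

slow_square <- function(x) { Sys.sleep(1); x^2 }

run <- function() {
  fs <- lapply(1:3, function(i) future(slow_square(i)))
  vapply(fs, value, numeric(1))
}

plan(sequential);   run()   # each future is resolved immediately in the main process
plan(multisession); run()   # futures are resolved in background R sessions; same result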

The current Future API does not provide promises about asynchronous processing across all possible (including yet-to-be-written) future backends. There was a discussion about this in Issue #109 (Eager / lazy futures and synchronous / asynchronous futures). We concluded that the decision on whether a future should be resolved eagerly or lazily is something the developer should be in control of and that the end user cannot control - if they were able to, some code might fail. It could be that there is a need for asynchronous = TRUE as well (your use cases may suggest this), whereas we currently have something like asynchronous = NA, corresponding to "whatever the backend decides". Adding an asynchronous = TRUE/FALSE/NA property to the Future API is potentially a major design change that needs to be considered and studied very carefully before making a decision. Feel free to continue this discussion in Issue #109.

Enchufa2 commented 5 years ago

Thanks for your detailed reply. Several comments:

The fallback to sequential processing for multicore and multisession futures when workers = 1 stems from how parallel::mclapply(..., mc.cores = 1) behaves. The rationale, which I think is also the rationale behind the mclapply() behavior, is that you will get the same result with less overhead if you evaluate an R expression in the main R process rather than spawning it off to a single background R process.

I understand the rationale behind mclapply, because it blocks until the computation is done (so the user doesn't notice, and doesn't care, whether there was a separate worker or not). But I think the rationale should be different, and in accordance with intuition, in a non-blocking case.

Note that, for multisession futures, you can already achieve this by using:

plan(cluster, workers = rep("localhost", times = 1L))

because multisession is essentially a set of "localhost" 'cluster' workers.

That's great, thanks, I didn't know that.

Now, to the complicated part. Your implementation/usage of futures is making assumptions about futures always being resolved asynchronously, e.g. would your code work if plan(sequential) is set?

My usage does not make such an assumption. In fact, it is working sequentially right now because I didn't know how to set up a single multisession worker.

The current Future API does not provide promises about asynchronous processing across all possible (and future existing) future backends. [...] Feel free to continue this discussion in Issue #109.

Thanks for the pointer, I'll take a look.

HenrikBengtsson commented 4 years ago

In case someone runs into this thread: it is possible to create a single background worker by using:

plan(cluster, workers = 1)

which is equivalent to:

plan(cluster, workers = "localhost")

For example,

> library(future)
> plan(cluster, workers = 1)
> nbrOfWorkers()
[1] 1

> Sys.getpid()
[1] 20683

> value(future(Sys.getpid()))
[1] 20802
> value(future(Sys.getpid()))
[1] 20802

HenrikBengtsson commented 3 years ago

Closing given that my most recent comment gives a solution. If this is not sufficient, please comment.

HenrikBengtsson commented 1 year ago

Another update: Since future 1.27.0 (2022-07-22), one can force a single background worker for 'multicore' and 'multisession' by using workers = I(1), e.g.

plan(multicore, workers = I(1))
plan(multisession, workers = I(1))
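
A quick way to double check (analogous to the cluster example above) that this really gives one background R session:

library(future)
plan(multisession, workers = I(1))
nbrOfWorkers()                  # should report 1
value(future(Sys.getpid()))     # a PID different from Sys.getpid() of the main session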