futureverse / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Different plan()s for different futures #181

Open wlandau-lilly opened 6 years ago

wlandau-lilly commented 6 years ago

Related: wlandau-lilly/drake#169. It would be amazing if a single call to future_lapply() could distribute simultaneous futures over a list of alternative pre-built plan()s. I am not quite sure about the interface, but I can picture how 5 futures might run on a local machine and another 5 might simultaneously go to SLURM.

library(future.batchtools)
# Hypothetical interface: 'plans' and 'plan_map' do not exist in
# future_lapply() today; the idea is a named pool of pre-built backends
# plus a map saying which future runs on which backend.
plans <- list(plan1 = plan(multicore), plan2 = plan(batchtools_slurm(...)))
plan_map <- rep(c("plan1", "plan2"), each = 5)
future_lapply(
  X = 1:10,
  FUN = sqrt,
  plans = plans,
  plan_map = plan_map
)

It may seem silly to juggle plans in a single call to future_lapply(), but it would be a huge help for drake.

HenrikBengtsson commented 6 years ago

This is an interesting idea. If I understand you correctly, you're looking for a "super" backend where one can throw in all of the available future backends and treat them as one big pool of compute resources. I'm happy that you've allowed yourself to even consider the possibility of such a setup - I take it as a sign that the Future API has lots of potential, much of it yet to be discovered :)

This is somewhat related to (non-official) ideas I have where the type of future to be used is not fixed when the Future object is created, but when it is launched. If that were in place, one could imagine initiating a set of (lazy) futures that are ready to be launched; only at launch time would the plan() settings come into play. It would then also be possible to switch between backends as futures are launched one after the other.
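For reference, lazy futures already exist via the lazy argument of future(), but the backend is still resolved from the plan() in effect at creation time, not at launch - a minimal sketch:

```r
library(future)

plan(sequential)
# Creation is deferred with lazy = TRUE, but the backend used is still
# the one configured via plan() when the future is created, not a
# backend chosen later at launch time.
f <- future(Sys.getpid(), lazy = TRUE)
# Evaluation happens only when the value is first requested:
value(f)
```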

The closest we get to this today is a parallel-package cluster (plan(cluster, ...)), which I think could consist of a heterogeneous set of local and remote workers of different types (e.g. PSOCK, FORK, ...) - though I don't think many have used them that way. A poor man's version of the above could be to leverage such clusters. The idea would be to create another type of cluster node, e.g. batchtools_node. Such cluster nodes could even be used in calls such as parallel::parLapply(cl, ...). ... and soon we're about to reinvent the original train of thought behind the Future API. A bit inception-ish.
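As a rough illustration of such a heterogeneous cluster (the remote hostname is a placeholder for an SSH-reachable machine; makeClusterPSOCK() is provided by the future package):

```r
library(future)

# Mix local and remote PSOCK workers in a single cluster; replace
# "remote.example.org" with a real host reachable over SSH.
cl <- makeClusterPSOCK(c("localhost", "localhost", "remote.example.org"))
plan(cluster, workers = cl)

# Each future is dispatched to whichever worker becomes free next:
f <- future(Sys.info()[["nodename"]])
value(f)
```

Note that the worker a given future lands on is not controlled by the user here, which is exactly the micromanagement gap discussed below.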

wlandau-lilly commented 6 years ago

Yes, that is the gist. I would like to have several simultaneous plans and send any future to any plan in any order without duplicated overhead. I am a bit concerned about lumping all the plans together in a single overarching batchtools_node() plan or something similar because I want to micromanage which futures go to which plans.

What about evaluators? What is the relationship between evaluators and plans? I tried bypassing future_lapply(), but it is not quite working. Going forward, what would need to happen to extend future_lapply() this way?

HenrikBengtsson commented 6 years ago

I'm quite swamped right now, so I unfortunately don't have much time to dive into your code, but is your goal to be able to distribute work/tasks to different types of compute resources? For instance, some tasks (= futures) you'd like to run on a local machine, some on high-memory machines, and others on a small set of machines that have a certain NFS folder mounted? If so, I'm considering an Extended Future API (https://github.com/HenrikBengtsson/future/issues/172) that will support optional and/or mandatory resource requests in some standardized fashion, e.g.

f <- future({ ... }, requires = c("mount:/data/folder/", "R (>= 3.3.0)"))

and then there will be a generic underlying framework that makes sure the future is launched on a backend worker that meets those requirements. Obviously, there's lots of work to get there. Before that, I am prioritizing formalizing the Core Future API and providing a generic conformance-test framework such that any and all future backends can be validated against this Core Future API. I anticipate that this work will help define and explore what the Extended Future API could look like.

About future_lapply(): Any improvements will be done in a new future.apply package (Issue #159), because it is actually above and beyond what the future package should provide (which is the core Future API). I'm not sure when I'll have time to launch future.apply. It might be that what you're looking for fits in such a package, but it might also be something different, just as the foreach and BiocParallel frameworks differ from future.apply while still being related.

wlandau-lilly commented 6 years ago

Distributing different workers/tasks to different types of compute resources is exactly what I am looking for. I had hoped to accomplish this by assigning different evaluators to different futures. Whether through future_lapply() or not, is this possible given the current state of the future package, or does it need to wait for the possible Extended Future API? Just knowing that much will help me in the short term.

wlandau commented 6 years ago

Update: going forward, I think I will be more focused on individual futures than future_lapply(). I think I will supply non-default values to the evaluator argument of future(). Have you known many people to deploy futures this way? Should the evaluator always be an output object from plan(...)?
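For what it's worth, a sketch of that per-future approach, assuming the evaluator argument of future() accepts any plan function (multisession and sequential here stand in for whichever backends are actually in use):

```r
library(future)

# Dispatch individual futures to different backends by passing a plan
# function as the evaluator, bypassing the global plan():
f1 <- future(sqrt(4), evaluator = multisession)
f2 <- future(sqrt(9), evaluator = sequential)
value(f1)  # 2
value(f2)  # 3
```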