futureverse / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org
954 stars 85 forks source link

Worker selection #301

Open Enchufa2 opened 5 years ago

Enchufa2 commented 5 years ago

AFAIK, this is not possible right now: it would be great to be able to specify a particular worker for a particular future. For instance, something like

f <- future(some_computation, worker=worker_id)

Consider, e.g., an extension of the use case presented in #300, in which we are exploiting an online API and we want to set up n workers to handle n distinct API methods and their corresponding rate limitations.

HenrikBengtsson commented 5 years ago

Related to my https://github.com/HenrikBengtsson/future/issues/300#issuecomment-483839340 comment, and somewhat rethorical/clarification counter questions: How should:

f <- future(some_computation, worker=worker_id)

behave when plan(sequential), plan(remote, worker = "my_remote_machine_on_the_north_pole"), or plan(future.batchtools::batchtools_sge) is set?

The idea of using futures is not to make assumption about specific workers and where a future expression is resolved*. It sounds more like you want to use the parallel PSOCK API, e.g.

library(parallel)
cl <- makeCluster(3L)
worker_id <- 2L
res <- clusterEvalQ(cl[worker_id], some_computation)

You can also build a wrapper on top of the future framework, e.g. something like:

future_on_worker <- function(..., cluster_node) {
  oplan <- plan("list")
  on.exit(plan(oplan))
  plan(cluster, worker = cluster_node)
  future(...)
}

cl <- makeCluster(3L)
f <- future_on_worker(some_computation, cluster_node = cl[worker_id])

(*) There are ideas about being able to specify constraints/resources that a worker needs to support in order for it to evaluate the future expression, see for instance https://github.com/HenrikBengtsson/future/issues/181

Enchufa2 commented 5 years ago

Related to my #300 (comment) comment, and somewhat rethorical/clarification counter questions: How should:

f <- future(some_computation, worker=worker_id)

behave when plan(sequential), plan(remote, worker = "my_remote_machine_on_the_north_pole"), or plan(future.batchtools::batchtools_sge) is set?

Just ignore it, as plan ignores workers=1? :) I really didn't think a lot about the interface. Maybe this could be achieved via the evaluator function?

The idea of using futures is not to make assumption about specific workers and where a future expression is resolved*. It sounds more like you want to use the parallel PSOCK API, e.g.

library(parallel)
cl <- makeCluster(3L)
worker_id <- 2L
res <- clusterEvalQ(cl[worker_id], some_computation)

That call blocks, so it's not what I'm looking for.

(*) There are ideas about being able to specify constraints/resources that a worker needs to support in order for it to evaluate the future expression, see for instance #181

One could say the same for your proposal there: what happens if plan(sequential) and your local machine does not fulfil the requirements? You could see this as an extension of such proposal: the requirement is that there is a worker and it has a property (some ID or index or whatever).

HenrikBengtsson commented 5 years ago

(oops, I somehow started to reply by editing your comment; I hope I undid everything).

Just ignore it, as plan ignores workers=1? :) I really didn't think a lot about the interface. Maybe this could be achieved via the evaluator function?

Yeah, but the plan() function is "special" and should be considered to belong to the end user and should not really be set by the developer (... if done, it should be done atomically such that when the function returns the set plan(s) should be identical to what they where before).

Please don't use the evaluator argument - it's an internal argument. (I should really hide this argument and ideally make it an error if used - I've seen others mention it too in some discussions).

One could say the same for your proposal there: what happens if plan(sequential) and your local machine does not fulfil the requirements? You could see this as an extension of such proposal: the requirement is that there is a worker and it has a property (some ID or index or whatever).

Exactly. Some type of well-defined generic API allowing the future frontend and the future backend to communicate needs could probably cover this use case. It could be that your local machine does not support the future's needs, e.g. f <- future(plot_image(...), resource = list(required_file("very/remote/telescope/huge/raw_data/black_hole.hdf5"))). In addition to Issue #181, this is only mentioned in #172. This is not an easy problem attack, so this one will take quite a while to solve - the initial goal is to identify the absolute minimum core API/protocol for this.

Enchufa2 commented 5 years ago

I was thinking... Another angle for this issue, that you may be more comfortable with, is that a future may have a sequential identifier or batch identifier, that is: all the futures with the same identifier are guaranteed to be executed sequentially (because they have a critical section, as in my case, or because they use too much memory). If plan(sequential), nothing needs to be done. Otherwise, it means that futures with the same identifier need to be enqueued for the same worker.

Advantages: it's clear what happens idependently of the backend (it's a property of the future, and no assumption needs to be done about the backend), and it's much easier than, and orthogonal to, the constraints/resources API.

Enchufa2 commented 5 years ago

@HenrikBengtsson I was wondering what's your take on this reformulation. We may want to change the issue's title and aim.