Closed mllg closed 5 years ago
Sounds reasonable. This feature/idea sounds vaguely familiar from some discussion in the past. What needs to be decided is the API - need to be careful not to bloat it as more and more features are added.
Maybe a `future.order` option that defaults to (effectively) `seq_along(X)`, but could also take any index vector, `future.order = "random"`, or a function, e.g. `future.order = function(X) sample.int(length(X))`, for maximum control.

An alternative is to avoid adding a new argument and instead provide full control over chunking, including ordering, via the existing `future.scheduling` argument, e.g. `future.scheduling = function(X) { <randomize idxs <- seq_along(X) and partition into a list of nbrOfWorkers() chunks of equal size> }`.
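To make the idea concrete, such a scheduling function might look like the following. This is a hypothetical sketch, not an existing API: `randomized_chunks` is an invented name, and the worker count (which `nbrOfWorkers()` from the future package would normally supply) is passed in explicitly to keep the example self-contained.

```r
## Hypothetical sketch of a scheduling function as proposed above: it
## randomizes the element indices and partitions them into one chunk per
## worker.  'nworkers' stands in for nbrOfWorkers().
randomized_chunks <- function(X, nworkers = 4L) {
  idxs <- sample.int(length(X))                 # randomize processing order
  split(idxs, cut(seq_along(idxs), nworkers))   # roughly equal-sized chunks
}

set.seed(42)
chunks <- randomized_chunks(letters, nworkers = 4L)
length(chunks)        # 4 chunks
sort(unlist(chunks))  # every index 1..26 appears exactly once
```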
For my application a simple randomization would be sufficient. I'm not sure I understood your `future.scheduling` function. What is the input `X` here?
Btw: there are some simple scheduling helpers in batchtools: `chunk`, `binpack`, and `lpt`. The latter two expect `weights` (e.g., estimated runtimes) as input.
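For illustration, the longest-processing-time heuristic that `batchtools::lpt()` is named after can be sketched in a few lines of plain R. This is a simplified re-implementation for exposition (`lpt_sketch` is an invented name), not the batchtools code itself:

```r
## Greedy LPT heuristic: visit jobs in decreasing weight order and assign
## each one to the currently least-loaded chunk.  Simplified illustration,
## not the actual batchtools implementation.
lpt_sketch <- function(weights, n.chunks) {
  load <- numeric(n.chunks)            # current total weight per chunk
  chunk <- integer(length(weights))    # chunk id assigned to each job
  for (i in order(weights, decreasing = TRUE)) {
    j <- which.min(load)               # least-loaded chunk so far
    chunk[i] <- j
    load[j] <- load[j] + weights[i]
  }
  chunk
}

w <- c(7, 5, 4, 3, 1)                  # e.g. estimated runtimes
lpt_sketch(w, n.chunks = 2)            # → 1 2 2 1 2 (chunk loads: 10 and 10)
```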
> What is the input `X` here?

That would be the `X` in `future_lapply(X, ...)`, such that you have the option to model the scheduling based on the input data in a single call.
> That would be the `X` in `future_lapply(X, ...)` such that you have the option to model the scheduling based on the input data in a single call.
Okay, this would work. Regarding the choice of API: I guess randomization with `future.scheduling` would look rather complicated, but I like the flexibility OTOH. Your choice :smile:
Some additional attempts to reuse an existing argument and thereby avoid introducing yet another one:

```r
future.scheduling = function(X) { ... return partitioning ... }
future.scheduling = structure(TRUE, order = "random")
future.scheduling = structure(TRUE, order = function(X) sample.int(length(X)))
future.scheduling = structure(2.0, order = "random")
future.chunk.size = structure(4, order = "random")
```

The above is in line with upcoming support for `globals = structure(TRUE, add = more_globals)` (https://github.com/HenrikBengtsson/future/issues/227).
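This pattern works because any R value can carry arbitrary attributes, so the base value and the extra ordering information travel together in a single argument. A generic illustration of reading them back (plain base R, not future.apply internals):

```r
## Generic illustration of the structure() pattern: attach an attribute to
## a scalar, then read the base value and the attribute back separately.
scheduling <- structure(TRUE, order = "random")

attr(scheduling, "order")   # "random"
as.logical(scheduling)      # TRUE - the base value is untouched
```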
Then one could also use formulas:

```r
future.scheduling = TRUE ~ random
future.scheduling = TRUE ~ order:random
future.scheduling = 2.0 ~ random   # (or?)
future.chunk.size = 4 ~ random
```
Although it simplifies the notation for the user, I'm hesitant to jump on the formula approach because it requires lots of serious thought about exactly what the notation should be - there are so many alternatives.

I'd definitely prefer the first approach using `structure()`.
Have you considered introducing a control object (in the fashion of passing an `rpart.control` object to `rpart()`)? A `future.control` object could bundle all arguments. It would be good for the overview (also in the documentation), and you would avoid inconsistencies like https://github.com/HenrikBengtsson/future.apply/issues/26.
> Have you considered introducing a control object (in the fashion of passing an `rpart.control` object to `rpart()`)? A `future.control` object could bundle all arguments. It would be good for the overview (also in the documentation), and you would avoid inconsistencies like https://github.com/HenrikBengtsson/future.apply/issues/26.
Moved this part to Issue #27
I've added the below to the develop version:
Attribute `ordering` of `future.chunk.size` or `future.scheduling` can be used to control the order in which the elements are iterated over. It only affects the processing order, not the order in which values are returned. This attribute can take the following values:

* an index vector of length `length(X)`
* a function that is called as `ordering(length(X))` and which must return an index vector of length `length(X)`, e.g. `function(n) rev(seq_len(n))` for reverse ordering
* `"random"` - this will randomize the ordering via a random index vector `sample.int(length(X))`

For example,
```r
future.chunk.size = structure(4L, ordering = rev(seq_along(X)))
future.chunk.size = structure(4L, ordering = function(nX) sample.int(nX))
future.scheduling = structure(TRUE, ordering = "random")
```
Note that I did not add support for `function(X) ...` - only `function(nX)`, where `nX = length(X)`. This was mainly done to keep it simple and consistent between `future_lapply()` and `future_mapply()`. You can still make the ordering depend on the input data (not just its length) by passing a plain index vector, or by having your ordering function depend on the input data via a global variable, e.g. `function(nX) rev(seq_along(X))`.
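The key property - the ordering affects only the processing order, never the order of the returned values - can be mimicked in plain sequential R. This is a conceptual sketch with an invented helper name, not the future.apply internals:

```r
## Conceptual sketch: process elements in a custom order, then restore the
## original order before returning, mirroring the 'ordering' semantics.
lapply_with_ordering <- function(X, FUN, ordering = seq_along(X)) {
  idxs <- if (is.function(ordering)) ordering(length(X)) else ordering
  res <- lapply(X[idxs], FUN)   # iterate in the requested order
  res[order(idxs)]              # restore the original element order
}

r1 <- lapply_with_ordering(1:5, function(x) x^2)   # default ordering
set.seed(1)
r2 <- lapply_with_ordering(1:5, function(x) x^2,
                           ordering = function(n) sample.int(n))
identical(r1, r2)   # TRUE: the results are order-independent
```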
Install with:

```r
remotes::install_github("HenrikBengtsson/future.apply@develop")
```
FYI, future.apply 1.1.0 implementing the above is now on CRAN.
Thanks!
Please correct me if I am wrong, but with the current scheduling, jobs are chunked according to their sequential order. However, the input vectors often have some structure, such that the expensive operations are either the first or the last iterations. In that case, all expensive operations are executed on the same worker while the other workers idle.

It would be great to have an option to randomize the execution order, which should improve the average-case execution time in these situations.
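The imbalance described above is easy to simulate: with runtimes that grow with the element index and sequential chunking, the last worker receives all the expensive iterations, whereas a randomized order spreads them out. A self-contained sketch (the runtimes and worker count are made-up illustration values):

```r
## Simulate the imbalance: 100 jobs whose runtimes grow with their index,
## split sequentially over 4 workers vs. in randomized order.
runtimes <- seq_len(100)           # expensive jobs come last
chunk_of <- rep(1:4, each = 25)    # sequential chunking: 4 workers x 25 jobs

## Makespan = load of the busiest worker
seq_makespan <- max(tapply(runtimes, chunk_of, sum))
seq_makespan   # 2200: worker 4 gets jobs 76..100

set.seed(42)
rnd_makespan <- max(tapply(runtimes[sample.int(100)], chunk_of, sum))
rnd_makespan   # much closer to the ideal sum(runtimes) / 4 = 1262.5
```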