Add option to randomize order of execution

mllg commented 6 years ago

Please correct me if I am wrong, but with the current scheduling, jobs are chunked depending on their sequential order. However, you often have some structure in the input vectors, so that the expensive operations are either the first or the last iterations. In this case, all expensive operations are executed on the same worker while all other workers are idling.

It would be great to have an option to randomize the execution order, which should reduce the average-case execution time in these situations.
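The imbalance described above can be illustrated with a small base-R simulation. This is only a sketch and not how future.apply chunks internally: the cost vector, the worker count, and the helper `max_worker_load()` are all hypothetical, and chunks are assumed to be contiguous blocks of equal size.

```r
# Hypothetical per-element costs: the last 10 of 100 elements are expensive.
costs <- c(rep(1, 90), rep(50, 10))
n_workers <- 4L  # assumed number of workers

# Assign the elements, in the given order, to contiguous equal-sized chunks
# (one chunk per worker) and return the load of the busiest worker.
max_worker_load <- function(idxs, costs, n_workers) {
  n <- length(idxs)
  chunk_id <- rep(seq_len(n_workers), each = ceiling(n / n_workers), length.out = n)
  max(tapply(costs[idxs], chunk_id, sum))
}

# Sequential order: all expensive elements land in the last chunk.
load_seq <- max_worker_load(seq_along(costs), costs, n_workers)

# Random order: average busiest-worker load over 200 random permutations.
set.seed(42)
load_rnd <- mean(replicate(
  200,
  max_worker_load(sample.int(length(costs)), costs, n_workers)
))
```

With sequential order the last worker carries 515 of the 590 total cost units, whereas the randomized orderings spread the expensive elements across workers on average.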

HenrikBengtsson commented 6 years ago

Sounds reasonable. This feature/idea sounds vaguely familiar from some discussion in the past. What needs to be decided is the API - need to be careful not to bloat it as more and more features are added.

Maybe a future.order option that defaults to (effectively) seq_along(X), but could also take any index vector, future.order = "random", or a function, e.g. future.order = function(X) sample.int(length(X)) for maximum control.

HenrikBengtsson commented 6 years ago

An alternative is to avoid adding a new argument and give full chunking, including ordering, control via the existing future.scheduling argument, e.g. future.scheduling = function(X) { <randomize idxs <- seq_along(X) and partition into a list of nbrOfWorkers() chunks of equal size> }.
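A minimal sketch of what such a scheduling function might compute, assuming it is expected to return a list of index chunks. The helper name `random_partition()` and its `n_workers` argument are hypothetical; in a real setup the worker count could come from nbrOfWorkers() as suggested above.

```r
# Hypothetical helper: shuffle the indices of X and split them into
# `n_workers` chunks of (nearly) equal size.
random_partition <- function(X, n_workers) {
  idxs <- sample.int(length(X))                       # randomized processing order
  chunk_id <- rep(seq_len(n_workers), length.out = length(idxs))
  unname(split(idxs, chunk_id))                       # list of index vectors
}

set.seed(1)
chunks <- random_partition(letters[1:10], n_workers = 3L)
```

Each index appears in exactly one chunk, so the chunks together cover all of X.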

mllg commented 6 years ago

For my application a simple randomization would be sufficient. I'm not sure I understood your future.scheduling function. What is the input X here?

Btw: There are some simple scheduling helpers in batchtools: chunk, binpack and lpt. The latter two expect weights (e.g., estimated runtimes) as input.
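To give an idea of what weight-based scheduling does, here is a base-R sketch of the greedy "longest processing time first" heuristic. It is not the batchtools implementation and does not use its API; the function `lpt_assign()` and the weights are hypothetical.

```r
# Greedy LPT sketch: assign each job, in order of decreasing weight,
# to the currently least-loaded worker. Returns one worker id per job.
lpt_assign <- function(weights, n_workers) {
  loads <- numeric(n_workers)
  worker <- integer(length(weights))
  for (i in order(weights, decreasing = TRUE)) {
    w <- which.min(loads)            # least-loaded worker so far
    worker[i] <- w
    loads[w] <- loads[w] + weights[i]
  }
  worker
}

weights <- c(1, 1, 1, 1, 8, 8)       # hypothetical runtime estimates
worker <- lpt_assign(weights, n_workers = 2L)
loads <- as.vector(tapply(weights, worker, sum))
```

In this example the two expensive jobs end up on different workers and both workers carry a total weight of 10.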

HenrikBengtsson commented 6 years ago

> What is the input X here?

That would be the X in future_lapply(X, ...) such that you have the option to model the scheduling based on the input data in a single call.

mllg commented 6 years ago

> That would be the X in future_lapply(X, ...) such that you have the option to model the scheduling based on the input data in a single call.

Okay, this would work. Regarding the choice of API: I guess randomization with future.scheduling would look rather complicated, but I like the flexibility OTOH. Your choice :smile:

HenrikBengtsson commented 6 years ago

Some additional attempts to reuse existing arguments and thereby avoid having to introduce yet another argument:

future.scheduling = function(X) { ... return partitioning ... }
future.scheduling = structure(TRUE, order = "random")
future.scheduling = structure(TRUE, order = function(X) sample.int(length(X)))
future.scheduling = structure(2.0, order = "random")
future.chunk.size = structure(4, order = "random")

The above is in line with upcoming support for globals = structure(TRUE, add = more_globals) (https://github.com/HenrikBengtsson/future/issues/227).

Then one could also use formulas:

future.scheduling = TRUE ~ random
future.scheduling = TRUE ~ order:random (or?)
future.scheduling = 2.0 ~ random
future.chunk.size = 4 ~ random

Although it simplifies the notation for the user, I'm hesitant to jump on the formula approach because it requires a lot of serious thought about exactly what the notation should be - there are so many alternatives.
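For illustration only, this is how the two sides of such a formula could be picked apart in base R. Nothing here reflects an implemented notation; the variable names are hypothetical.

```r
# A formula keeps its left- and right-hand sides as unevaluated parts.
spec <- 2.0 ~ random
value <- spec[[2]]                     # left-hand side: the scheduling value
ordering <- as.character(spec[[3]])    # right-hand side: the symbol `random`

# A richer right-hand side is a call, which would need its own parsing rules.
spec2 <- TRUE ~ order:random
rhs_names <- all.vars(spec2[[3]])      # symbols used on the right-hand side
```

The second case hints at why the notation needs careful design: each extra operator on the right-hand side requires a decision about what it means.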

mllg commented 6 years ago

I'd definitely prefer the first approach using structure().

Have you considered introducing a control object (in the fashion of passing an rpart.control object to rpart())? A future.control object could bundle all arguments. It would be good for the overview (also in the documentation) and you would avoid inconsistencies like https://github.com/HenrikBengtsson/future.apply/issues/26.

HenrikBengtsson commented 6 years ago

> Have you considered introducing a control object (in the fashion of passing an rpart.control object to rpart())? A future.control object could bundle all arguments. It would be good for the overview (also in the documentation) and you would avoid inconsistencies like https://github.com/HenrikBengtsson/future.apply/issues/26.

Moved this part to Issue #27

HenrikBengtsson commented 5 years ago

I've added the below to the develop version:

Control processing order of elements:

The attribute ordering of future.chunk.size or future.scheduling can be used to control the order in which the elements are iterated over. It only affects the processing order, not the order in which values are returned. This attribute can take the following values:

index vector - a numeric vector of length length(X)
function - a function taking one argument, which is called as ordering(length(X)) and which must return an index vector of length length(X), e.g. function(n) rev(seq_len(n)) for reverse ordering.
"random" - this will randomize the ordering via the random index vector sample.int(length(X)).

For example,

future.chunk.size = structure(4L, ordering = rev(seq_along(X)))
future.chunk.size = structure(4L, ordering = function(nX) sample.int(nX))
future.scheduling = structure(TRUE, ordering = "random")
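The following base-R sketch shows one way the three kinds of ordering values could be resolved into an index vector, and how results can be put back into the original order after processing. It is not the future.apply implementation; the helper `resolve_ordering()` is hypothetical.

```r
# Hypothetical helper: turn the `ordering` attribute of a scheduling
# value into an index vector of length n.
resolve_ordering <- function(spec, n) {
  ordering <- attr(spec, "ordering")
  if (is.null(ordering)) return(seq_len(n))      # default: sequential order
  if (is.function(ordering)) {
    idxs <- ordering(n)                          # function of the length
  } else if (identical(ordering, "random")) {
    idxs <- sample.int(n)                        # random permutation
  } else {
    idxs <- ordering                             # plain index vector
  }
  # The result must be a permutation of 1..n.
  stopifnot(length(idxs) == n, setequal(idxs, seq_len(n)))
  idxs
}

X <- c(a = 1, b = 2, c = 3, d = 4)
spec <- structure(TRUE, ordering = function(n) rev(seq_len(n)))
idxs <- resolve_ordering(spec, length(X))

# Process the elements in the requested order ...
processed <- lapply(X[idxs], function(x) x^2)

# ... then restore the original order before returning the values.
values <- vector("list", length(X))
values[idxs] <- processed
names(values) <- names(X)
```

Here the elements are processed in reverse, yet `values` lines up with `X` element by element.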

Note that I did not add support for function(X) ... - only function(nX), where nX = length(X). This was mainly done to keep it simple and consistent between future_lapply() and future_mapply(). You can always make the ordering depend on the input data (not just its length) by passing a plain index vector or by having your ordering function refer to the input data via a global variable, e.g. function(nX) rev(seq_along(X)).
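As a sketch of the global-variable approach, the ordering function below ignores its argument's role beyond the interface and looks at X itself to process the largest elements first. Using element length as a proxy for cost is purely an assumption for this example, and the name `largest_first` is hypothetical. The call to future_lapply() is guarded and assumes a version of future.apply that supports the ordering attribute.

```r
X <- list(a = 1:3, b = 1:50, c = 1:10, d = 1:200)

# Takes only the length nX, as required, but reads X from the
# enclosing environment to derive a data-dependent ordering.
largest_first <- function(nX) order(lengths(X), decreasing = TRUE)

idxs <- largest_first(length(X))   # processing order as indices into X

if (requireNamespace("future.apply", quietly = TRUE)) {
  y <- future.apply::future_lapply(
    X, FUN = sum,
    future.scheduling = structure(TRUE, ordering = largest_first)
  )
  # The values in y follow the order of X, not the processing order.
}
```

Passing `idxs` directly as the ordering attribute would be the plain index vector variant of the same idea.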

Install with:

remotes::install_github("HenrikBengtsson/future.apply@develop")

HenrikBengtsson commented 5 years ago

FYI, future.apply 1.1.0, which implements the above, is now on CRAN.

mllg commented 5 years ago

Thanks!

futureverse / future.apply