Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

Add the load balancer for evenly distributing the tasks across workers #211

Open Jiefei-Wang opened 2 years ago

Jiefei-Wang commented 2 years ago

This pull request enables the load balancer in the apply function.

There are three built-in balancers for bplapply, namely "sequential", "stepwise", and "random". The sequential balancer is the one used in the master branch; however, I changed the default balancer to the stepwise balancer in this branch.

The stepwise balancer sends the 1st element of X to the 1st worker, the 2nd to the 2nd worker, and so on down to the last worker. Then it starts again, sending the next element of X to the 1st worker, and so on. The overhead of the stepwise balancer is marginal and its performance is better than the sequential balancer's. Here is an example:

library(BiocParallel)
p <- MulticoreParam(2)

## The balancer in the master branch
opt <- bpoptions(lapplyBalancer = "sequential")
system.time(
    bplapply(1:4, function(x)Sys.sleep(x), BPPARAM = p, BPOPTIONS = opt)
)
#   user  system elapsed 
#  0.102   0.033   7.039 

## stepwise balancer
opt <- bpoptions(lapplyBalancer = "stepwise")
system.time(
    bplapply(1:4, function(x)Sys.sleep(x), BPPARAM = p, BPOPTIONS = opt)
)
#   user  system elapsed 
#  0.117   0.003   6.078 

## send the elements of `X` to a random worker
opt <- bpoptions(lapplyBalancer = "random")
system.time(
    bplapply(1:4, function(x)Sys.sleep(x), BPPARAM = p, BPOPTIONS = opt)
)
#   user  system elapsed 
#  0.059   0.013   6.023
DarwinAwardWinner commented 2 years ago

Can you give some more examples of how the different balancers work? Suppose we have 10 tasks numbered 1 through 10, and there are 3 workers, labeled A, B, and C (deliberately chosen to have a non-integer ratio). Can you show how the tasks will be assigned by sequential and stepwise in these cases? For random, does the balancer ensure that an approximately equal number of tasks are sent to each worker, or does it randomly select a worker for each task independently of other tasks?

Jiefei-Wang commented 2 years ago

Sure, for the sequential balancer, the task dispatching plan is

A: 1, 2, 3
B: 4, 5, 6
C: 7, 8, 9, 10

For the stepwise balancer, it is

A: 1, 4, 7, 10
B: 2, 5, 8
C: 3, 6, 9

The random balancer will randomly create three sets of tasks, with cardinalities of 3, 3, and 4, respectively.
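
For illustration (this is plain base R reproducing the dispatching plans above, not the code the balancers actually use), the three plans for 10 tasks and workers A, B, C can be written as:

workers <- c("A", "B", "C")

## sequential: contiguous blocks of tasks per worker
split(1:10, rep(workers, times = c(3, 3, 4)))

## stepwise: round-robin over the workers
split(1:10, rep(workers, length.out = 10))

## random: a random permutation of the tasks, cut into groups of size 3, 3, 4
split(sample(1:10), rep(workers, times = c(3, 3, 4)))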

mtmorgan commented 2 years ago

The stepwise balancer performs well in this circumstance because of how the computation time scales with task number. But doesn't the 'random' balancer have a lower expected evaluation time, inasmuch as we don't know the distribution of task evaluation times?

Jiefei-Wang commented 2 years ago

Yes, the random balancer has the lowest expected evaluation time, but the highest variance (when you redo the same apply function many times).

If we do not know the task evaluation times in advance, the performance of the stepwise and random balancers should be comparable in most cases. It is more of a tradeoff between expectation and variance. I'm not a fan of randomization, so I chose the stepwise balancer as the default, but I kept the random balancer as an option in case the user knows the stepwise balancer will perform poorly for their apply function.

DarwinAwardWinner commented 2 years ago

A couple of points:

  1. Do the balancers only work for bplapply? If they're available for other functions, then I think the current name of the option is misleading and should be changed.

  2. Unless exact backward compatibility is desired for the sequential balancer, it would probably be better to dispatch more tasks to the first workers, e.g.:

    A: 1, 2, 3, 4
    B: 5, 6, 7
    C: 8, 9, 10

    This is because the first worker is usually the one that starts running earliest, so giving it more tasks will tend to make everything finish slightly faster on average. The stepwise balancer appears to already have this behavior. For the random balancer, you could sort the task lists in descending order of size before dispatching them to the workers to get this behavior.

  3. How easy is it to implement new balancers? Because I have a case where each worker takes so long to spin up relative to the runtime of each job that I actually want to give the first workers much longer lists of jobs than the last ones, e.g. something like:

    A: 1, 2, 3, 4, 5, 6
    B: 7, 8, 9
    C: 10

    This would get me closest to having all three workers finish at the same time, thus minimizing total walltime for the computation. I realize this is a somewhat rare case, so I don't know if I would want to ask for this to be implemented in the package, but it would be nice if I could write my own custom balancer, e.g. by providing a custom partitioning function or writing a class.

mtmorgan commented 2 years ago

Thanks.

I suppose that if the distribution of task evaluation times is independent of task order, then really any balancer has the same expected time?

I suppose (??) that the next most likely scenario is that task evaluation times are ordered (from low to high, or high to low), perhaps not intentionally? And then what is the optimal evaluation order? If I had seven tasks 1:7 taking 1:7 seconds, and 4 workers, then assigning worker: tasks as 1: 7; 2: 1, 6; 3: 2, 5; 4: 3, 4 would be optimal. But I don't think any of the balancers satisfies that?

DarwinAwardWinner commented 2 years ago

In my point 3 above, I'm assuming each task takes an equal amount of time. The reason for wanting to assign different numbers of tasks to each worker is that each worker takes so long to get started that by the time worker C has started, worker A has already been running long enough to run 4 or 5 tasks. For example, imagine that starting a worker takes 1 minute and each task takes 30 seconds to run.
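
As a back-of-the-envelope check of that scenario (the 60-second and 30-second figures are just the hypothetical numbers above, and sequentially staggered worker start times are an assumption), the allocation that lets all workers finish together can be computed directly:

startup   <- 60   # seconds until each successive worker is ready (assumed staggered starts)
task_time <- 30   # seconds per task
nworkers  <- 3
ntasks    <- 10

## worker k becomes available at k * startup; if all workers finish at a common
## time `finish`, then the sum over workers of (finish - k * startup) / task_time = ntasks
ready  <- startup * seq_len(nworkers)
finish <- (ntasks * task_time + sum(ready)) / nworkers
round((finish - ready) / task_time)
## roughly 5, 3, 1 -- close to the 6 / 3 / 1 split suggested in point 3 above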

Jiefei-Wang commented 2 years ago

Hello @DarwinAwardWinner, regarding your comments:

  1. The balancer is designed for bplapply and bpiterate. Since the other apply functions depend on these two functions, they will also inherit the balancer from bplapply or bpiterate.
  2. I think this is a good point. Backward compatibility of the balancer is not a major concern. I can implement it and make a force push.
  3. It is not very hard to implement a new balancer and plug it into bplapply. I will briefly introduce the balancer interface for bplapply here; you can find all the implementations in R/balancer.R. A balancer is defined by a generator function: it takes the number of elements of X and BPPARAM as input and returns a list of two functions, record and next_task. Here is a skeleton of a balancer:
mybalancer <- function(n, BPPARAM) {
  list(
    record = function(node, task_id, time) {
      ## record the task execution time reported by a worker
      ## (can be empty, e.g. for a static dispatching plan)
    },
    next_task = function() {
      ## return the next task as a list of two elements:
      ##   task_id: an integer id used to identify the task
      ##   index: the indices of the vector `X` that form this task
      list(
        task_id = task_id,   # placeholders: the balancer must compute
        index = index        # these from its own state
      )
    }
  )
}

During the parallel evaluation, the function next_task will be called to determine the task for each worker. For example, if we have X=runif(10) and 3 workers, the first call to next_task might return list(task_id=1, index=1:4). The second returns list(task_id=2, index=5:7) and the last returns list(task_id=3, index=8:10). This will give the desired load balancing for your point 2.

Once you have defined mybalancer, you can set the balancer via bpoptions, for example

opt <- bpoptions(lapplyBalancer = mybalancer)
bplapply(1:4, function(x)Sys.sleep(x), BPPARAM = p, BPOPTIONS = opt)

This lets bplapply use your customized balancer instead of one of its built-in balancers.
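
As a rough, untested sketch, a balancer that hands each worker one contiguous chunk, with the earlier workers getting the larger chunks as suggested in point 2, might look something like the following. It assumes next_task can signal "no more tasks" by returning NULL; see R/balancer.R for the actual conventions used by the built-in balancers.

frontLoadedBalancer <- function(n, BPPARAM) {
  nworkers <- bpnworkers(BPPARAM)
  ## chunk sizes differ by at most one, with the larger chunks first
  sizes <- rep(n %/% nworkers, nworkers)
  extra <- n %% nworkers
  if (extra > 0) sizes[seq_len(extra)] <- sizes[seq_len(extra)] + 1L
  chunks <- split(seq_len(n), rep(seq_len(nworkers), times = sizes))
  i <- 0L
  list(
    record = function(node, task_id, time) {
      ## a static dispatching plan needs no bookkeeping
    },
    next_task = function() {
      i <<- i + 1L
      if (i > length(chunks))
        return(NULL)          # assumed sentinel for "no more tasks"
      list(task_id = i, index = chunks[[i]])
    }
  )
}

opt <- bpoptions(lapplyBalancer = frontLoadedBalancer)
bplapply(1:10, sqrt, BPPARAM = p, BPOPTIONS = opt)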

Jiefei-Wang commented 2 years ago

For @mtmorgan's comment: I think if we know the task evaluation times in advance, we can provide a customized balancer to reach optimal performance. It is not very hard to implement. I plan to add a vignette giving a formal introduction to the balancer, along with the other advanced features we have added recently.

Jiefei-Wang commented 2 years ago

I made a mistake in my previous comment. If we have 10 tasks and 3 workers, the actual task sizes are 4, 4, 2. I think this is better than 3, 3, 4, as the former spreads the extra tasks across more workers while the latter gives all of the extra tasks to a single worker (imagine we have 109 tasks and 10 workers: one worker would end up with 19 tasks). There is no need to update the commit.
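
A quick way to see the difference between the two chunking schemes (purely illustrative):

n <- 109; w <- 10

## floor-based chunks, with the remainder dumped on the last worker
c(rep(n %/% w, w - 1), n - (n %/% w) * (w - 1))
##  10 10 10 10 10 10 10 10 10 19

## ceiling-based chunks, as described above
sizes <- rep(ceiling(n / w), w)
sizes[w] <- n - sum(sizes[-w])
sizes
##  11 11 11 11 11 11 11 11 11 10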

Jiefei-Wang commented 2 years ago

Hello Martin, I wonder if you can merge this pull request. It looks like we have some new feature requests these days.