(This issue was triggered by the discussion in Issue #6)
Currently, the doFuture package doesn't try to chunk up the calculations. It was mainly designed to allow running foreach iterations on a compute cluster where each iteration is a fairly long-running task. However, it's been on my list to look into chunking too, because I suspect, but I don't know for sure, that the foreach and/or iterators packages provide automatic chunking that various doNnn backends can make use of.
Basically, with the current doFuture implementation, 100 iterations, and 4 background workers, the backend will communicate with the four workers 100 times using 100 independent futures. A much cleverer approach would be to use only 4 futures so that each worker does 25 iterations in one go.
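To illustrate the arithmetic, here's a sketch of what such chunking amounts to; parallel::splitIndices() is just one way to split the iteration indices into one chunk per worker (doFuture does not currently do anything like this):

> library("parallel")
> chunks <- splitIndices(100, 4)  ## one chunk of iteration indices per worker
> lengths(chunks)
[1] 25 25 25 25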
If we use a multicore backend (which uses forked processes, not a fixed set of PSOCK cluster processes, as workers), we can show that doFuture will use 100 unique processes as follows:
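Here's a sketch of that demonstration (registerDoFuture() and plan() are the actual doFuture/future API; the exact PIDs vary from run to run, so only the count is shown):

> library("doFuture")
> library("iterators")
> registerDoFuture()
> plan(multicore, workers = 4)
> y <- foreach(icount(100), .combine = c) %dopar% { Sys.getpid() }
> length(unique(y))
[1] 100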
whereas if we do the same with doParallel (or doMC I guess), we get:
> library("doParallel")
> registerDoParallel(cores = 4)
> y <- foreach(icount(100), .combine = c) %dopar% { Sys.getpid() }
> str(unique(y))
int [1:4] 18016 18017 18018 18019
(If you're on Windows, I think registerDoParallel(cores = 4) will fall back to registerDoParallel(parallel::makeCluster(4)), which is a different thing.)
Since forking processes can be quite expensive, we'll see a dramatic time difference between the two approaches, where doFuture is painfully slow.
So, I'm pretty sure the lack of chunking in doFuture is the key reason for the performance difference between using doFuture and doParallel.
Action
The doFuture package is currently a very light and generic foreach wrapper for the Future API. It literally consists of 60 lines of plain R code.
Look into how automatic chunking is done in the foreach framework, assuming it exists (see the idiv() sketch after these action items).
Implement support for automatic chunking.
When doing compute cluster processing of large tasks, we should probably not chunk things up. This basically corresponds to having an infinite number of workers.
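For reference, the iterators package does ship a chunking helper, iterators::idiv(), which splits an iteration count into a given number of chunks; that is manual chunking, though, and whether the framework applies anything like it automatically is exactly what needs investigating:

> library("iterators")
> it <- idiv(100, chunks = 4)
> nextElem(it)
[1] 25
> nextElem(it)
[1] 25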
The latter point reminds me that I did ask the author of foreach what foreach::getDoParWorkers() should return for an endless number of workers. It turns out that I even prepared for this, e.g.
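As a hypothetical sketch of what such preparation could look like: a foreach backend is registered via foreach::setDoPar(fun, data, info), and getDoParWorkers() is answered by that info function, which could report +Inf for an unbounded worker pool. setDoPar() and future::nbrOfWorkers() are real APIs; the rest of the sketch is illustrative, not doFuture's actual code:

> ## Hypothetical sketch -- not doFuture's actual registration code.
> ## getDoParWorkers() consults the 'info' function passed to foreach::setDoPar()
> ## with item = "workers", so an unbounded backend could report +Inf here:
> info <- function(data, item) {
+   switch(item,
+     workers = future::nbrOfWorkers(),  ## may be +Inf for some future backends
+     name    = "doFuture",
+     version = utils::packageVersion("doFuture"),
+     NULL
+   )
+ }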