HenrikBengtsson / doFuture

:rocket: R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org

Why does doFuture/multicore have different behaviour than doMC? #39

Closed: statquant closed this issue 5 years ago

statquant commented 5 years ago

Hello, sorry for the noob question but I can't figure out why the following

library(doFuture)                         
registerDoFuture()                        
plan(multicore)                           
tmp <- foreach(i = 1:100) %dopar% print(i)

seems to print 1:100 in order while

library(foreach)                          
library(doMC)                             
registerDoMC()                            
tmp <- foreach(i = 1:100) %dopar% print(i)

does not. That makes me doubt forking is even happening...

HenrikBengtsson commented 5 years ago

Not a bad question at all. But, both approaches do indeed make use of the parallel::mclapply() framework internally. Here is how you can convince yourself that different R workers are in use without using print():

> library(doMC)                             
> registerDoMC()                            
> pids <- foreach(i = 1:10) %dopar% Sys.getpid()
> unlist(pids)
 [1] 28923 28924 28925 28926 28923 28924 28925 28926 28923 28924
> Sys.getpid()
[1] 28145

So, definitely different process IDs.
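
For completeness, the doFuture/multicore counterpart gives the same kind of picture. A minimal sketch (the actual PIDs will of course differ on your machine):

library(doFuture)
registerDoFuture()
plan(multicore)
## The worker PIDs differ from the main R session's PID,
## which shows that forked R workers are indeed being used
pids <- foreach(i = 1:10) %dopar% Sys.getpid()
unique(unlist(pids))
Sys.getpid()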

So, it's all about whether output is seen or not. When you use parallel::mclapply(), or doMC as here, it depends on your R environment whether output (e.g. from print(), cat(), ...) produced on multicore workers (= forked R processes) will be visible to you or not. For instance, on Linux in a plain terminal, you'll actually see the output, e.g.

> tmp <- parallel::mclapply(1:2, print)
[1] 1
[1] 2

However, calling the same in the RStudio Console on the same Linux machine and R version will show nothing.

So, why? I'd say, we're basically lucky to get output in the plain terminal. So, I wouldn't count on it and certainly not expect it.

... now to the future framework. The above used to be the case there as well, but since future 1.9.0 (2018-07-22), the future framework captures all output internally and relays it back to you automatically. This works the same regardless of which parallel backend you use. It also works the same everywhere, including the RStudio Console, e.g.

> library(future.apply)
> plan(multicore)
> tmp <- future_lapply(1:2, print)
[1] 1
[1] 2
>
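
The same relaying applies to foreach with doFuture, so your original example should now also show its output in the RStudio Console. A minimal sketch (untested in your particular setup):

library(doFuture)
registerDoFuture()
plan(multicore)
## print() output from the forked workers is captured and
## relayed back to the main R session as the futures resolve
tmp <- foreach(i = 1:3) %dopar% print(i)
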
statquant commented 5 years ago

Ok, this is understood. Can I ask why you took the decision to 'retain' the output on stdout and only deliver it once the workers are done? I might be missing an important consideration, but I really like being able to see what happens on the workers when it happens.

statquant commented 5 years ago

I also have another question regarding best practice when using

way 1: splitting the big data in the hope that only part of it is sent to the workers

DT_split = split(DT, by = 'date')
foreach(DT_i = DT_split) %dopar% some_func(DT_i)

way 2: not splitting

dates = DT[, unique(date)]
foreach(date_i = dates) %dopar% some_func(DT[date == date_i]) 

Am I right that

  • in the multicore case both are equivalent because DT is "shared" across all workers?

  • in the slurm case way 1 should be better because less data is sent to the workers?

HenrikBengtsson commented 5 years ago

Can I ask why you took decision to 'retain' the output on stdout and only deliver it once workers are done ?

The reason is that you cannot reliably relay stdout in a live fashion. That's what I hoped your original problem and my illustration above conveyed.

... only deliver it once workers are done ?

Technically, it's relayed as soon as possible once a future is resolved. Also, since we want futures to produce the same output regardless of which parallel backend is used, stdout is retained until all output from preceding futures has been collected and outputted. Otherwise, we'd output things out of order.
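
As a small illustration, here's a sketch (with made-up sleep times) where a later element finishes first, yet the output is still relayed in order, assuming the elements end up in different futures:

library(future.apply)
plan(multicore)
## Element 1 sleeps the longest, so it finishes last, but its output
## is still relayed first; output from later futures is held back
## until all preceding futures have been relayed
tmp <- future_lapply(3:1, function(x) {
  Sys.sleep(x)
  print(x)
})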

HenrikBengtsson commented 5 years ago

Yes, splitting up data outside of futures/parallel iterations should lower the amount of data exported to workers. Correct, in "way 2", you're exporting all of DT to each future worker.

Am I right that

  • in the multicore case both are equivalent because DT is "shared" across all workers?

In the ideal case, yes. Forked processes, which is what multicore uses, help lower the memory load. That is handled solely by the OS, not by R or the parallel backends. However, there are "howevers". R's garbage collector (GC) may kick in in one or more of the forked child processes. When that happens, it will deallocate and eventually overwrite memory in some of the children. At that point, the memory space among the children can no longer be shared, and the OS will start creating copies. So, it's not all good. I don't know if anyone has done a careful study of this, so I don't have any pointers. What I do know is that you cannot disable R's GC, so you just have to cross your fingers.

Then there are bigger issues, such as it not being safe to run multi-threaded processing inside forked processes. For instance, I don't know if data.table, which uses multi-threaded processing, is safe to use with multicore. I tend to suggest multisession more and more these days if you want to be on the safe side.
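
If you do go the multisession route with data.table, a conservative pattern could be something like the sketch below. DT_split and some_func are the objects from your example, and forcing a single data.table thread per worker is just a defensive assumption on my part, not a documented requirement:

library(doFuture)
registerDoFuture()
plan(multisession)
tmp <- foreach(DT_i = DT_split) %dopar% {
  ## Defensive: avoid multi-threading nested inside each parallel worker
  data.table::setDTthreads(1)
  some_func(DT_i)
}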

  • in the slurm case way 1 should be better because less data is sent to the workers?

Correct. Same for multisession and several other backends.

FYI, it's on the to-do list to have futures gather time and memory benchmarks (https://github.com/HenrikBengtsson/future/issues/59) to make it easier for developers to profile different strategies like yours.
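
In the meantime, a rough manual check of how much data each approach would export is to look at the object sizes yourself. A sketch, using DT and DT_split from your example:

## Way 2 exports all of DT to every worker
format(object.size(DT), units = "MB")
## Way 1 exports only one chunk per task
sapply(DT_split, function(x) format(object.size(x), units = "MB"))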

statquant commented 5 years ago

many thanks for all the explanations