HenrikBengtsson / doFuture

:rocket: R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org

global data.frame is not updated #29

Closed lsaravia closed 5 years ago

lsaravia commented 5 years ago

Maybe it is my mistake, but when I try to update a global data.frame from inside a foreach loop, the data.frame is not updated:

library("doFuture")
registerDoFuture()
plan(multisession)
x <- data.frame(a=1:5,b=NA,c=NA)
foreach(i = 1:nrow(x)) %dopar% {
  x$b[i] <- runif(1)
  x$c[i] <- runif(1)
}
x

HenrikBengtsson commented 5 years ago

Hi. This is a FAQ when it comes to foreach and any other parallel processing frameworks, e.g. https://stackoverflow.com/questions/18763376/global-assignment-parallelism-and-foreach. It is also independent of the parallel backend you use and you get the same with, say, doParallel.

The reason this does not work is that the parallel code (what's inside the foreach { ... } expression) is evaluated in an R process separate from the main R session from which it is called. It is not possible for other R processes on your machine to update the variables in your main R session. An analogy is when you run two separate R sessions manually and do x <- 42 in one of them - you wouldn't expect x to be assigned in the other R session. That's how parallel processing in R works too. Instead, all values must be "returned" at the end of a parallel evaluation.
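The two-sessions analogy can be tried out directly. The following is a minimal sketch that uses the base R parallel package instead of foreach or future (that swap is an assumption made only to keep the example free of extra packages); the variable names are illustrative. It starts two background R processes, assigns x on them, and shows that x in the main session is left alone.

```r
library(parallel)  # ships with base R; used here only for illustration

x <- 1
cl <- makeCluster(2)  # two background R processes (workers)

# Each worker is its own R process with its own process ID
worker_pids <- unlist(clusterEvalQ(cl, Sys.getpid()))

# Assign x on the workers only; the main session is not touched
invisible(clusterEvalQ(cl, x <- 42))
worker_x <- unlist(clusterEvalQ(cl, x))

stopCluster(cl)

print(Sys.getpid() %in% worker_pids)  # FALSE: workers are other processes
print(worker_x)                       # 42 42
print(x)                              # 1: unchanged in the main session
```

The only way to get anything back from the workers is through the value each call returns, which is what worker_pids and worker_x collect here.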

You're not the first one to think foreach() works like a regular for loop - it doesn't! It's an unfortunate name. However, foreach() attempts to emulate a for loop, and it is (unfortunately) really good at it when you use foreach() %do% { ... } or foreach() %dopar% { ... } with registerDoSEQ(), so when you test your code with that it "seems to work". But as soon as you move to true parallel processing, things fall apart. Instead, treat foreach() as you treat lapply(). If you do, then your example is effectively equal to:

lapply(1:nrow(x), FUN = function(i) {
  x$b[i] <- runif(1)
  x$c[i] <- runif(1)
})

and in that case you wouldn't expect x to be updated, correct? Instead, you should always "return" values. That is, make sure all of your foreach() calls are assigned to a variable, cf. y <- lapply(...). So, in your case you can do something like:

x <- data.frame(a=1:5,b=NA,c=NA)
y <- foreach(i = 1:nrow(x), .combine = rbind) %dopar% {
  data.frame(b = runif(1), c = runif(1))
}
x$b <- y$b
x$c <- y$c
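The same return-and-assign pattern can be checked sequentially with plain lapply(), without any parallel backend. This is only a sketch of the analogy: the helper names unchanged and rows are made up for illustration, and do.call(rbind, ...) stands in for .combine = rbind.

```r
x <- data.frame(a = 1:5, b = NA, c = NA)

# Assigning inside the function only changes a local copy of x
invisible(lapply(seq_len(nrow(x)), FUN = function(i) {
  x$b[i] <- runif(1)
  x$c[i] <- runif(1)
}))
unchanged <- all(is.na(x$b)) && all(is.na(x$c))  # global x is untouched

# Return one row per iteration, combine them, then assign in the main session
rows <- lapply(seq_len(nrow(x)), FUN = function(i) {
  data.frame(b = runif(1), c = runif(1))
})
y <- do.call(rbind, rows)
x$b <- y$b
x$c <- y$c
```

If code is written this way, swapping lapply() for foreach() %dopar% should not change which variables end up updated, because nothing relies on side effects inside the loop body.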

Hope this helps