HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Issues with foreach and SGE #686

Closed fkgruber closed 1 year ago

fkgruber commented 1 year ago

furrr works perfectly with future.batchtools. If you have a loop with 3 elements you get 3 jobs on the cluster:

library(furrr)
library(future.batchtools)
plan(batchtools_sge)
nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x))
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  14375 0.50500 jobb856749 fred         r     06/03/2023 22:23:03 all.q@ip-xxxec2.inte     1        
  14376 0.50500 job2b8fc0d fred         r     06/03/2023 22:23:03 all.q@ip-1xxx.ec2.inte     1        
  14377 0.50500 jobfdf28f1 fred         r     06/03/2023 22:23:03 all.q@ip-xx.ec2.inte     1        

With foreach, however, I only get 1 job:

library(foreach)
library(future)
library(furrr)
library(future.batchtools)
library(doFuture)

mu <- 1.0
sigma <- 2.0
registerDoFuture()
plan(batchtools_sge)
x %<-% {
  foreach(i = 1:3) %dopar% {
    Sys.sleep(3)
    set.seed(123)
    rnorm(i, mean = mu, sd = sigma)
  }
}
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  14378 0.50500 job7472e4f fred         r     06/03/2023 22:26:03 all.q@ip-xxx.ec2.inte     1        

When it returns, we get the following two strange warnings:

Warning messages:
1: executing %dopar% sequentially: no parallel backend registered
2: UNRELIABLE VALUE: Future ( ) unexpectedly generated random numbers without specifying argument 'seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".

Why does it say there is no parallel backend registered when I'm running registerDoFuture()?

Alternatively, I tried %dofuture% instead of %dopar% but it still only generates 1 job.


x %<-% {
  foreach(i = 1:10) %dofuture% {
    Sys.sleep(3)
    set.seed(123)
    rnorm(i, mean = mu, sd = sigma)
  }
}

f = futureOf(x)
resolved(f)
x
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  14379 0.50500 job4b3cd22 fred         r     06/03/2023 22:28:48 all.q@ipxxx     1        

This time I only get the random number warning:

Warning message:
UNRELIABLE VALUE: At least one of iterations 1-10 of the foreach() %dofuture% { … }, part of chunk #1 ( doFuture2-1 ), unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify foreach() argument '.options.future = list(seed = TRUE)'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option 'doFuture.rng.onMisuse' to "ignore".

I also tried other options in foreach like .options.future = list(scheduling=1) but they don't seem to have any effect.

Is it possible that foreach somehow chunks all the iterations into one task? Or is something else not working?

Thanks, Fred

scottkosty commented 1 year ago

TLDR: I was hoping to save Henrik a bit of time but alas I failed.

Thanks for that nice minimal example.

Regarding the issue about the seed, you can read about the purpose of the warning in ?future under the seed argument. The warning even kindly suggests how to add it in your context, by giving the argument to specify to foreach(). However, when I tried that, I still got a warning.

Here is the code I ran, based on your code (note the addition of .options.future = list(seed = TRUE)):

library(foreach)
library(future)
library(furrr)
library(doFuture)

mu <- 1.0
sigma <- 2.0

plan(multisession)

x %<-% {
  foreach(i = 1:10, .options.future = list(seed = TRUE)) %dofuture% {
    Sys.sleep(3)
    set.seed(123)
    rnorm(i, mean = mu, sd = sigma)
  }
}

f = futureOf(x)
resolved(f)
x

However, I still get the following warning:

UNRELIABLE VALUE: Future (‘<none>’) unexpectedly generated random numbers without specifying argument 'seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".

scottkosty commented 1 year ago

By the way, in your real example do you really want to call set.seed(123) inside the body, or was this just for debugging purposes? There are use cases (I have one) where you might want to do that, but in most cases you do not want to set the seed inside the body, especially once we figure out the seed argument issue.
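
As a quick illustration of why resetting the seed inside the body is usually not what you want, here is a base-R sketch. It uses lapply() as a stand-in for the foreach() loop (that substitution is my assumption, not something from the original code) and hard-codes the values of mu and sigma from the example. Because every iteration restarts the same random stream, the shorter results are just prefixes of the longer ones rather than independent draws:

```r
# Stand-in for the foreach() body: each iteration resets the RNG to the same state.
draws <- lapply(1:3, function(i) {
  set.seed(123)
  rnorm(i, mean = 1.0, sd = 2.0)
})

# Iteration 1 reproduces the first draw of iteration 3, and
# iteration 2 reproduces its first two draws.
identical(draws[[1]], draws[[3]][1])
identical(draws[[2]], draws[[3]][1:2])
```

If the iterations are meant to be statistically independent, dropping set.seed() from the body and letting the seed argument handle the RNG avoids this overlap.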

HenrikBengtsson commented 1 year ago

I'm not at a computer right now, but you don't want to use %<-% here; just use a regular <- assignment.
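
A minimal sketch of that suggestion, not run on SGE. It assumes doFuture >= 1.0.0 (for %dofuture%) and swaps in plan(multisession, workers = 2) as a local stand-in for plan(batchtools_sge); both choices are assumptions for illustration. With %<-%, the whole foreach() call is wrapped in a single future, which would explain the single job in qstat. With a regular assignment, foreach() runs in the main session and creates the futures itself. The seed option is the one the warning message suggests, and set.seed() is dropped from the body:

```r
library(foreach)
library(future)
library(doFuture)

# Local stand-in for plan(batchtools_sge) so the sketch runs without a cluster.
plan(multisession, workers = 2)

mu <- 1.0
sigma <- 2.0

# Regular assignment instead of %<-%: the loop itself is not wrapped in a future.
x <- foreach(i = 1:3, .options.future = list(seed = TRUE)) %dofuture% {
  rnorm(i, mean = mu, sd = sigma)
}

str(x)

# Shut down the background workers again.
plan(sequential)
```

On the cluster, replacing the plan() line with plan(batchtools_sge) should, as far as I can tell, let the iterations be submitted as separate jobs; how many jobs appear depends on the chunking settings.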