HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Discussion: Forcing parallel reproducibility via the default RNG kind? #354

Open pat-s opened 4 years ago

pat-s commented 4 years ago

(This issue is meant as a public place for the discussion @HenrikBengtsson and I recently had by email.)

I wonder whether it makes sense for a package to provide reproducible parallel streams even when the default RNG kind "Mersenne-Twister" is in use, to help users who are not aware that this RNG kind does not provide reproducible streams in parallel.

(I am talking about the standard parallel backends in R and not specifically about the way one can do this via the {future} package.)

Multicore backend

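   # keep a copy of the user's RNG state (NULL if no seed has been set yet)
   old.seed = get0(".Random.seed", envir = globalenv(), inherits = FALSE)
   # draw the new seed from the current stream, so it depends on a user-supplied seed
   seed = sample(1:100000, 1)
   # we need to reset the seed first in case the user supplied a seed,
   # otherwise "L'Ecuyer-CMRG" won't be used
   rm(.Random.seed, envir = globalenv())
   set.seed(seed, "L'Ecuyer-CMRG")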

If the user calls set.seed(<number>) and then goes parallel via the multicore backend, the code above ensures reproducible parallel RNG streams.

If you want to see it in action, https://github.com/mlr-org/parallelMap/pull/80 has some tests that check it works correctly.
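As a rough illustration of the multicore case, here is a minimal sketch that seeds "L'Ecuyer-CMRG" and then calls mclapply() twice with the same seed. The helper name run_parallel_draws() is hypothetical and not part of {parallel} or parallelMap. The sketch assumes a Unix-alike, since forking is not available on Windows (it falls back to one core there), and it assumes the same number of cores and the default prescheduling in both runs.

```r
library(parallel)

# Hypothetical helper for illustration only (not part of any package).
run_parallel_draws <- function(seed, n_tasks = 4L, cores = 2L) {
  old_kind <- RNGkind()[1L]
  # Restore the user's RNG kind afterwards (this does not restore the seed state).
  on.exit(RNGkind(old_kind), add = TRUE)
  # Seed the generator that supports independent parallel streams.
  set.seed(seed, kind = "L'Ecuyer-CMRG")
  res <- mclapply(seq_len(n_tasks), function(i) runif(1),
                  mc.cores = cores, mc.set.seed = TRUE)
  unlist(res)
}

# Forking is unavailable on Windows, so use a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
a <- run_parallel_draws(42, cores = cores)
b <- run_parallel_draws(42, cores = cores)
print(identical(a, b))
```

With the same seed and the same number of cores, the two runs should return identical draws.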

Socket backend

Here, one can do

clusterSetRNGStream(cl, iseed = sample(1:100000, 1)) 

to support the default RNG kind in parallel scenarios.
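A small sketch of how this could look end to end, assuming a local two-worker socket cluster. The helper name socket_draws() is made up for this example. Drawing iseed with sample() after the user's set.seed() is what makes the worker streams depend on the user-supplied seed.

```r
library(parallel)

# Hypothetical helper: run a small job on a socket cluster with RNG streams
# derived from the user's seed.
socket_draws <- function(cl, user_seed) {
  set.seed(user_seed)                    # what the user would normally do
  iseed <- sample(1:100000, 1)           # derived from the user's seed
  clusterSetRNGStream(cl, iseed = iseed) # one L'Ecuyer-CMRG stream per worker
  parSapply(cl, 1:4, function(i) runif(1))
}

cl <- makeCluster(2L)
a <- socket_draws(cl, user_seed = 1)
b <- socket_draws(cl, user_seed = 1)
stopCluster(cl)
print(identical(a, b))
```

Because parSapply() splits the tasks across the workers in a fixed way, repeating the call with the same user seed should give the same results.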

General

Whenever doing this, I wonder if one should at least tell the user that this was done behind the scenes, to make them aware of what's happening (including possible decreases in speed).
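One way such a notification could look is sketched below. The function switch_to_parallel_rng() and its verbose argument are hypothetical, and the message wording is only a suggestion.

```r
# Hypothetical helper: switch to "L'Ecuyer-CMRG" and tell the user about it.
switch_to_parallel_rng <- function(verbose = TRUE) {
  old_kind <- RNGkind()[1L]
  if (old_kind != "L'Ecuyer-CMRG") {
    # Derive the new seed from the current stream, so that a user-supplied
    # set.seed() still determines the result.
    seed <- sample(1:100000, 1)
    set.seed(seed, kind = "L'Ecuyer-CMRG")
    if (verbose) {
      message("RNG kind switched from \"", old_kind,
              "\" to \"L'Ecuyer-CMRG\" for reproducible parallel streams.")
    }
  }
  invisible(old_kind) # returned so that the caller can restore it later
}

set.seed(1)
old_kind <- switch_to_parallel_rng()
# ... parallel code would run here ...
RNGkind(old_kind) # restore the previous RNG kind
```

Using message() rather than print() or cat() means users can silence the note with suppressMessages() if they do not want it.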

HenrikBengtsson commented 4 years ago

> (I am talking about the standard parallel backends in R and not specifically about the way one can do this via the {future} package.)

I see. So, this is more of a discussion on how it works in base R and the parallel package, and what the best practices could be there? Does this also apply to the following?

> Whenever doing this, I wonder if one should at least tell the user that this was done behind the scenes, to make them aware of what's happening (including possible decreases in speed).

If so, maybe this issue is better suited for https://github.com/HenrikBengtsson/Wishlist-for-R/issues - should I transfer it there?

pat-s commented 4 years ago

> If so, maybe this issue is better suited for HenrikBengtsson/Wishlist-for-R/issues - should I transfer it there?

Ok, feel free to move it :)

I am not sure if there is any chance that one can make such a large change in the {parallel} package - or if it even makes sense to discuss potential changes there. What do you think?

I mean, on the one hand, R is getting a major version bump, so right now would be a good time to harmonize things - but yeah, I'm unsure whether the invested time would pay off.