HenrikBengtsson / doFuture

:rocket: R package: doFuture - Use Foreach to Parallelize via Future Framework
https://doFuture.futureverse.org
84 stars 6 forks source link

Discussion: About reproducible parallel RNG streams #41

Closed pat-s closed 3 years ago

pat-s commented 4 years ago

Continuing our mail discussion here.

The variety of options, especially when using the foreach framework seems overhwelming.

Example 1 ``` r library("doFuture") #> Loading required package: globals #> Loading required package: future #> Loading required package: foreach #> Loading required package: iterators #> Loading required package: parallel # register parallel backend registerDoFuture() plan(multisession, workers = 2) set.seed(123, "L'Ecuyer-CMRG") r15 = foreach(i = rep(3, 2)) %dopar% { runif(i) } set.seed(123, "L'Ecuyer-CMRG") r16 = foreach(i = rep(3, 2)) %dopar% { runif(i) } all.equal(r15, r16) #> [1] "Component 1: Mean relative difference: 0.2142185" #> [2] "Component 2: Mean relative difference: 0.5201712" ``` Created on 2019-12-25 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
Example 2 ``` r library("doFuture") #> Loading required package: globals #> Loading required package: future #> Loading required package: foreach #> Loading required package: iterators #> Loading required package: parallel library("doRNG") #> Loading required package: rngtools #> Loading required package: pkgmaker #> Loading required package: registry #> #> Attaching package: 'pkgmaker' #> The following object is masked from 'package:base': #> #> isFALSE # register parallel backend registerDoFuture() plan(multisession, workers = 2) registerDoRNG(123) r15a = foreach(i = rep(3, 2)) %dopar% { runif(i) } registerDoRNG(123) r16a = foreach(i = rep(3, 2)) %dopar% { runif(i) } all.equal(r15a, r16a) #> [1] TRUE ``` Created on 2019-12-25 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
HenrikBengtsson commented 4 years ago

So, several things go on here. The underlying issue is that foreach left it to the backends to handle certain things, including parallel RNG. This is very unfortunate because it is really hard to write foreach code that works as-is with different foreach adapters. Ideally, foreach would have implemented parallel RNG itself such that it would work the same everywhere. (This is the approach I'm taking in the future framework, e.g. future.apply). Such a discussion is probably better suited over at https://github.com/RevolutionAnalytics/foreach/issues.

I did consider having doFuture have built-in support for parallel RNG much like future.apply. If I did (or end up doing), then you could do:

library(foreach)
doFuture::registerDoFuture()
set.seed(123) 
y <- foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

and you'd get numerically identical results regardless of future-backend used (as you hoped for in Example 1). Maybe I should just add that - it's pretty easy to do. The downside is that the same foreach() will not use proper parallel RNG if you use:

doParallel::registerDoParallel(4L)

instead. Currently, all existing foreach adapters requires doRNG for parallel RNG. So, you need to do something like you propose in Example 2:

doParallel::registerDoParallel(4L)
doRNG::registerDoRNG(123)

Using doRNG is basically a de-facto standard (= the only solution available) that I didn't want to break. At least, I didn't want to break it or introduce yet another standard, without carefully consider it's consequences.

Howeber, maybe one can argue that if

doParallel::registerDoParallel(4L)
doRNG::registerDoRNG()

remain in control of the end user, and not the developer, then it would be straightforward for the user to switch to:

future::plan("multisession", workers = 4L)
doFuture::registerDoFuture()

and

y <- foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

would work in both cases.

I'll try to think about this more but right now I tempted to add built-in support for parallel RNG to doFuture. It would certainly lower friction for both developers and end users.

HenrikBengtsson commented 4 years ago

Is "Example 2" the canonical way from your point of view right now?

Yes, I think you found a nice balance there. Personally, I think:

doRNG::registerDoRNG()
y <- foreach(...) %dopar% { ... }

is nicer and less confusing than:

library(doRNG)
y <- foreach(...) %dorng% { ... }

One reason is the developer does not have to update from %dopar% to %dorng% when needing parallel RNGs. Also, the latter is error prone because it's easy to forget to do that update and there will not be an warnings or errors telling you you forgot. Third, as argue in my previous comment, sticking with %dopar% everywhere will make it possible to add built-in support for parallel RNG in doFuture without having to update the foreach code.

pat-s commented 4 years ago

Such a discussion is probably better suited over at RevolutionAnalytics/foreach/issues.

Yeah, maybe more people will find it here since I think the other GH repo is not as well known as doFuture. We can't change it anymore now, so let's go with it 😄

From my point of view, I would like to see something like

library(foreach)
doFuture::registerDoFuture(reproducible = TRUE)
set.seed(123) 
y <- foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

Not sure if this is somewhat of an overkill but it would make it very clear by reading only the code that something reproducible is going on here. I kinda miss this explicit statement in lots of scripts because it is often not obvious to readers that the "L'Ecuyer-CMRG" RNG kind will do this (wherever it might be used, in the script or internally in a package). I'll think of doing so in {parallelMap} and the more I think about it, the more I like it.

In my dreams I would like to see this argument everywhere, including {parallel} combined with a help page in the functions explaining what's going on. This would be a very clear and simple explanation of what's going on, even behind the scenes. What do you think?

Being strict with this, it would be of course also be nice if all {future} packages would go this way then to be consistent. The thinking is again: "Whenever you want to have reproducible numbers when going parallel, set reproducible = TRUE in whatever {future} related function you are calling". future.seed is already there but maybe not really descriptive by its name to the end user? I know that changing the argument name is not really an option for backward comp and adding another one might again make the situation more complex :/ (I don't know, just some random thoughts of mine - maybe a bit to revolutionary 😄 )


dopar vs dorng

I see no good reason why the second one should even exist (just from a "do we need it" point of view). I think for simplicity and to keep things tidy, we should just go with one operator to indicate one is going parallel and deal with the reproducibility in some other way than switching to a different operator. I haven't seen any people using it in the wild so I am not sure how "famous" it really is. I also was not aware of it before you mentioned it lately to me - but that is of course not a robust reference of its popularity.

HenrikBengtsson commented 4 years ago

Quick comment: I think there's room for more than one definition of reproducible, e.g. numerical and statistical. In the future framework (as in future.apply), I aimed at supporting numerically identical RNG reproducibility so backend or number workers does not matter. However, that comes with some overhead and in many statistical analyses / simulations it should be sufficient use valid parallel RNG per worker (not per element), cf. https://github.com/HenrikBengtsson/future.apply/issues/20. The latter is basically what's used in parallel::mclapply(..., mc.set.seed = TRUE).

HenrikBengtsson commented 4 years ago

Actually, when thinking more about it, it might be a bad idea to use:

doRNG::registerDoRNG()

instead of %dorng%. My reason for thinking so now is that %dorng% is in control of the developer and they are the person who knows whether parallel RNG streams are needed or not. With doRNG::registerDoRNG() we leave it to the user to make sure correct RNG settings are used, which should not be their responsibility (*).

(*) One could, of course, imagine the developer calling doRNG::registerDoRNG() internally, but I'm not sure if one can undo that with on.exit() afterward. Also, using %dorng% will make it much more clear in the code when parallel RNG is needed.

pat-s commented 4 years ago

I see.

Well, there are probably pros and cons for both. Part of the problem is that

I wonder what steps might be most promising to not be too pushy to the devs but also pointing out our concerns/ideas.

HenrikBengtsson commented 4 years ago
  • {foreach} does not have a dev repo on Github (at least its somewhat hidden and not linked in the DESCRIPTION file)

The official GitHub foreach repos is https://github.com/RevolutionAnalytics/foreach/. It used to be hosted on the R-Forge SVN server but moved to GitHub about mid 2019, I think. At this time, maintenance was also handed over too Hong Ooi. Drop an issue asking if they can add URL and BugReports fields to the DESCRIPTION file so it'll show up on CRAN.

  • {doRNG} does not give a nice overview in terms of usage and what might be the best practice from their point of view

I would drop an issue at https://github.com/renozao/doRNG/issues with a wish for clarification/thoughts from the maintainer on whether to use registerDoRNG() or %dorng%. It might help clarify the question if you scribble down what you currently think are the pros and cons too. Hopefully, the outcome of such a discussion will find it's way into the documentation/help pages.

  • {foreach} does not talk about/list it explicitly in their package as a backend

Yes, this is unfortunate. It would be useful to know the history of doRNG in relationship to foreach, e.g. where they talking to each other or did doRNG fill a gap that the foreach folks didn't want to fill in themselves, ... Knowing a bit more about this would help understand the current design and philosophy better which in turn help when you propose improvements.

(Personally, I think foreach should have built-in support for parallel RNGs such that code and results would not depend on the backend used.

HenrikBengtsson commented 3 years ago

After you created the 'Native support for a reproducible parallel RNG streams?' issue over at foreach, I think we can close this one here. The foreach issue covers the problems and suggestions needed to go forward there.

For what's it worth, with future 1.21.0 now on CRAN, the next release of doFuture will produce a more informative and specific warning message when one forgets to declare the use of the RNG;

> library(doFuture)
> registerDoFuture()
> y <- foreach(x=1:2) %dopar% { rnorm(x) }
Warning message:
UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without 
declaring so. There is a risk that those random numbers are not statistically sound and the overall results 
might be invalid. To fix this, use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that
proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, 
set option 'future.rng.onMisuse' to "ignore".