HenrikBengtsson closed this issue 8 years ago.
Adapted from Section 6 of vignette("parallel", package="parallel"):
fsample <- function(x, size=10L, seed=NULL, debug=FALSE) {
  base::RNGkind("L'Ecuyer-CMRG")
  if (!is.null(seed)) set.seed(seed)
  .seed <- .Random.seed
  res <- listenv::listenv()
  for (ii in seq_len(size)) {
    .seed <- parallel::nextRNGStream(.seed)
    res[[ii]] %<=% {
      .GlobalEnv$.Random.seed <- .seed
      sample(x, size=1L)
    }
  }
  unlist(res)
} # fsample()
and
> library("future")
> plan(eager)
> fsample(0:9)
[1] 7 4 3 5 6 2 3 8 2 8
> fsample(0:9)
[1] 9 2 9 5 4 3 7 8 9 6
> fsample(0:9, seed=42L)
[1] 8 4 5 5 8 1 4 9 5 4
> fsample(0:9, seed=42L)
[1] 8 4 5 5 8 1 4 9 5 4
> plan(lazy)
> fsample(0:9)
[1] 0 4 3 1 4 8 6 6 0 3
> fsample(0:9)
[1] 3 0 2 9 5 3 8 6 9 2
> fsample(0:9, seed=42L)
[1] 8 4 5 5 8 1 4 9 5 4
> fsample(0:9, seed=42L)
[1] 8 4 5 5 8 1 4 9 5 4
> plan(multicore)
> fsample(0:9)
[1] 9 2 1 6 4 3 1 7 9 9
> fsample(0:9)
[1] 5 6 6 9 3 0 9 9 3 7
> fsample(0:9, seed=42L)
[1] 8 4 5 5 8 1 4 9 5 4
> fsample(0:9, seed=42L)
[1] 8 4 5 5 8 1 4 9 5 4
Added test for the above to the future package, cf. https://github.com/HenrikBengtsson/future/commit/1c50b6fabdbfd0788d5a51b6359689a29a02770a
We now have numerical reproducibility regardless of what type of futures is used. Package (commit 725d1be) verified using R CMD check --as-cran on:
By introducing parallel processing, we have to worry about the random number generator, because DNAcopy relies on it.
We already have argument seed for the segmentation methods. However, that is only used to set the random seed when the function is first called (and it is undone afterward). The individual chromosomes (or known segments) are then segmented sequentially/separately by calling the same function on the subset of interest with argument seed=NULL. This guarantees reproducibility as long as the initial call sets the random seed.

However, with asynchronous/parallel processing of individual chromosomes / known segments, the order of segmentation is no longer guaranteed. Because of this, setting the random seed once at the very beginning no longer guarantees the same results. We need a way to control the random seeds throughout the process. This might explain this validation error.
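The order-dependence can be demonstrated with a minimal base-R example (hypothetical values; any seed works). Drawing the same "tasks" in a different order after resetting the same seed gives each task different numbers, because what matters is a task's position in the RNG stream, not the initial seed:

```r
set.seed(42)
a <- sample(1:100, 3)       # "task A" runs first
b <- sample(1:100, 3)       # "task B" runs second

set.seed(42)                # same seed, but tasks run in reverse order
b2 <- sample(1:100, 3)      # "task B" now consumes what "task A" consumed before
a2 <- sample(1:100, 3)

identical(a, b2)            # TRUE: same position in the RNG stream
identical(a, a2)            # FALSE (in general): "task A" now draws later numbers
```

This is exactly why per-task RNG substreams (as in fsample() above) are needed rather than a single set.seed() up front.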
We should probably not reuse the initial random seed. However, we might be able to pre-generate an additional set of seeds from the original seed and pass them down one by one (as they are consumed) in a controlled order. Sketch:
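The original sketch was not preserved here; the following is a hypothetical reconstruction modeled on fsample() above. It pre-generates one L'Ecuyer-CMRG substream per task (e.g. per chromosome / known segment) with parallel::nextRNGStream(), so each task restores its own pre-assigned seed and the result is independent of the order in which tasks execute (names such as nbrOfTasks and segmentOne are illustrative, not from the package):

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

nbrOfTasks <- 3L  # e.g. number of chromosomes / known segments
seeds <- vector("list", length = nbrOfTasks)
seed <- .Random.seed
for (ii in seq_len(nbrOfTasks)) {
  seed <- parallel::nextRNGStream(seed)  # advance to the next independent substream
  seeds[[ii]] <- seed
}

# Each task restores its pre-assigned seed before doing any random draws:
segmentOne <- function(ii) {
  .GlobalEnv$.Random.seed <- seeds[[ii]]
  sample(0:9, size = 1L)  # stand-in for the per-chromosome segmentation call
}

res <- sapply(seq_len(nbrOfTasks), segmentOne)

# Running the tasks in reverse order yields the same per-task results:
res2 <- sapply(rev(seq_len(nbrOfTasks)), segmentOne)[rev(seq_len(nbrOfTasks))]
identical(res, res2)  # TRUE
```

The seeds would be consumed in a controlled order (by task index), so reproducibility holds under eager, lazy, or multicore evaluation alike.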