HenrikBengtsson / PSCBS

🔬 R package: Analysis of Parent-Specific DNA Copy Numbers
https://cran.r-project.org/package=PSCBS
7 stars 4 forks source link

PARALLEL: Reproducibility and random seeds #30

Closed HenrikBengtsson closed 8 years ago

HenrikBengtsson commented 8 years ago

By introducing parallel processing, we have to worry about the random generator, because DNAcopy relies on it.

We already have argument seed for the segmentation methods. However, that is used to set the random seed when the function is first called (and undone afterward) and then individual chromosomes (or known segments) are segmented sequentially / separately, the same function is called on the subset of interest with argument seed=NULL. This guarantees reproducibility if the initial call is set with a random seed.

However, with asynchronous/parallel processing of individual chromosomes / known segments, the order of segmentation is no longer guaranteed. Because of this, setting the random seed once / at the very beginning no longer guarantees the same results. We need to have a way to control the random seeds throughout the process. This might explain this validation error.

We should probably not reuse the initial random seed. However, we might be able to pre-generate an additional set from the original seed and pass them down one by one (as they are consumed) in a controlled order. Sketch:

set.seed(seed)
seedsChrs <- sample(size=nbrOfChromosomes)
seedsSegs <- sample(size=nbrOfKnownSegments)
...

for (kk in seq_len(nbrOfChromosomes)) {
  seedKK <- seedsChrs[kk]
  fitList[[kk]] %<=% {
     fit <- segmentByCBS(..., seed=seedKK)
     ...
  }
}
HenrikBengtsson commented 8 years ago

Adopted from Section 6 of vignette("parallel", package="parallel"):

fsample <- function(x, size=10L, seed=NULL, debug=FALSE) {
  base::RNGkind("L'Ecuyer-CMRG")
  if (!is.null(seed)) set.seed(seed)
  .seed <- .Random.seed
  res <- listenv::listenv()
  for (ii in seq_len(size)) {
    .seed <- parallel::nextRNGStream(.seed)
    res[[ii]] %<=% {
      .GlobalEnv$.Random.seed <- .seed
      sample(x, size=1L)
    }
  }
  unlist(res)
} # fsample()

and

> library("future")
> plan(eager)
> fsample(0:9)
 [1] 7 4 3 5 6 2 3 8 2 8
> fsample(0:9)
 [1] 9 2 9 5 4 3 7 8 9 6
> fsample(0:9, seed=42L)
 [1] 8 4 5 5 8 1 4 9 5 4
> fsample(0:9, seed=42L)
 [1] 8 4 5 5 8 1 4 9 5 4

> plan(lazy)
> fsample(0:9)
 [1] 0 4 3 1 4 8 6 6 0 3
> fsample(0:9)
 [1] 3 0 2 9 5 3 8 6 9 2
> fsample(0:9, seed=42L)
 [1] 8 4 5 5 8 1 4 9 5 4
> fsample(0:9, seed=42L)
 [1] 8 4 5 5 8 1 4 9 5 4

> plan(multicore)
> fsample(0:9)
 [1] 9 2 1 6 4 3 1 7 9 9
> fsample(0:9)
 [1] 5 6 6 9 3 0 9 9 3 7
> fsample(0:9, seed=42L)
 [1] 8 4 5 5 8 1 4 9 5 4
> fsample(0:9, seed=42L)
 [1] 8 4 5 5 8 1 4 9 5 4
HenrikBengtsson commented 8 years ago

Added test for the above to the future package, cf. https://github.com/HenrikBengtsson/future/commit/1c50b6fabdbfd0788d5a51b6359689a29a02770a

HenrikBengtsson commented 8 years ago

We now have numerical reproducibility regardless what type of futures are used. Package (commit 725d1be) verified using R CMD check --as-cran on: