The "Random-number generation" section of vignettes(topic = "parallel", package = "parallel") describes concerns about pseudo-random numbers in parallel computing. If we have different RNG streams on different parallel processes, we want those streams to be independent. However, there is a worry that those streams will eventually overlap and produce correlated results. All algorithmic PRNGs are periodic, so overlap is guaranteed if the streams runs forever, but we want this outcome to be as unlikely as possible in practice.
The parallel package provides a solution that works in many cases: Select the "L'Ecuyer-CMRG" algorithm and generate widely-spaced RNG streams one after another in recursive fashion (see L'Ecuyer et al (2001)). However, this is not feasible for all scenarios. In targets, for example, each task absolutely must set up its own reproducible RNG stream independently of all the other tasks. If the starting seed of one task depends on the starting seed of a different task, then targets cannot achieve its most essential Make-like reproducibility goals. See https://github.com/ropensci/targets/issues/1139 for a discussion.
Each task, or "target", already has its own unique user-defined name. targets throws an error and refuses to run the pipeline unless these names are all unique.
tar_seed_create() accepts the name and computes a SHA512 hash, so it is effectively "counter-based RNG" as described in the last paragraph of L'Ecuyer et al. (2017) Section 3: "Main classes of RNGs for simulation".
The counter-based RNG from (2) supplies seeds to the parallel task-specific RNG streams.
Justification for this approach comes from L'Ecuyer et al. (2017) Section 4 "How to produce parallel streams and sub streams" Subsection "A single RNG with a "random" seed for each stream". The authors show that the probability of overlap is vanishingly small when the RNG streams are constructed this way.
The only snags are:
The task name is not exactly a counter, but that seems okay because the recursive integer aspect of the counter is arbitrary for my purposes.
I need to convert the SHA512 hash to a 32-bit integer because the seed argument of set.seed() cannot accept input longer than 32 bits.
I posted this issue/wish because of (2).
Wish
tar_seed_create() would be much more robust if I could supply all the bytes of the SHA512 hash string to set.seed(). This change would help in similar scenarios where the "counter-based RNG + randomly-generated seed" approach is preferred over recursive L'Ecuyer streams. And based on vignettes(topic = "parallel", package = "parallel"), I feel this enhancement would serve the interests of R core.
Current behavior
If I supply a vector of 32-bit integers, only the first integer is used for the seed:
And if I try a scalar longer than 32 bits, I get an error:
suppressPackageStartupMessages(library(bit64))
large_seed <- as.integer64("12345678901234567890")
set.seed(seed = large_integer)
#> Error in eval(expr, envir, enclos): object 'large_integer' not found
set.seed(as.double(.Machine$integer.max) + 1)
#> Warning in set.seed(as.double(.Machine$integer.max) + 1): NAs introduced by
#> coercion to integer range
#> Error in set.seed(as.double(.Machine$integer.max) + 1): supplied seed is not a valid integer
.Random.seed is much longer than a single integer, especially for Mersenne-Twister. However, user-defined .Random.seed objects are not necessarily valid, and they are algorithm-dependent anyway. Users should only create a .Random.seed object if they know the algorithm they are using and they call a function like parallel::nextRNGStream().
Background
The "Random-number generation" section of
vignettes(topic = "parallel", package = "parallel")
describes concerns about pseudo-random numbers in parallel computing. If we have different RNG streams on different parallel processes, we want those streams to be independent. However, there is a worry that those streams will eventually overlap and produce correlated results. All algorithmic PRNGs are periodic, so overlap is guaranteed if the streams runs forever, but we want this outcome to be as unlikely as possible in practice.The
parallel
package provides a solution that works in many cases: Select the "L'Ecuyer-CMRG" algorithm and generate widely-spaced RNG streams one after another in recursive fashion (see L'Ecuyer et al (2001)). However, this is not feasible for all scenarios. Intargets
, for example, each task absolutely must set up its own reproducible RNG stream independently of all the other tasks. If the starting seed of one task depends on the starting seed of a different task, thentargets
cannot achieve its most essentialMake
-like reproducibility goals. See https://github.com/ropensci/targets/issues/1139 for a discussion.Borrowing pieces of L'Ecuyer et al. (2017), I implemented the safeguard described in https://docs.ropensci.org/targets/reference/tar_seed_create.html#rng-overlap, and the R code is here in
tar_seed_create()
. How it works:targets
throws an error and refuses to run the pipeline unless these names are all unique.tar_seed_create()
accepts the name and computes a SHA512 hash, so it is effectively "counter-based RNG" as described in the last paragraph of L'Ecuyer et al. (2017) Section 3: "Main classes of RNGs for simulation".Justification for this approach comes from L'Ecuyer et al. (2017) Section 4 "How to produce parallel streams and sub streams" Subsection "A single RNG with a "random" seed for each stream". The authors show that the probability of overlap is vanishingly small when the RNG streams are constructed this way.
The only snags are:
seed
argument ofset.seed()
cannot accept input longer than 32 bits.I posted this issue/wish because of (2).
Wish
tar_seed_create()
would be much more robust if I could supply all the bytes of the SHA512 hash string toset.seed()
. This change would help in similar scenarios where the "counter-based RNG + randomly-generated seed" approach is preferred over recursive L'Ecuyer streams. And based onvignettes(topic = "parallel", package = "parallel")
, I feel this enhancement would serve the interests of R core.Current behavior
If I supply a vector of 32-bit integers, only the first integer is used for the seed:
Created on 2023-10-13 with reprex v2.0.2
And if I try a scalar longer than 32 bits, I get an error:
Created on 2023-10-13 with reprex v2.0.2
What about .Random.seed?
.Random.seed
is much longer than a single integer, especially for Mersenne-Twister. However, user-defined.Random.seed
objects are not necessarily valid, and they are algorithm-dependent anyway. Users should only create a.Random.seed
object if they know the algorithm they are using and they call a function likeparallel::nextRNGStream()
.