HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!
https://github.com/HenrikBengtsson/Wishlist-for-R/issues
GNU Lesser General Public License v3.0
134 stars 4 forks source link

Permit larger seed argument values in set.seed() #152

Open wlandau opened 1 year ago

wlandau commented 1 year ago

Background

The "Random-number generation" section of vignettes(topic = "parallel", package = "parallel") describes concerns about pseudo-random numbers in parallel computing. If we have different RNG streams on different parallel processes, we want those streams to be independent. However, there is a worry that those streams will eventually overlap and produce correlated results. All algorithmic PRNGs are periodic, so overlap is guaranteed if the streams runs forever, but we want this outcome to be as unlikely as possible in practice.

The parallel package provides a solution that works in many cases: Select the "L'Ecuyer-CMRG" algorithm and generate widely-spaced RNG streams one after another in recursive fashion (see L'Ecuyer et al (2001)). However, this is not feasible for all scenarios. In targets, for example, each task absolutely must set up its own reproducible RNG stream independently of all the other tasks. If the starting seed of one task depends on the starting seed of a different task, then targets cannot achieve its most essential Make-like reproducibility goals. See https://github.com/ropensci/targets/issues/1139 for a discussion.

Borrowing pieces of L'Ecuyer et al. (2017), I implemented the safeguard described in https://docs.ropensci.org/targets/reference/tar_seed_create.html#rng-overlap, and the R code is here in tar_seed_create(). How it works:

  1. Each task, or "target", already has its own unique user-defined name. targets throws an error and refuses to run the pipeline unless these names are all unique.
  2. tar_seed_create() accepts the name and computes a SHA512 hash, so it is effectively "counter-based RNG" as described in the last paragraph of L'Ecuyer et al. (2017) Section 3: "Main classes of RNGs for simulation".
  3. The counter-based RNG from (2) supplies seeds to the parallel task-specific RNG streams.

Justification for this approach comes from L'Ecuyer et al. (2017) Section 4 "How to produce parallel streams and sub streams" Subsection "A single RNG with a "random" seed for each stream". The authors show that the probability of overlap is vanishingly small when the RNG streams are constructed this way.

The only snags are:

  1. The task name is not exactly a counter, but that seems okay because the recursive integer aspect of the counter is arbitrary for my purposes.
  2. I need to convert the SHA512 hash to a 32-bit integer because the seed argument of set.seed() cannot accept input longer than 32 bits.

I posted this issue/wish because of (2).

Wish

tar_seed_create() would be much more robust if I could supply all the bytes of the SHA512 hash string to set.seed(). This change would help in similar scenarios where the "counter-based RNG + randomly-generated seed" approach is preferred over recursive L'Ecuyer streams. And based on vignettes(topic = "parallel", package = "parallel"), I feel this enhancement would serve the interests of R core.

Current behavior

If I supply a vector of 32-bit integers, only the first integer is used for the seed:

differences <- 0L
set.seed(c(1L, 1L))
result <- runif(n = 1L)
for (index in seq_len(1000L)) {
  set.seed(c(1L, index))
  differences <- differences + (runif(n = 1L) != result)
}
differences
#> [1] 0

Created on 2023-10-13 with reprex v2.0.2

And if I try a scalar longer than 32 bits, I get an error:

suppressPackageStartupMessages(library(bit64))
large_seed <- as.integer64("12345678901234567890")
set.seed(seed = large_integer)
#> Error in eval(expr, envir, enclos): object 'large_integer' not found
set.seed(as.double(.Machine$integer.max) + 1)
#> Warning in set.seed(as.double(.Machine$integer.max) + 1): NAs introduced by
#> coercion to integer range
#> Error in set.seed(as.double(.Machine$integer.max) + 1): supplied seed is not a valid integer

Created on 2023-10-13 with reprex v2.0.2

What about .Random.seed?

.Random.seed is much longer than a single integer, especially for Mersenne-Twister. However, user-defined .Random.seed objects are not necessarily valid, and they are algorithm-dependent anyway. Users should only create a .Random.seed object if they know the algorithm they are using and they call a function like parallel::nextRNGStream().

rstub commented 1 year ago

This would be great! In dqrng I allow seeding with 64bits via a 2-element integer vector.

EDIT: Added CRAN link /@HenrikBengtsson