Closed LTLA closed 5 years ago
Let's see what's currently implemented:
std::random_device
and time (the latter because of mingw where std::random_device
is deterministic).I see two additional uses cases from what you are describing.
You are correctly pointing out that seeding with only 32 bits is not very good. I would still prefer it, though, if the user could make some sense from the command. Something like
dqset.seed(c(4711, 42))
could be used to specify two (or more) integers. This is the sort of usage that I am missing when using raw vectors.
The integer vector might then either be used directly or fed to sed_seq_fe
to provide the seeds. Alternatively, the bit64
package could be used. This provides (signed) 64bit integers for R users by reinterpreting the bit pattern of doubles. Using them is quite simple:
#include <cstdint>
#include <Rcpp.h>
// [[Rcpp::export]]
void printSeed1(Rcpp::NumericVector seed_vec) {
uint64_t _seed;
double seed;
static_assert(sizeof seed == sizeof _seed, "Size does not match!");
for (int i = 0; i < seed_vec.length(); ++i) {
seed = seed_vec(i);
std::memcpy(&_seed, &seed, sizeof seed);
Rcpp::Rcout << "double: " << seed << std::endl;
Rcpp::Rcout << "uint64: " << _seed << std::endl;
}
}
// [[Rcpp::export]]
void printSeed2(Rcpp::NumericVector seed_vec) {
int64_t _seed;
double seed;
static_assert(sizeof seed == sizeof _seed, "Size does not match!");
for (int i = 0; i < seed_vec.length(); ++i) {
seed = seed_vec(i);
std::memcpy(&_seed, &seed, sizeof seed);
Rcpp::Rcout << "double: " << seed << std::endl;
Rcpp::Rcout << "int64: " << _seed << std::endl;
}
}
/*** R
library(bit64)
seed <- as.integer64(c("1234567890987654321", "12345", "-1234567890987654321"))
seed
unclass(seed)
printSeed1(seed)
printSeed2(seed)
*/
Output:
> library(bit64)
> seed <- as.integer64(c("1234567890987654321", "12345", "-1234567890987654321"))
> seed
integer64
[1] 1234567890987654321 12345 -1234567890987654321
> unclass(seed)
[1] 3.813124e-226 6.099240e-320 -1.107996e+226
> printSeed1(seed)
double: 3.81312e-226
uint64: 1234567890987654321
double: 6.09924e-320
uint64: 12345
double: -1.108e+226
uint64: 17212176182721897295
> printSeed2(seed)
double: 3.81312e-226
int64: 1234567890987654321
double: 6.09924e-320
int64: 12345
double: -1.108e+226
int64: -1234567890987654321
So an integer64
vector is actually a double
vector with an additional class and can be moved to C++ as Rcpp::NumericVector
. Reinterpreting these doubles as signed 64 bit integers gives us the original interpretation in R. Reinterpreting them as unsigned 64 bit integers gives a different interpretation for negative/large values, but that's more useful when using them as seed.
Overall I would prefer using an IntegerVector
, a single integer64
or an integer64
vector as input for dqset.seed
.
You want some randomize
function which uses R's RNG to provide you with a new seed for the RNG in use. Getting a random uint32_t
from R is quite easy at the C++ level:
uint32_t random32() {
return R::runif(0, 1) * 4294967296; /* 2^32 */
}
This works correctly for the default Mersenne-Twister. For other RNGs implemented in R, this will probably miss some possible integer values, since the RNG does not provide 32 bits of randomness. I don't see an easy way to solve this, though. So I would find it easiest to have a C++ function which calls R's RNG to provide whatever extended seed structure is used. That structure can then either be returned to R, so that the user can safe the values for reproducing the results in the future. Or it can be fed to the RNGs, just like above.
BTW, I like the idea of an randomize
function, since I could use that instead of std::random_device
. That's from the random
header which one is not supposed to use in R extensions. And the analogue from boost requires linking so that the BH
package is not enough. Of course, this would also mean that I should no longer use Mersenne-Twister as default RNG, since using MT (default in R) to generate seeds for MT is considered a bad idea. But then there are other reasons to remove MT: Currently it also comes from the random
header. And the version from boost misses a few constexpr
. In addition, all other RNGs support multiple streams some way or other, which I would like to make available from R. With MT one can only use poor-man's streams, i.e. consecutive seeds, and hope for the best ...
This is motivated by the usage of PRNGs in R packages. As a R package developer, I use dqrng's PRNGs in my C++ code, seeded to ensure reproducible execution. However, I also want my functions to respond to the state of R's seed, so that users running my functions multiple times (e.g., to check the variance due to the calculations) will get different results. A natural solution to both of these constraints is to randomly define seeds in R and then pass those seeds to the C++ code for seeding and further random number generation.
Hmm, I do not quite understand what you mean by that. Do you create a new RNG on every function call?
BTW, is the source code of these two packages on GitHub or similar. I am curious how you are using the package. :-)
Thanks for the comments @rstub.
I can empathize with your wariness towards raw vectors, as I felt the same way when I first saw them. (What's this 0e f6 a3
nonsense? Must be a segmentation fault!) But I have grudgingly come around to concede that they are useful in some cases. I'll start by pointing out that you can directly create raw-vector seeds:
dqset.seed(as.raw(c(0x10, 0x20, 0x30, 0x40)))
This is about as interpretable as a integer-vector seed; the values aren't very user-friendly in either case. Yes, it's more typing, but someone going for an easy manual initialization should just supply an integer scalar directly; they'll never do it enough times to run into statistical biases from the limited state space.
If manual usability is critical, integer vectors are probably better than the bit64
approach. The latter requires an additional Depends
and doesn't allow us to avoid using vectors (e.g., if you're seeding one of the PRNGs with 128 bits of state in your package's C++ code). As you say, sampling these integers across their full range would be relatively easy in C++, though passing them to R would require some care with the signed/unsigned transition. And I, at least, do want some method of getting the random seeds in my R session - see below.
I'm also surprised that bit64
involves reinterpreting a double
like that, I thought that would be implementation-specific behaviour. Maybe it's always the same on the platforms on which R compiles...
The clearest example is probably in DropletUtils in the emptyDrops
function. This operates on single-cell genomics data where the C++ code loops across cells and does a series of Monte Carlo simulations to compute a per-cell p-value. Normally, I would just use R's PRNG within C++, i.e., initialize the PRNG at the start of the C++ function and run through all cells without resetting the PRNG or anything.
The problem is that this code is potentially parallelized at the R-level. This is achieved via the the bplapply
call (from the BiocParallel package), with a variety of parallelization schemes defined by the BPPARAM
argument. This means that I get a different series of random numbers depending on the parallelization scheme, i.e., number of workers and how the jobs are partitioned across those workers. Obviously, this is not desirable as I would like to get the same results regardless of the chosen parallelization.
The solution is to seed a fresh PRNG for each cell. This guarantees that I get the same results regardless of how the jobs (one job per cell) are distributed across workers, as each cell's series of random numbers are unrelated to that of another cell's. Moreover, I'm using PCG32 with different streams, which provides an added level of reassurance with respect to statistical independence. The construction of the PRNG is done within the cell-by-cell loop in C++, but the generation of the seeds is still performed in the R code where (i) it can be affected by set.seed()
and (ii) it is still serially executed and thus avoids weird changes in seeds in workers under some parallelization schemes, e.g., those using cluster job schedulers.
This approach does, however, rely on a lot of re-seeding, which means that statistical biases may manifest in the random numbers if I can only seed a part of the state space. Which brings us to #10 and this issue.
If raw vectors are too unpalatable, I just did a refactor to use integer vectors - this is on the integerseed
branch of my fork here. As anticipated, some work was required to guarantee a sensible uint32_t
to int
conversion, which would otherwise be implementation-defined if we left it to a static_cast
. The relevant file is convert_seed.h
, which contains all the new goodies.
Thanks for your comments, @LTLA. You almost had me convinced that raw vectors are a good idea. However, your code triggered my urge to experiment with Melissa O'Neil's seed_seeq_fe
mentioned above. And I think this looks quite promising:
int
s, which might be supplied by the user or derived from R's RNG.int
s from it that can used to reconstruct an identical copy.Example:
#include <Rcpp.h>
#include "randutils.hpp"
int32_t R_random_32 () {
constexpr double upper_limit=2147483648;
double val=R::unif_rand() * upper_limit * 2 - upper_limit;
if (val >= upper_limit || val < -upper_limit) { val=-upper_limit; } // Absolutely avoid over-/underflow.
return static_cast<int32_t>(val);
}
// [[Rcpp::export]]
Rcpp::IntegerVector getParam() {
Rcpp::IntegerVector raw_seeds(4);
std::generate(raw_seeds.begin(), raw_seeds.end(), R_random_32);
randutils::seed_seq_fe128 seeder(raw_seeds.begin(), raw_seeds.end());
Rcpp::IntegerVector param(4);
seeder.param(param.begin());
return param;
}
// [[Rcpp::export]]
Rcpp::IntegerVector getSeeds(int n,
Rcpp::Nullable<Rcpp::IntegerVector> param = R_NilValue) {
Rcpp::IntegerVector raw_seeds(4);
if (param.isNull()) {
std::generate(raw_seeds.begin(), raw_seeds.end(), R_random_32);
} else {
raw_seeds = param.as();
}
randutils::seed_seq_fe128 seeder(raw_seeds.begin(), raw_seeds.end());
Rcpp::IntegerVector seeds(n);
seeder.generate(seeds.begin(), seeds.end());
return seeds;
}
/*** R
set.seed(42)
getSeeds(6)
getSeeds(6)
set.seed(42)
getSeeds(6)
getSeeds(6)
set.seed(42)
one <- getParam()
two <- getParam()
getSeeds(6, one)
getSeeds(6, two)
getSeeds(6, 42)
getSeeds(6, c(4711, 42))
*/
Output:
>set.seed(42)
>getSeeds(6)
[1] -1368039275 -1100785627 1327213203 1033422429 1573423832 475391043
> getSeeds(6)
[1] 909206077 1132750064 872894922 151901752 1480135365 -947383956
> set.seed(42)
> getSeeds(6)
[1] -1368039275 -1100785627 1327213203 1033422429 1573423832 475391043
> getSeeds(6)
[1] 909206077 1132750064 872894922 151901752 1480135365 -947383956
> set.seed(42)
> one <- getParam()
> two <- getParam()
> getSeeds(6, one)
[1] -1368039275 -1100785627 1327213203 1033422429 1573423832 475391043
> getSeeds(6, two)
[1] 909206077 1132750064 872894922 151901752 1480135365 -947383956
> getSeeds(6, 42)
[1] -850130249 -1625411987 2046530742 -713526308 1691623607 2099784219
> getSeeds(6, c(4711, 42))
[1] 1485050188 -994439850 532523273 -1671404795 -1234644656 810452082
Right now I am unsure, though, how that handles the unsigned to signed conversion. So the above results might be implementation specific (at least until C++20 is used as I just learned). BTW, did you consider std::memcopy
for that?
Concerning you parallel RNG usage:
The solution is to seed a fresh PRNG for each cell. This guarantees that I get the same results regardless of how the jobs (one job per cell) are distributed across workers, as each cell's series of random numbers are unrelated to that of another cell's. Moreover, I'm using PCG32 with different streams, which provides an added level of reassurance with respect to statistical independence. The construction of the PRNG is done within the cell-by-cell loop in C++, but the generation of the seeds is still performed in the R code where (i) it can be affected by set.seed() and (ii) it is still serially executed and thus avoids weird changes in seeds in workers under some parallelization schemes, e.g., those using cluster job schedulers.
I am not sure that this is the way PCG should be used in parallel. My understanding is that one would use the same seed for all the RNGs. The different workers would then do one of two things:
In both cases the random numbers generated should be independent, although skipping ahead is probably unpractical given the (rather short) period of pcg32. Using different seeds is similar to skipping ahead, but one does not know how many numbers one can draw. You are mitigating this using in addition different streams. But the different streams alone should be enough.
Regarding seed_seq_fe
: I don't have much of an opinion here. It seems cool but I was happy to rely on R's MT stream being "random enough" for seeding purposes. Happy to go with whatever you decide.
BTW, did you consider
std::memcopy
for that?
Briefly. But I didn't know whether a memcpy
from an unsigned to signed type would behave in a standard way, and I didn't want to take responsibility for introducing difficult-to-track implementation-specific behaviour.
But the different streams alone should be enough.
That's probably true. I don't think it hurts to have different seeds either, but I'm willing to be told otherwise. (Do you know someone on the PCG team who could give an authoritative answer?) It would make it slightly easier for me if I just had to create one PCG seed vector in my functions.... but not that much, and it would have to be done in the serially executed part of R anyway.
Edit, 5 hours later: I spent some time thinking about the use of seed_seq_fe
here - and I don't think it's necessary. From what I understand, the seed sequence is intended to "inflate" a small amount of entropy to get enough bits for seeding. But here, R's MT stream is already giving us decent random numbers for seeding. There's no need to complicate the dqrng API with extra work and bug potential, e.g., signed vs. unsigned, how to guarantee randomness/full state space exploration when more than four integers are requested.
Perhaps the use of the seed sequence would be useful for handling the case where dqset.seed()
is supplied with a single integer. Otherwise, the current changes in #10 (now switched to using integer vectors, as suggested) seem sufficient for all but the most advanced PRNG users.
I spent some time thinking about the use of seed_seq_fe here - and I don't think it's necessary. From what I understand, the seed sequence is intended to "inflate" a small amount of entropy to get enough bits for seeding. But here, R's MT stream is already giving us decent random numbers for seeding.
ACK. It might make sense for seeding MT, but MT support in dqrng is going to be removed anyway.
I don't think it hurts to have different seeds either, but I'm willing to be told otherwise. (Do you know someone on the PCG team who could give an authoritative answer?)
Unfortunately I have no such contact. But I have seen the issue you raised in the GitHub repository. Let's hope you will get an answer there. The blog article you linked is indeed rather unclear. Section 4.3.2 of her paper reads quite different:
Every choice of c [the additive constant in the LCG] results in a different sequence of numbers that has none of its pairs of successive outputs in common with another sequence.
That's why I used constant seed and consecutive stream IDs in the parallel vignette.
Fixed in #10.
For posterity's sake, I had a go at using dqrng with multiple streams vs multiple seeds:
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include "convert_seed.h"
#include "pcg_random.hpp"
// [[Rcpp::export(rng=false)]]
Rcpp::IntegerVector spew_raw_numbers(Rcpp::IntegerVector seed, int stream, int N=10000) {
pcg32 rng(dqrng::convert_seed<uint64_t>(seed), stream);
Rcpp::IntegerVector output(N);
for (auto& o : output) {
uint32_t sampled=rng();
// Converting uint to int.
constexpr uint32_t max_int=2147483647;
if (sampled <= max_int) {
o=static_cast<int>(sampled);
} else {
constexpr uint32_t max_uint=-1;
o=-static_cast<int>(max_uint - sampled) - 1;
}
}
return output;
}
Rcpp::sourceCpp("streamer.cpp")
collected0 <- collected1 <- collected2 <- numeric(1000)
for (i in seq_along(collected1)) {
out <- dqrng::generateSeedVectors(2)
x1_1 <- spew_raw_numbers(out[[1]], 1, 20000)
x1_1a <- head(x1_1, 10000)
x1_1b <- tail(x1_1, 10000)
x2_2 <- spew_raw_numbers(out[[2]], 2)
x1_2 <- spew_raw_numbers(out[[1]], 2)
# Switch functions to test different types of correlations:
# FUN <- cor
# FUN <- function(...) cor(..., method="spearman")
FUN <- function(...) max(abs(ccf(..., plot=FALSE)$acf))
collected0[i] <- FUN(x1_1a, x1_1b)
collected1[i] <- FUN(x1_1a, x2_2)
collected2[i] <- FUN(x1_1a, x1_2)
}
mean(abs(collected0))
mean(abs(collected1))
mean(abs(collected2))
max(abs(collected0))
max(abs(collected1))
max(abs(collected2))
Yeah, it's not very sophisticated, but it's the best I've got without further knowledge of the failure modes of this PRNGs. I really wish someone would chime in on imneme/pcg-cpp#49...
For the amount of random numbers I'm generating, there are no observable differences in mean or maximum correlation from a jump-ahead approach on a single stream compared to multiple streams with the same seed or multiple streams with different seeds. So both our approaches are probably fine. Probably.
As suggested by @rstub on #10, I'll try to justify why I chose raw vectors for generating >32 bit seeds.
Why raw vectors?
Raw vectors are
unsigned char
vectors represented in R, with size in bytes equal to its length (plus some overhead for R vectors). This provides a nice representation in which one can specify the settings for an arbitrary number of bits. Each entry of the raw vector stores 8 bits, so to store more bits, one just uses a longer vector. This is easy to construct in R (see the code in #10); it is also easy to reason with in C++ requiring only a few bit shifts and masks to create auintX_t
seed integer from a raw vector.Let's now consider the integer as an alternative. R uses a 32-bit signed integer representation, so we would need to use two integers to generate a 64-bit seed. Thus, we could use an integer vector instead of a raw vector. Generation of this seed in R could then feasibly be done with:
... which would then be converted into unsigned integers and, with some bit operations, used to generate a 64-bit seed. The problem is that the above code will not span the entire 32-bit range of possible signed integers, as
-INT_MAX
is lost to representNA_INTEGER
in R. The correct way to do it:... causes a
Error: cannot allocate vector of size 16.0 Gb
. And, to be honest, I find reasoning about signed-to-unsigned integer conversions to be difficult, though this might just be my problem.The final numerical alternative is to use a
double
. My first attempt was to actually generate a number within the range of possible 64-bit unsigned integers, and pass that to C++:The problem here is that I am not guaranteed to sample from the full space of seeds, as double-precision numbers cannot represent every 64-bit integer perfectly. There will be some integers that simply will never be sampled, i.e.,
runif
cannot generate numbers within[i, i+1)
for some integersi
. The coercion fromdouble
touintX_t
in C++ is also difficult as we need to check that the number can fit within the unsigned range; this is not straightforward if we cannot accurately represent the maximum unsigned number as a double for comparison. (And the penalty of not doing so is extreme - undefined behaviour!)My second, much shorter-lived attempt was to try to directly convert the bits in a floating point number to a 64-bit unsigned integer. This is probably not a good idea - IIRC, it gets into implementation-specific details about how data are represented on different platforms. Moreover, even if this were portably possible, it is not clear how to sample a floating point number in a way that would get an equal chance of setting each bit.
So, by process of elimination, I ended up with a design using raw vectors. I have not considered using strings because
\0
would need to be sampled to offer 8 bits of randomness per character.Why is
generateRawSeeds()
an R function?This is motivated by the usage of PRNGs in R packages. As a R package developer, I use dqrng's PRNGs in my C++ code, seeded to ensure reproducible execution. However, I also want my functions to respond to the state of R's seed, so that users running my functions multiple times (e.g., to check the variance due to the calculations) will get different results. A natural solution to both of these constraints is to randomly define seeds in R and then pass those seeds to the C++ code for seeding and further random number generation.
In this context, it is sensible to have
generateRawSeeds()
as an R function. I can use this function in my package functions to generate seeds in a manner that responds to user-specifiedset.seed()
. As an added bonus, it also allows users in interactive sessions to callgenerateRawSeeds()
to seed dqrng's R-visible distribution functions with 64-bit of randomness. (Possibly want to setbits=64
by default).It is possible to translate this into a C++ function, and just have the R function as a wrapper. This would allow generation of raw-vector seeds from C++ as well, which might be desirable in some settings. I have not done so because... well, I guess I didn't really need it. A direct translation will also have some issues with replacing
sample
withstd::shuffle
due to the latter's lack of reproducibility across platforms. ~One could useboost::random::uniform_int_distribution
if R's API exposed the raw PRNG... which it doesn't seem to. In scran, I have added my own C++ shuffling algorithm to ensure that I get the same result across platforms, but this was more of a change than I was willing to make here without further approval.~ It should be possible to do something fairly simple withR::unif_rand() * 256
if this is considered to be necessary.