daqana / dqrng

Fast Pseudo Random Number Generators for R
https://daqana.github.io/dqrng/

Design of seeds with more than 32 bits of randomness #11

Closed LTLA closed 5 years ago

LTLA commented 5 years ago

As suggested by @rstub on #10, I'll try to justify why I chose raw vectors for generating >32 bit seeds.

Why raw vectors?

Raw vectors are R's representation of unsigned char vectors, with a size in bytes equal to their length (plus some overhead for R vectors). This provides a nice representation in which one can specify the settings for an arbitrary number of bits: each entry of the raw vector stores 8 bits, so to store more bits, one just uses a longer vector. This is easy to construct in R (see the code in #10); it is also easy to reason about in C++, requiring only a few bit shifts and masks to create a uintX_t seed integer from a raw vector.
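
To make the C++ side concrete, here is a minimal sketch (not dqrng's actual code; raw_to_seed is a hypothetical helper) of folding the bytes of a raw vector into a 64-bit seed with shifts and masks:

#include <cstdint>
#include <Rcpp.h>

// Hypothetical helper: fold the bytes of a raw vector into a single
// 64-bit seed, most significant byte first. A longer raw vector would
// feed a wider uintX_t in exactly the same way.
uint64_t raw_to_seed(Rcpp::RawVector bytes) {
    uint64_t seed = 0;
    for (int i = 0; i < bytes.size() && i < 8; ++i) {
        seed <<= 8;                              // make room for 8 new bits
        seed |= static_cast<uint64_t>(bytes[i]); // OR in the next byte
    }
    return seed;
}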

Let's now consider the integer as an alternative. R uses a 32-bit signed integer representation, so we would need to use two integers to generate a 64-bit seed. Thus, we could use an integer vector instead of a raw vector. Generation of this seed in R could then feasibly be done with:

sample((-.Machine$integer.max):(.Machine$integer.max), 2, replace=TRUE)

... which would then be converted into unsigned integers and, with some bit operations, used to generate a 64-bit seed. The problem is that the above code will not span the entire range of possible 32-bit signed integers, as INT_MIN (i.e., -.Machine$integer.max - 1) is reserved to represent NA_INTEGER in R. The correct way to do it:

X <- c(NA, (-.Machine$integer.max):(.Machine$integer.max))
sample(X, 2, replace=TRUE)

... causes an Error: cannot allocate vector of size 16.0 Gb. And, to be honest, I find reasoning about signed-to-unsigned integer conversions difficult, though this might just be my problem.
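
For reference, the bit operations mentioned above are well defined in C++ even for negative inputs; a minimal sketch, with ints_to_seed as a hypothetical helper name:

#include <cstdint>

// Hypothetical sketch: combine two 32-bit signed R integers into one
// 64-bit unsigned seed. Casting int32_t to uint32_t is well defined
// (reduction modulo 2^32), so negative inputs map onto the upper half.
uint64_t ints_to_seed(int32_t hi, int32_t lo) {
    uint64_t h = static_cast<uint32_t>(hi);
    uint64_t l = static_cast<uint32_t>(lo);
    return (h << 32) | l;
}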

The final numerical alternative is to use a double. My first attempt was to actually generate a number within the range of possible 64-bit unsigned integers, and pass that to C++:

runif(1, 0, 2^64-1)

The problem here is that I am not guaranteed to sample from the full space of seeds, as double-precision numbers cannot represent every 64-bit integer perfectly. There will be some integers that simply will never be sampled, i.e., runif cannot generate numbers within [i, i+1) for some integers i. The coercion from double to uintX_t in C++ is also difficult as we need to check that the number can fit within the unsigned range; this is not straightforward if we cannot accurately represent the maximum unsigned number as a double for comparison. (And the penalty of not doing so is extreme - undefined behaviour!)
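
For what it's worth, the range check itself can be made safe by comparing against 2^64, which (unlike 2^64-1) is a power of two and therefore exactly representable as a double; a minimal sketch with a hypothetical checked_to_uint64 helper:

#include <cstdint>
#include <cmath>
#include <stdexcept>

// Hypothetical sketch: convert a double to uint64_t only after checking
// that it lies in [0, 2^64). The comparison uses 2^64, which is exactly
// representable as a double; static_cast of an in-range double to
// uint64_t is then well defined (truncation towards zero).
uint64_t checked_to_uint64(double x) {
    constexpr double upper = 18446744073709551616.0; // 2^64, exact
    if (!std::isfinite(x) || x < 0.0 || x >= upper) {
        throw std::range_error("value does not fit in uint64_t");
    }
    return static_cast<uint64_t>(x);
}

This does not fix the coverage problem, of course; doubles above 2^53 still cannot hit every integer.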

My second, much shorter-lived attempt was to try to directly convert the bits in a floating point number to a 64-bit unsigned integer. This is probably not a good idea - IIRC, it gets into implementation-specific details about how data are represented on different platforms. Moreover, even if this were portably possible, it is not clear how to sample a floating point number in a way that would get an equal chance of setting each bit.

So, by process of elimination, I ended up with a design using raw vectors. I have not considered using strings because \0 would need to be sampled to offer 8 bits of randomness per character.

Why is generateRawSeeds() an R function?

This is motivated by the usage of PRNGs in R packages. As an R package developer, I use dqrng's PRNGs in my C++ code, seeded to ensure reproducible execution. However, I also want my functions to respond to the state of R's seed, so that users running my functions multiple times (e.g., to check the variance due to the calculations) will get different results. A natural solution to both of these constraints is to randomly define seeds in R and then pass those seeds to the C++ code for seeding and further random number generation.

In this context, it is sensible to have generateRawSeeds() as an R function. I can use this function in my package functions to generate seeds in a manner that responds to user-specified set.seed(). As an added bonus, it also allows users in interactive sessions to call generateRawSeeds() to seed dqrng's R-visible distribution functions with 64 bits of randomness. (We would possibly want to set bits=64 by default.)

It is possible to translate this into a C++ function, and just have the R function as a wrapper. This would allow generation of raw-vector seeds from C++ as well, which might be desirable in some settings. I have not done so because... well, I guess I didn't really need it. A direct translation will also have some issues with replacing sample with std::shuffle due to the latter's lack of reproducibility across platforms. ~One could use boost::random::uniform_int_distribution if R's API exposed the raw PRNG... which it doesn't seem to. In scran, I have added my own C++ shuffling algorithm to ensure that I get the same result across platforms, but this was more of a change than I was willing to make here without further approval.~ It should be possible to do something fairly simple with R::unif_rand() * 256 if this is considered to be necessary.
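
A minimal sketch of that last idea (random_byte is a hypothetical helper; it assumes R's RNG state is active, which Rcpp attribute-exported callers arrange automatically):

#include <Rcpp.h>

// Hypothetical sketch: draw one random byte from R's RNG, following the
// R::unif_rand() * 256 idea above. unif_rand() returns a double in (0, 1),
// so the truncated product lies in 0..255.
Rbyte random_byte() {
    double u = R::unif_rand();
    int b = static_cast<int>(u * 256.0);
    if (b > 255) b = 255;   // defensive guard; should not trigger
    return static_cast<Rbyte>(b);
}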

rstub commented 5 years ago

Let's see what's currently implemented:

I see two additional use cases in what you are describing.

Seeding with more than 32 bits

You are correctly pointing out that seeding with only 32 bits is not very good. I would still prefer it, though, if the user could make some sense of the command. Something like

dqset.seed(c(4711, 42))

could be used to specify two (or more) integers. This is the sort of usage that I am missing when using raw vectors.

The integer vector might then either be used directly or fed to seed_seq_fe to provide the seeds. Alternatively, the bit64 package could be used. This provides (signed) 64-bit integers for R users by reinterpreting the bit pattern of doubles. Using them is quite simple:

#include <cstdint>
#include <cstring>   // for std::memcpy
#include <Rcpp.h>

// [[Rcpp::export]]
void printSeed1(Rcpp::NumericVector seed_vec) {
    uint64_t _seed;
    double seed;
    static_assert(sizeof seed == sizeof _seed, "Size does not match!");
    for (int i = 0; i < seed_vec.length(); ++i) {
        seed = seed_vec(i);
        std::memcpy(&_seed, &seed, sizeof seed);
        Rcpp::Rcout << "double: " << seed << std::endl;
        Rcpp::Rcout << "uint64: " << _seed << std::endl;
    }
}

// [[Rcpp::export]]
void printSeed2(Rcpp::NumericVector seed_vec) {
    int64_t _seed;
    double seed;
    static_assert(sizeof seed == sizeof _seed, "Size does not match!");
    for (int i = 0; i < seed_vec.length(); ++i) {
        seed = seed_vec(i);
        std::memcpy(&_seed, &seed, sizeof seed);
        Rcpp::Rcout << "double: " << seed << std::endl;
        Rcpp::Rcout << "int64: " << _seed << std::endl;
    }
}

/*** R
library(bit64)
seed <- as.integer64(c("1234567890987654321", "12345", "-1234567890987654321"))
seed
unclass(seed)
printSeed1(seed)
printSeed2(seed)
*/

Output:

> library(bit64)

> seed <- as.integer64(c("1234567890987654321", "12345", "-1234567890987654321"))

> seed
integer64
[1] 1234567890987654321  12345                -1234567890987654321

> unclass(seed)
[1]  3.813124e-226  6.099240e-320 -1.107996e+226

> printSeed1(seed)
double: 3.81312e-226
uint64: 1234567890987654321
double: 6.09924e-320
uint64: 12345
double: -1.108e+226
uint64: 17212176182721897295

> printSeed2(seed)
double: 3.81312e-226
int64: 1234567890987654321
double: 6.09924e-320
int64: 12345
double: -1.108e+226
int64: -1234567890987654321

So an integer64 vector is actually a double vector with an additional class and can be moved to C++ as an Rcpp::NumericVector. Reinterpreting these doubles as signed 64-bit integers gives us the original interpretation in R. Reinterpreting them as unsigned 64-bit integers gives a different interpretation for negative/large values, but that is more useful when using them as a seed.

Overall I would prefer using an IntegerVector, a single integer64 or an integer64 vector as input for dqset.seed.

Seeding using R's RNG as source

You want some randomize function which uses R's RNG to provide you with a new seed for the RNG in use. Getting a random uint32_t from R is quite easy at the C++ level:

uint32_t random32() {
  return R::runif(0, 1) * 4294967296; /* 2^32 */
}

This works correctly for the default Mersenne-Twister. For other RNGs implemented in R, this will probably miss some possible integer values, since the RNG does not provide 32 bits of randomness. I don't see an easy way to solve this, though. So I would find it easiest to have a C++ function which calls R's RNG to provide whatever extended seed structure is used. That structure can then either be returned to R, so that the user can save the values for reproducing the results in the future, or it can be fed to the RNGs, just like above.
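
A minimal sketch of that idea, repeating the random32() helper so the snippet is self-contained; extendedDraw and the use of std::mt19937_64 as a stand-in for the RNG in use are my own illustration, not dqrng's API:

// [[Rcpp::plugins(cpp11)]]
#include <cstdint>
#include <random>
#include <Rcpp.h>

static uint32_t random32() {
  return R::runif(0, 1) * 4294967296; /* 2^32 */
}

// Hypothetical sketch: combine two 32-bit draws from R's RNG into one
// 64-bit seed and feed it to a 64-bit engine. The exported wrapper keeps
// R's RNG state active while random32() is called.
// [[Rcpp::export]]
double extendedDraw() {
  uint64_t seed = (static_cast<uint64_t>(random32()) << 32) | random32();
  std::mt19937_64 rng(seed);   // stand-in for the RNG actually in use
  return std::generate_canonical<double, 53>(rng);
}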

BTW, I like the idea of a randomize function, since I could use that instead of std::random_device. That comes from the random header, which one is not supposed to use in R extensions, and the analogue from boost requires linking, so the BH package is not enough. Of course, this would also mean that I should no longer use Mersenne-Twister as the default RNG, since using MT (the default in R) to generate seeds for MT is considered a bad idea. But then there are other reasons to remove MT: currently it also comes from the random header, and the version from boost misses a few constexpr. In addition, all other RNGs support multiple streams in some way or other, which I would like to make available from R. With MT one can only use poor-man's streams, i.e. consecutive seeds, and hope for the best ...

This is motivated by the usage of PRNGs in R packages. As an R package developer, I use dqrng's PRNGs in my C++ code, seeded to ensure reproducible execution. However, I also want my functions to respond to the state of R's seed, so that users running my functions multiple times (e.g., to check the variance due to the calculations) will get different results. A natural solution to both of these constraints is to randomly define seeds in R and then pass those seeds to the C++ code for seeding and further random number generation.

Hmm, I do not quite understand what you mean by that. Do you create a new RNG on every function call?

BTW, is the source code of these two packages on GitHub or similar. I am curious how you are using the package. :-)

LTLA commented 5 years ago

Thanks for the comments @rstub.

Using raw vectors as the seed

I can empathize with your wariness towards raw vectors, as I felt the same way when I first saw them. (What's this 0e f6 a3 nonsense? Must be a segmentation fault!) But I have grudgingly come around to concede that they are useful in some cases. I'll start by pointing out that you can directly create raw-vector seeds:

dqset.seed(as.raw(c(0x10, 0x20, 0x30, 0x40)))

This is about as interpretable as an integer-vector seed; the values aren't very user-friendly in either case. Yes, it's more typing, but someone going for an easy manual initialization should just supply an integer scalar directly; they'll never do it enough times to run into statistical biases from the limited state space.

If manual usability is critical, integer vectors are probably better than the bit64 approach. The latter requires an additional Depends and doesn't allow us to avoid using vectors (e.g., if you're seeding one of the PRNGs with 128 bits of state in your package's C++ code). As you say, sampling these integers across their full range would be relatively easy in C++, though passing them to R would require some care with the signed/unsigned transition. And I, at least, do want some method of getting the random seeds in my R session - see below.

I'm also surprised that bit64 involves reinterpreting a double like that, I thought that would be implementation-specific behaviour. Maybe it's always the same on the platforms on which R compiles...

Using R's RNG to define the seed

The clearest example is probably in DropletUtils in the emptyDrops function. This operates on single-cell genomics data where the C++ code loops across cells and does a series of Monte Carlo simulations to compute a per-cell p-value. Normally, I would just use R's PRNG within C++, i.e., initialize the PRNG at the start of the C++ function and run through all cells without resetting the PRNG or anything.

The problem is that this code is potentially parallelized at the R level. This is achieved via the bplapply call (from the BiocParallel package), with a variety of parallelization schemes defined by the BPPARAM argument. This means that I get a different series of random numbers depending on the parallelization scheme, i.e., the number of workers and how the jobs are partitioned across those workers. Obviously, this is not desirable, as I would like to get the same results regardless of the chosen parallelization.

The solution is to seed a fresh PRNG for each cell. This guarantees that I get the same results regardless of how the jobs (one job per cell) are distributed across workers, as each cell's series of random numbers is unrelated to that of any other cell. Moreover, I'm using PCG32 with different streams, which provides an added level of reassurance with respect to statistical independence. The construction of the PRNG is done within the cell-by-cell loop in C++, but the generation of the seeds is still performed in the R code where (i) it can be affected by set.seed() and (ii) it is still serially executed and thus avoids weird changes in seeds in workers under some parallelization schemes, e.g., those using cluster job schedulers.
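
A minimal sketch of that per-cell pattern (not the actual DropletUtils code; job_monte_carlo and its arguments are hypothetical), assuming the pcg32 header shipped with dqrng is available as in the examples further below:

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <cstdint>
#include "pcg_random.hpp"

// Hypothetical sketch: each job (cell) constructs its own pcg32 engine
// from an R-generated seed and a job-specific stream, so the results do
// not depend on how jobs are distributed across workers.
double job_monte_carlo(uint64_t seed, uint64_t stream, int iterations) {
    pcg32 rng(seed, stream);           // fresh engine for this job
    double acc = 0;
    for (int i = 0; i < iterations; ++i) {
        acc += rng() / 4294967296.0;   // crude uniform draw in [0, 1)
    }
    return acc / iterations;
}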

This approach does, however, rely on a lot of re-seeding, which means that statistical biases may manifest in the random numbers if I can only seed a part of the state space. Which brings us to #10 and this issue.

LTLA commented 5 years ago

If raw vectors are too unpalatable, I just did a refactor to use integer vectors - this is on the integerseed branch of my fork here. As anticipated, some work was required to guarantee a sensible uint32_t to int conversion, which would otherwise be implementation-defined if we left it to a static_cast. The relevant file is convert_seed.h, which contains all the new goodies.

rstub commented 5 years ago

Thanks for your comments, @LTLA. You almost had me convinced that raw vectors are a good idea. However, your code triggered my urge to experiment with Melissa O'Neill's seed_seq_fe mentioned above. And I think this looks quite promising:

Example:

// [[Rcpp::plugins(cpp11)]]
#include <algorithm>   // for std::generate
#include <Rcpp.h>
#include "randutils.hpp"

int32_t R_random_32 () {
    constexpr double upper_limit=2147483648;
    double val=R::unif_rand() * upper_limit * 2 - upper_limit;
    if (val >= upper_limit || val < -upper_limit) { val=-upper_limit; } // Absolutely avoid over-/underflow.
    return static_cast<int32_t>(val);
}

// [[Rcpp::export]]
Rcpp::IntegerVector getParam() {
  Rcpp::IntegerVector raw_seeds(4);
  std::generate(raw_seeds.begin(), raw_seeds.end(), R_random_32);
  randutils::seed_seq_fe128 seeder(raw_seeds.begin(), raw_seeds.end());

  Rcpp::IntegerVector param(4);
  seeder.param(param.begin());
  return param;
}

// [[Rcpp::export]]
Rcpp::IntegerVector getSeeds(int n,
                 Rcpp::Nullable<Rcpp::IntegerVector> param = R_NilValue) {
  Rcpp::IntegerVector raw_seeds(4);
  if (param.isNull()) {
    std::generate(raw_seeds.begin(), raw_seeds.end(), R_random_32);
  } else {
    raw_seeds = param.as();
  }
  randutils::seed_seq_fe128 seeder(raw_seeds.begin(), raw_seeds.end());

  Rcpp::IntegerVector seeds(n);
  seeder.generate(seeds.begin(), seeds.end());
  return seeds;
}

/*** R
set.seed(42)
getSeeds(6)
getSeeds(6)

set.seed(42)
getSeeds(6)
getSeeds(6)

set.seed(42)
one <- getParam()
two <- getParam()
getSeeds(6, one)
getSeeds(6, two)

getSeeds(6, 42)
getSeeds(6, c(4711, 42))
 */

Output:

> set.seed(42)
> getSeeds(6)
[1] -1368039275 -1100785627  1327213203  1033422429  1573423832   475391043
> getSeeds(6)
[1]  909206077 1132750064  872894922  151901752 1480135365 -947383956

> set.seed(42)
> getSeeds(6)
[1] -1368039275 -1100785627  1327213203  1033422429  1573423832   475391043
> getSeeds(6)
[1]  909206077 1132750064  872894922  151901752 1480135365 -947383956

> set.seed(42)
> one <- getParam()
> two <- getParam()
> getSeeds(6, one)
[1] -1368039275 -1100785627  1327213203  1033422429  1573423832   475391043
> getSeeds(6, two)
[1]  909206077 1132750064  872894922  151901752 1480135365 -947383956

> getSeeds(6, 42)
[1]  -850130249 -1625411987  2046530742  -713526308  1691623607  2099784219
> getSeeds(6, c(4711, 42))
[1]  1485050188  -994439850   532523273 -1671404795 -1234644656   810452082

Right now I am unsure, though, how that handles the unsigned-to-signed conversion. So the above results might be implementation-specific (at least until C++20 is used, as I just learned). BTW, did you consider std::memcpy for that?

Concerning your parallel RNG usage:

The solution is to seed a fresh PRNG for each cell. This guarantees that I get the same results regardless of how the jobs (one job per cell) are distributed across workers, as each cell's series of random numbers is unrelated to that of any other cell. Moreover, I'm using PCG32 with different streams, which provides an added level of reassurance with respect to statistical independence. The construction of the PRNG is done within the cell-by-cell loop in C++, but the generation of the seeds is still performed in the R code where (i) it can be affected by set.seed() and (ii) it is still serially executed and thus avoids weird changes in seeds in workers under some parallelization schemes, e.g., those using cluster job schedulers.

I am not sure that this is the way PCG should be used in parallel. My understanding is that one would use the same seed for all the RNGs. The different workers would then do one of two things: use a different stream per worker, or skip ahead ("jump") by a different amount on a common stream.

In both cases the random numbers generated should be independent, although skipping ahead is probably impractical given the (rather short) period of pcg32. Using different seeds is similar to skipping ahead, but one does not know how many numbers one can draw before the sequences overlap. You are mitigating this by additionally using different streams. But the different streams alone should be enough.

LTLA commented 5 years ago

Regarding seed_seq_fe: I don't have much of an opinion here. It seems cool but I was happy to rely on R's MT stream being "random enough" for seeding purposes. Happy to go with whatever you decide.

BTW, did you consider std::memcpy for that?

Briefly. But I didn't know whether a memcpy from an unsigned to signed type would behave in a standard way, and I didn't want to take responsibility for introducing difficult-to-track implementation-specific behaviour.

But the different streams alone should be enough.

That's probably true. I don't think it hurts to have different seeds either, but I'm willing to be told otherwise. (Do you know someone on the PCG team who could give an authoritative answer?) It would make it slightly easier for me if I just had to create one PCG seed vector in my functions.... but not that much, and it would have to be done in the serially executed part of R anyway.

LTLA commented 5 years ago

Edit, 5 hours later: I spent some time thinking about the use of seed_seq_fe here - and I don't think it's necessary. From what I understand, the seed sequence is intended to "inflate" a small amount of entropy to get enough bits for seeding. But here, R's MT stream is already giving us decent random numbers for seeding. There's no need to complicate the dqrng API with extra work and bug potential, e.g., signed vs. unsigned, how to guarantee randomness/full state space exploration when more than four integers are requested.

Perhaps the use of the seed sequence would be useful for handling the case where dqset.seed() is supplied with a single integer. Otherwise, the current changes in #10 (now switched to using integer vectors, as suggested) seem sufficient for all but the most advanced PRNG users.
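
If the single-integer case were handled with a seed sequence, it might look like this minimal sketch (std::seed_seq stands in for seed_seq_fe; expand_seed is a hypothetical helper):

#include <cstdint>
#include <random>

// Hypothetical sketch: expand one user-supplied integer into a 64-bit
// seed via a seed sequence (std::seed_seq standing in for seed_seq_fe).
uint64_t expand_seed(int user_seed) {
    std::seed_seq seq{user_seed};
    std::uint32_t words[2];
    seq.generate(words, words + 2);
    return (static_cast<uint64_t>(words[0]) << 32) | words[1];
}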

rstub commented 5 years ago

I spent some time thinking about the use of seed_seq_fe here - and I don't think it's necessary. From what I understand, the seed sequence is intended to "inflate" a small amount of entropy to get enough bits for seeding. But here, R's MT stream is already giving us decent random numbers for seeding.

ACK. It might make sense for seeding MT, but MT support in dqrng is going to be removed anyway.

I don't think it hurts to have different seeds either, but I'm willing to be told otherwise. (Do you know someone on the PCG team who could give an authoritative answer?)

Unfortunately I have no such contact. But I have seen the issue you raised in the GitHub repository. Let's hope you will get an answer there. The blog article you linked is indeed rather unclear. Section 4.3.2 of her paper reads quite differently:

Every choice of c [the additive constant in the LCG] results in a different sequence of numbers that has none of its pairs of successive outputs in common with another sequence.

That's why I used a constant seed and consecutive stream IDs in the parallel vignette.
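
For concreteness, that pattern amounts to something like this minimal sketch (make_workers is a hypothetical name, not the vignette's code):

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <cstdint>
#include <vector>
#include "pcg_random.hpp"

// Hypothetical sketch: one constant seed, consecutive stream IDs. Distinct
// pcg32 streams share no pair of successive outputs, which is the
// independence property quoted above.
std::vector<pcg32> make_workers(uint64_t seed, int n_workers) {
    std::vector<pcg32> engines;
    for (int i = 0; i < n_workers; ++i) {
        engines.emplace_back(seed, static_cast<uint64_t>(i)); // stream i
    }
    return engines;
}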

rstub commented 5 years ago

Fixed in #10.

LTLA commented 5 years ago

For posterity's sake, I had a go at using dqrng with multiple streams vs multiple seeds:

streamer.cpp

// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>

// [[Rcpp::depends(dqrng)]]
#include "convert_seed.h"
#include "pcg_random.hpp"

// [[Rcpp::export(rng=false)]]
Rcpp::IntegerVector spew_raw_numbers(Rcpp::IntegerVector seed, int stream, int N=10000) {
    pcg32 rng(dqrng::convert_seed<uint64_t>(seed), stream);
    Rcpp::IntegerVector output(N);

    for (auto& o : output) {
        uint32_t sampled=rng();

        // Converting uint to int.
        constexpr uint32_t max_int=2147483647;
        if (sampled <= max_int) {
            o=static_cast<int>(sampled);
        } else {
            constexpr uint32_t max_uint=-1;
            o=-static_cast<int>(max_uint - sampled) - 1;
        }
    }

    return output;
}

R code

Rcpp::sourceCpp("streamer.cpp")

collected0 <- collected1 <- collected2 <- numeric(1000)
for (i in seq_along(collected1)) {
    out <- dqrng::generateSeedVectors(2)

    x1_1 <- spew_raw_numbers(out[[1]], 1, 20000)
    x1_1a <- head(x1_1, 10000)
    x1_1b <- tail(x1_1, 10000)

    x2_2 <- spew_raw_numbers(out[[2]], 2)
    x1_2 <- spew_raw_numbers(out[[1]], 2)

    # Switch functions to test different types of correlations:
#    FUN <- cor
#    FUN <- function(...) cor(..., method="spearman")
    FUN <- function(...) max(abs(ccf(..., plot=FALSE)$acf))

    collected0[i] <- FUN(x1_1a, x1_1b)
    collected1[i] <- FUN(x1_1a, x2_2)
    collected2[i] <- FUN(x1_1a, x1_2)
}

mean(abs(collected0))
mean(abs(collected1))
mean(abs(collected2))

max(abs(collected0))
max(abs(collected1))
max(abs(collected2))

Yeah, it's not very sophisticated, but it's the best I've got without further knowledge of the failure modes of these PRNGs. I really wish someone would chime in on imneme/pcg-cpp#49...

tl;dr

For the amount of random numbers I'm generating, there are no observable differences in mean or maximum correlation from a jump-ahead approach on a single stream compared to multiple streams with the same seed or multiple streams with different seeds. So both our approaches are probably fine. Probably.