Closed rstub closed 5 years ago
On this note: you are already aware of this, but this directly rebukes our method for sampling random 32-bit integers. It might be worth asking R core to expose the Mersenne Twister stream directly (i.e., give us a function that just yields 32-bit integers). This would only be fair if they're not letting us use <random>
.
I would not say "rebukes our method", since sampling in R core and generating random seeds have slightly different constraints: The different RNGs in R produce floating point numbers in (0,1)
(R does not use [0,1)
) in slightly different ways. Most of them generate random integers in [0, MAX]
and divide by MAX + 1
("Wichmann-Hill" is the exception here). For Mersenne-Twister MAX
is UINT32_MAX
, but for other RNGs MAX
is smaller. For the two RNGs from TAOCP MAX
is only 2^30-1!
If one where to multiply a floating point number generated this way with 2^32, this should be no problem when MT (and probably WH) is used. All possible unit32_t
are generated with equal probability. For the other RNGs, there will be a few uint32_t
that cannot be generated. For the two RNGs from TAOCP, many uint32_t
cannot be generated. For sampling in R core, which has to work properly with any supported RNG, this would be disastrous. I think that's why they decided to use two random draws to generate 32 bits.
In our case, I think it is not so bad. For the default MT (and probably WH) the full state space of the RNGs can be reached. For most other RNGs, there are only a few states (between one and about three hundred) that cannot be reached. For the two RNGs from TAOCP the possible state space would be significantly reduced, e.g. 2^60 instead of 2^64 possible states. For the task at hand this still seems to be sufficient from my point of view. Anyway, there are still some thing one could do:
generateSeedVectors
that issues a warning
or message
when an unsuitable RNG is used. R_random_u32
, i.e. generate 32 bits from two random draws.I think 1. would be a good idea and 3. seems unnecessary. I am undecided when it comes to 2.
If the linked PDF is using MT, we would expect a maximum possible bias of ~2-fold in terms of the ratio between the frequencies of the most observed and least observed 32-bit integer. This seems... pretty bad.
Another approach would be to assume that R core will fix R_unif_index
, and use that instead in our R_random_u32
. Then it wouldn't be our problem anymore.
I am not sure where one would get a bias in our case. MT produces any 32 bit integer with equal probability. This gets converted to a double by R, and we convert it back to a 32 bit integer. This conversion is exact, as demonstrated by the following program, which tries the conversion for all possible 32 bit integers:
#include <cstdint>
#include <iostream>
double to_double(uint32_t i) {
return (double)i * 2.3283064365386963e-10; // c.f. MT_genrand
}
uint32_t to_uint(double d) { // c.f. R_random_u32
constexpr double upper_limit = 4294967296;
double val = d * upper_limit;
if (val >= upper_limit) { val=0; } // Absolutely avoid overflow.
return static_cast<uint32_t>(val);
}
int main() {
uint32_t i = 0;
do {
double d = to_double(i);
uint32_t j = to_uint(d);
if (i != j) {
std::cout << i << " and " << j << " are not the same" << std::endl;
return 1;
}
if ( i % (1024*1024*512) == 0 )
std::cout << "Reached: " << i << std::endl;
++i;
} while (i != 0);
std::cout << "All good" << std::endl;
return 0;
}
Output:
Reached: 0
Reached: 536870912
Reached: 1073741824
Reached: 1610612736
Reached: 2147483648
Reached: 2684354560
Reached: 3221225472
Reached: 3758096384
All good
The bias that results from the float to int conversion comes into play when the multiplicative constant does not evenly divide the number of different floating point numbers, c.f. https://blog.daqana.com/en/dqsample-a-bias-free-alternative-to-rs-sample-function/.
Another approach would be to assume that R core will fix
R_unif_index
, and use that instead in ourR_random_u32
. Then it wouldn't be our problem anymore.
That's an excellent idea! R_unif_index
is fixed in R-devel (will be 3.6.0). And for prior versions (or Sample_kind == ROUNDING
), it at least eliminates the problem with the two RNGs from TAOCP, since there already two random draws are used.
If you do decide to go with R_unif_index
, it would be a good idea to check in with R core regarding whether it is an recognized and supported entry point into the C API. I don't see it mentioned in "Writing R extensions", even though it is available from R_ext/Random.h
, so it would be best to confirm its (continued) availability. At the very least, R core will know that we're using it, so we'll get on their radar.
I have asked on r-devel concerning R_unif_index
. No response so far.
fixed in #23 and #24
BTW, R_unif_index
became part of the official API.
Prerequisite for #18