unweighted sampling - Githubissues

rstub commented 5 years ago

Prerequisite for #18

Update the code from https://github.com/daqana/dqrng/tree/feature/sample
Add long-vector support
Add hashing algorithm
- Evaluate https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
- Evaluate https://stackoverflow.com/a/28287865/8416610

LTLA commented 5 years ago

On this note: you are already aware of this, but this directly rebukes our method for sampling random 32-bit integers. It might be worth asking R core to expose the Mersenne Twister stream directly (i.e., give us a function that just yields 32-bit integers). This would only be fair if they're not letting us use <random>.

rstub commented 5 years ago

I would not say "rebukes our method", since sampling in R core and generating random seeds have slightly different constraints: The different RNGs in R produce floating point numbers in (0,1) (R does not use [0,1)) in slightly different ways. Most of them generate random integers in [0, MAX] and divide by MAX + 1 ("Wichmann-Hill" is the exception here). For Mersenne-Twister MAX is UINT32_MAX, but for other RNGs MAX is smaller. For the two RNGs from TAOCP MAX is only 2^30-1!

If one where to multiply a floating point number generated this way with 2^32, this should be no problem when MT (and probably WH) is used. All possible unit32_t are generated with equal probability. For the other RNGs, there will be a few uint32_t that cannot be generated. For the two RNGs from TAOCP, many uint32_t cannot be generated. For sampling in R core, which has to work properly with any supported RNG, this would be disastrous. I think that's why they decided to use two random draws to generate 32 bits.

In our case, I think it is not so bad. For the default MT (and probably WH) the full state space of the RNGs can be reached. For most other RNGs, there are only a few states (between one and about three hundred) that cannot be reached. For the two RNGs from TAOCP the possible state space would be significantly reduced, e.g. 2^60 instead of 2^64 possible states. For the task at hand this still seems to be sufficient from my point of view. Anyway, there are still some thing one could do:

Document that best results are achieved when R's default RNG MT is used.
Provide an R wrapper for generateSeedVectors that issues a warning or message when an unsuitable RNG is used.
Use R core's method in R_random_u32, i.e. generate 32 bits from two random draws.

I think 1. would be a good idea and 3. seems unnecessary. I am undecided when it comes to 2.

LTLA commented 5 years ago

If the linked PDF is using MT, we would expect a maximum possible bias of ~2-fold in terms of the ratio between the frequencies of the most observed and least observed 32-bit integer. This seems... pretty bad.

Another approach would be to assume that R core will fix R_unif_index, and use that instead in our R_random_u32. Then it wouldn't be our problem anymore.

rstub commented 5 years ago

I am not sure where one would get a bias in our case. MT produces any 32 bit integer with equal probability. This gets converted to a double by R, and we convert it back to a 32 bit integer. This conversion is exact, as demonstrated by the following program, which tries the conversion for all possible 32 bit integers:

#include <cstdint>
#include <iostream>

double to_double(uint32_t i) {
    return (double)i * 2.3283064365386963e-10; // c.f. MT_genrand
}

uint32_t to_uint(double d) { // c.f. R_random_u32
    constexpr double upper_limit = 4294967296;
    double val = d * upper_limit;
    if (val >= upper_limit) { val=0; } // Absolutely avoid overflow.
    return static_cast<uint32_t>(val);
}

int main() {
    uint32_t i = 0;
    do {
        double d = to_double(i);
        uint32_t j = to_uint(d);
        if (i != j) {
            std::cout << i << " and " << j << " are not the same" << std::endl;
            return 1;
        }
        if ( i % (1024*1024*512) == 0 )
            std::cout << "Reached: " << i << std::endl;
        ++i;
    } while (i != 0);
    std::cout << "All good" << std::endl;
    return 0;
}

Output:

Reached: 0
Reached: 536870912
Reached: 1073741824
Reached: 1610612736
Reached: 2147483648
Reached: 2684354560
Reached: 3221225472
Reached: 3758096384
All good

The bias that results from the float to int conversion comes into play when the multiplicative constant does not evenly divide the number of different floating point numbers, c.f. https://blog.daqana.com/en/dqsample-a-bias-free-alternative-to-rs-sample-function/.

Another approach would be to assume that R core will fix R_unif_index, and use that instead in our R_random_u32. Then it wouldn't be our problem anymore.

That's an excellent idea! R_unif_index is fixed in R-devel (will be 3.6.0). And for prior versions (or Sample_kind == ROUNDING), it at least eliminates the problem with the two RNGs from TAOCP, since there already two random draws are used.

LTLA commented 5 years ago

If you do decide to go with R_unif_index, it would be a good idea to check in with R core regarding whether it is an recognized and supported entry point into the C API. I don't see it mentioned in "Writing R extensions", even though it is available from R_ext/Random.h, so it would be best to confirm its (continued) availability. At the very least, R core will know that we're using it, so we'll get on their radar.

rstub commented 5 years ago

I have asked on r-devel concerning R_unif_index. No response so far.

rstub commented 5 years ago

fixed in #23 and #24

BTW, R_unif_index became part of the official API.

daqana / dqrng

unweighted sampling #21