dhardy / rand

http://doc.rust-lang.org/rand

Shootout: small, fast PRNGs #52

Open dhardy opened 6 years ago

dhardy commented 6 years ago

We've already had a lot of discussion on this. Let's summarise algorithms here.

This is about small, fast PRNGs. Speed, size and performance in tests like PractRand and TestU01 are of interest here; cryptographic quality is not.


Edit: see @pitdicker's work in #60.

vks commented 6 years ago

I have some benchmarks:

test rand_bytes_aes           ... bench:   8,389,735 ns/iter (+/- 273,533) = 124 MB/s              
test rand_bytes_chacha        ... bench:   3,691,665 ns/iter (+/- 101,794) = 284 MB/s              
test rand_bytes_isaac         ... bench:   2,043,659 ns/iter (+/- 93,872) = 513 MB/s               
test rand_bytes_isaac64       ... bench:   1,272,982 ns/iter (+/- 79,964) = 823 MB/s               
test rand_bytes_splitmix      ... bench:     245,919 ns/iter (+/- 12,513) = 4263 MB/s              
test rand_bytes_xoroshiro     ... bench:     334,576 ns/iter (+/- 24,947) = 3134 MB/s              
test rand_bytes_xoroshirostar ... bench:     430,575 ns/iter (+/- 64,488) = 2435 MB/s              
test rand_bytes_xorshift      ... bench:   1,306,342 ns/iter (+/- 95,741) = 802 MB/s               
test rand_bytes_xorshift1024  ... bench:     364,495 ns/iter (+/- 21,676) = 2876 MB/s              
test rand_u64_aes             ... bench:   6,385,340 ns/iter (+/- 289,658) = 125 MB/s              
test rand_u64_chacha          ... bench:   2,224,053 ns/iter (+/- 130,170) = 359 MB/s              
test rand_u64_isaac           ... bench:   1,038,451 ns/iter (+/- 93,778) = 770 MB/s               
test rand_u64_isaac64         ... bench:     370,893 ns/iter (+/- 27,691) = 2156 MB/s              
test rand_u64_splitmix        ... bench:     108,328 ns/iter (+/- 6,140) = 7384 MB/s               
test rand_u64_xoroshiro       ... bench:      85,963 ns/iter (+/- 4,069) = 9306 MB/s               
test rand_u64_xoroshirostar   ... bench:      98,519 ns/iter (+/- 4,330) = 8120 MB/s               
test rand_u64_xorshift        ... bench:     212,814 ns/iter (+/- 6,618) = 3759 MB/s               
test rand_u64_xorshift1024    ... bench:     274,191 ns/iter (+/- 7,599) = 2917 MB/s

However, I don't trust them that much, because I experienced systematic differences just by reordering the benchmarks.

dhardy commented 6 years ago

Added XorShiftRng. PractRand results:

$ target/debug/examples/cat_rng xorshift | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0xa4893a88
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0xa4893a88
length= 64 megabytes (2^26 bytes), time= 3.3 seconds
  Test Name                         Raw       Processed     Evaluation
  BRank(12):256(2)                  R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  BRank(12):384(1)                  R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  BRank(12):512(2)                  R=+11541  p~=  2e-3475    FAIL !!!!!!!!  
  BRank(12):768(1)                  R=+13672  p~=  1e-4116    FAIL !!!!!!!!  
  BRank(12):1K(1)                   R=+19183  p~=  1e-5775    FAIL !!!!!!!!  
  [Low1/8]BRank(12):128(2)          R= +1312  p~=  5.4e-396   FAIL !!!!!!!   
  [Low1/8]BRank(12):256(2)          R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low1/8]BRank(12):384(1)          R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  [Low1/8]BRank(12):512(1)          R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low4/32]BRank(12):256(2)         R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low4/32]BRank(12):384(1)         R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  [Low4/32]BRank(12):512(1)         R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low1/32]BRank(12):256(2)         R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low1/32]BRank(12):384(1)         R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  ...and 125 test result(s) without anomalies
dhardy commented 6 years ago

@vks quite possibly you should call black_box on your buf after this line.
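For illustration, a sketch of what that could look like with the nightly test crate (a hypothetical benchmark, assuming the published rand 0.4 API rather than this fork):

#![feature(test)]
extern crate rand;
extern crate test;

use rand::{Rng, XorShiftRng};
use test::{black_box, Bencher};

#[bench]
fn rand_bytes_xorshift(b: &mut Bencher) {
    let mut rng = XorShiftRng::new_unseeded();
    let mut buf = [0u8; 1024];
    b.bytes = buf.len() as u64;
    b.iter(|| {
        rng.fill_bytes(&mut buf);
        // Without this, the optimizer may decide the filled buffer is unused
        // and elide the work being measured.
        black_box(&buf[..]);
    });
}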

But anyway this is about much more than just speed. Which in particular do you think I should test?

vks commented 6 years ago

I think the most interesting ones are xoroshiro128+ and xorshift1024*, as suggested by Vigna. I don't have an opinion on cryptographic generators.

It is important to understand what the binary rank test does when judging the failures, see http://xoroshiro.di.unimi.it/:

In all papers describing these generators, the authors (including Marsaglia himself) consider the failure of the binary-rank test irrelevant from a practical viewpoint, as the output of the generator is used as a number, not as a vector on Z/2Z.

(I did not look up Marsaglia's paper to verify this claim.)

dhardy commented 6 years ago

I adapted your crate to use rand_core and ran some tests.

Interestingly SplitMix64 does very well on PractRand. I should look at running TestU01 too.

In light of your arguments about binary rank tests, it might be interesting to try PractRand with those disabled (if it has other useful tests). But I am not convinced we should ignore them entirely.

Test output:

14:40 dhardy@localhost:~/rust/rand$ target/release/examples/cat_rng SplitMix64 | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0x3be4c794
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0x3be4c794
length= 256 megabytes (2^28 bytes), time= 3.4 seconds
  no anomalies in 162 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 512 megabytes (2^29 bytes), time= 7.2 seconds
  no anomalies in 171 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 1 gigabyte (2^30 bytes), time= 14.5 seconds
  no anomalies in 183 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 2 gigabytes (2^31 bytes), time= 28.5 seconds
  no anomalies in 194 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 4 gigabytes (2^32 bytes), time= 55.7 seconds
  no anomalies in 203 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 8 gigabytes (2^33 bytes), time= 111 seconds
  no anomalies in 215 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 16 gigabytes (2^34 bytes), time= 219 seconds
  no anomalies in 226 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 32 gigabytes (2^35 bytes), time= 434 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R=  +4.9  p =  2.8e-3   unusual          
  ...and 234 test result(s) without anomalies

^C
14:50 dhardy@localhost:~/rust/rand$ target/release/examples/cat_rng XoroShiro128 | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0xa6ddb56d
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0xa6ddb56d
length= 256 megabytes (2^28 bytes), time= 3.4 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]BRank(12):384(1)         R= +1272  p~=  5.4e-384   FAIL !!!!!!!   
  [Low1/32]BRank(12):512(1)         R= +2650  p~=  9.8e-799   FAIL !!!!!!!   
  ...and 160 test result(s) without anomalies

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 32, message: "Broken pipe" } }', /checkout/src/libcore/result.rs:860:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.
14:50 dhardy@localhost:~/rust/rand$ target/release/examples/cat_rng XorShift1024 | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0x1575d306
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0x1575d306
length= 256 megabytes (2^28 bytes), time= 3.4 seconds
  no anomalies in 162 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 512 megabytes (2^29 bytes), time= 7.2 seconds
  no anomalies in 171 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 1 gigabyte (2^30 bytes), time= 14.6 seconds
  no anomalies in 183 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 2 gigabytes (2^31 bytes), time= 28.9 seconds
  no anomalies in 194 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 4 gigabytes (2^32 bytes), time= 56.0 seconds
  no anomalies in 203 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 8 gigabytes (2^33 bytes), time= 112 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]FPF-14+6/16:cross         R=  +4.7  p =  2.7e-4   unusual          
  ...and 214 test result(s) without anomalies

rng=RNG_stdin, seed=0x1575d306
length= 16 gigabytes (2^34 bytes), time= 221 seconds
  no anomalies in 226 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 32 gigabytes (2^35 bytes), time= 436 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/32]BCFN(2+1,13-0,T)         R=  +7.4  p =  1.7e-3   unusual          
  [Low1/32]BRank(12):3K(1)          R=+10916  p~=  3e-3287    FAIL !!!!!!!!  
  ...and 233 test result(s) without anomalies

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 32, message: "Broken pipe" } }', /checkout/src/libcore/result.rs:860:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.
vks commented 6 years ago

Splitmix64 is recommended by Vigna for seeding xoroshiro128+ and xorshift1024*. It works with any seed (even zero), but it has a period of 2^64 which might be a bit short.
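For reference, the SplitMix64 step is tiny; a sketch following Vigna's public-domain splitmix64.c:

/// One SplitMix64 step: `state` is the 64-bit seed, advanced by a Weyl
/// sequence, and the output is a scrambled copy of it.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}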

pitdicker commented 6 years ago

I have done a lot of experiments on Xoroshiro128+, SplitMix etc.

Xoroshiro128+ is always the fastest, with a margin of up to 20%. But it has several weaknesses, and should be used with a lot of care unless you use it to generate f32's or f64's.
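For reference, the xoroshiro128+ step itself is tiny; a sketch using the constants from the reference implementation at the time (a = 55, b = 14, c = 36), with the caveat that the state must not be all zero:

fn xoroshiro128plus(s: &mut [u64; 2]) -> u64 {
    let s0 = s[0];
    let mut s1 = s[1];
    // The sum of the two state words is the output; only this addition
    // scrambles the underlying linear engine.
    let result = s0.wrapping_add(s1);

    s1 ^= s0;
    s[0] = s0.rotate_left(55) ^ s1 ^ (s1 << 14); // a, b
    s[1] = s1.rotate_left(36); // c

    result
}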

I personally like truncated Xorshift*, and PCG. I have optimized and well-tested implementations here that are mostly ready. They are flawless under PractRand and TestU01.
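For comparison, the core of a truncated xorshift64* step looks roughly like this (a sketch using Vigna's xorshift64star constants, not necessarily the exact implementation linked above); the multiply scrambles the state and only the high half survives the truncation:

/// State must be non-zero.
fn xorshift64star_truncated(state: &mut u64) -> u32 {
    let mut x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    // Multiply, then keep only the high 32 bits as output.
    (x.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 32) as u32
}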

Benchmarks from a month ago:

2017-10-15 8:30 x86_64
test gen_u32_chacha            ... bench:      10,821 ns/iter (+/- 807) = 369 MB/s
test gen_u32_isaac             ... bench:       4,192 ns/iter (+/- 4) = 954 MB/s
test gen_u32_isaac64           ... bench:       4,238 ns/iter (+/- 23) = 943 MB/s
test gen_u32_pcg               ... bench:       1,223 ns/iter (+/- 10) = 3270 MB/s
test gen_u32_pcg64             ... bench:       1,940 ns/iter (+/- 7) = 2061 MB/s
test gen_u32_xorshift          ... bench:       1,084 ns/iter (+/- 8) = 3690 MB/s
test gen_u32_xorshiftmult      ... bench:       1,127 ns/iter (+/- 3) = 3549 MB/s
test gen_u32_xorshiftmult32    ... bench:       1,121 ns/iter (+/- 2) = 3568 MB/s
test gen_u32_xorshiftmult64_32 ... bench:       1,686 ns/iter (+/- 18) = 2372 MB/s
test gen_u32_truncated_xorshift64star   ... bench:       1,720 ns/iter (+/- 7) = 2325 MB/s
test gen_u32_xoroshiro128plus           ... bench:       1,083 ns/iter (+/- 5) = 3693 MB/s
test gen_u32_xorshift128                ... bench:       1,000 ns/iter (+/- 4) = 4000 MB/s
test gen_u32_xorshift128plus            ... bench:       1,110 ns/iter (+/- 6) = 3603 MB/s

test gen_u64_chacha            ... bench:      21,445 ns/iter (+/- 95) = 373 MB/s
test gen_u64_isaac             ... bench:       8,123 ns/iter (+/- 56) = 984 MB/s
test gen_u64_isaac64           ... bench:       4,229 ns/iter (+/- 13) = 1891 MB/s
test gen_u64_pcg               ... bench:       3,544 ns/iter (+/- 19) = 2257 MB/s
test gen_u64_pcg64             ... bench:       1,942 ns/iter (+/- 10) = 4119 MB/s
test gen_u64_xorshift          ... bench:       2,639 ns/iter (+/- 5) = 3031 MB/s
test gen_u64_xorshiftmult      ... bench:       1,313 ns/iter (+/- 4) = 6092 MB/s
test gen_u64_xorshiftmult32    ... bench:       3,415 ns/iter (+/- 4) = 2342 MB/s
test gen_u64_xorshiftmult64_32 ... bench:       4,385 ns/iter (+/- 18) = 1824 MB/s
test gen_u64_truncated_xorshift64star   ... bench:       4,522 ns/iter (+/- 18) = 1769 MB/s
test gen_u64_xoroshiro128plus           ... bench:       1,083 ns/iter (+/- 4) = 7386 MB/s
test gen_u64_xorshift128                ... bench:       1,000 ns/iter (+/- 4) = 8000 MB/s
test gen_u64_xorshift128plus            ... bench:       1,111 ns/iter (+/- 3) = 7200 MB/s

2017-10-15 8:30 x86
test gen_u32_chacha            ... bench:      17,217 ns/iter (+/- 59) = 232 MB/s
test gen_u32_isaac             ... bench:       4,238 ns/iter (+/- 26) = 943 MB/s
test gen_u32_isaac64           ... bench:       5,825 ns/iter (+/- 32) = 686 MB/s
test gen_u32_pcg               ... bench:       3,034 ns/iter (+/- 4) = 1318 MB/s
test gen_u32_pcg64             ... bench:      13,564 ns/iter (+/- 162) = 294 MB/s
test gen_u32_xorshift          ... bench:       1,470 ns/iter (+/- 8) = 2721 MB/s
test gen_u32_xorshiftmult      ... bench:       2,923 ns/iter (+/- 9) = 1368 MB/s
test gen_u32_xorshiftmult32    ... bench:       1,729 ns/iter (+/- 2) = 2313 MB/s
test gen_u32_xorshiftmult64_32 ... bench:       2,392 ns/iter (+/- 3) = 1672 MB/s
test gen_u32_truncated_xorshift64star ... bench:       2,715 ns/iter (+/- 11) = 1473 MB/s
test gen_u32_xoroshiro128plus         ... bench:       1,955 ns/iter (+/- 9) = 2046 MB/s
test gen_u32_xorshift128              ... bench:       2,303 ns/iter (+/- 12) = 1736 MB/s
test gen_u32_xorshift128plus          ... bench:       2,434 ns/iter (+/- 8) = 1643 MB/s

test gen_u64_chacha            ... bench:      35,559 ns/iter (+/- 131) = 224 MB/s
test gen_u64_isaac             ... bench:       8,924 ns/iter (+/- 51) = 896 MB/s
test gen_u64_isaac64           ... bench:       5,961 ns/iter (+/- 126) = 1342 MB/s
test gen_u64_pcg               ... bench:       5,926 ns/iter (+/- 504) = 1349 MB/s
test gen_u64_pcg64             ... bench:      14,957 ns/iter (+/- 20,628) = 534 MB/s
test gen_u64_xorshift          ... bench:       4,590 ns/iter (+/- 120) = 1742 MB/s
test gen_u64_xorshiftmult      ... bench:      11,678 ns/iter (+/- 702) = 685 MB/s
test gen_u64_xorshiftmult32    ... bench:       3,983 ns/iter (+/- 49) = 2008 MB/s
test gen_u64_xorshiftmult64_32 ... bench:       5,882 ns/iter (+/- 82) = 1360 MB/s
test gen_u64_truncated_xorshift64star ... bench:       6,250 ns/iter (+/- 53) = 1280 MB/s
test gen_u64_xoroshiro128plus         ... bench:       2,436 ns/iter (+/- 3) = 3284 MB/s
test gen_u64_xorshift128              ... bench:       2,512 ns/iter (+/- 18) = 3184 MB/s
test gen_u64_xorshift128plus          ... bench:       2,932 ns/iter (+/- 11) = 2728 MB/s
pitdicker commented 6 years ago

The Xorshift variants in table form:

| algorithm | native | result | u32 | u64 | quality | notes |
|-----------|--------|--------|-----|-----|---------|-------|
| xorshift128/32 | 32-bit | u32 | 3693 MB/s | 3034 MB/s | bad | current implementation |
| xorshift64/32*_truncated | 32-bit | u32 | 3561 MB/s | 2339 MB/s | excellent | (-3,6%, -22,9%) |
| xorshift128/32*_truncated | 32-bit | u32 | 3270 MB/s | 2536 MB/s | excellent | (-11,4%, -16,4%) |
| xorshift128/64*_truncated | 64-bit | u64 | 3125 MB/s | 6309 MB/s | excellent | needs 128-bit support |
| xorshift128/64 | 64-bit | u64 | 4000 MB/s | 8016 MB/s | bad | |
| xorshift64* | 64-bit | u64 | 2415 MB/s | 4828 MB/s | good | |
| xorshift64*_truncated | 64-bit | u32 | 2324 MB/s | 1772 MB/s | excellent | |
| xorshift128+ | 64-bit | u64 | 3603 MB/s | 7207 MB/s | passable | |
| xoroshiro128+ | 64-bit | u64 | 3690 MB/s | 7380 MB/s | passable | |
pitdicker commented 6 years ago

As @vks says, the period of SplitMix is much too small to be used as a general PRNG, unless we also use streams, which come with their own set of problems.

vks commented 6 years ago

I'm not sure xoroshiro's failures of the binary rank test qualify as a weakness in practice, see this discussion.

pitdicker commented 6 years ago

I saw it after my post 😄.

This is from memory, but I ran Xoroshiro+ through the tests with the binary rank tests disabled, and there were other tests it failed as well. I also tested it with the bits reversed, but don't remember the results.

Simply put, it doesn't mix (avalanche) its bits enough. Xorshift (or plain Xoroshiro, if that were a thing) has patterns. A simple addition reduces those patterns, but is not enough to remove them. Something like a multiply should be used instead. Using a large state also helps mask those patterns.

Not to say that I don't think Xoroshiro+ can have its uses. Xorshift has its uses too. Especially when converted to floating point, weaker least significant digits should just about never matter. But any user of it should be careful.

@dhardy already raised good points. Do you know the impact of the weaker bits on: (1) generating booleans, (2) the ziggurat table lookup, (3) sampling from a range, (4) our custom, high precision conversions to double? And those are just a few uses in rand. Now it may be useful for rand to handle weaker RNG's nicely anyway.

For general use, I think there are better PRNG's we can go with. Then we can have a simple guideline: if you need a good RNG but don't need to worry about predictability (e.g. no adversaries), use Xorshift* or PCG. If you need unpredictability, use a cryptographic RNG.

Of course performance is a thing. But I consider anything that is on average close to / better than our current Xorshift extremely fast and good enough.

vks commented 6 years ago

Xorshift (or plain Xoroshiro, if that were a thing) has patterns.

What kind of patterns? If I understand correctly, all linear generators have patterns, the question is whether they are observable.

Something like a multiply should be used instead.

Why? (Are you basing this on O'Neill's blog post?) It's not clear that this is better. I recently had an email exchange with Sebastiano Vigna about the merits of xoroshiro* compared to xoroshiro+. He did some experiments showing that xoroshiro+ has some Hamming dependencies after a lot of values; in this regard xoroshiro* is slightly better. But the lowest two bits of xoroshiro* are the same LFSR, while for xoroshiro+ the second lowest bit is a perturbed LFSR, so in that regard xoroshiro+ is slightly better.

Using a large state also helps masking those patterns.

I'm not sure which patterns you mean, but the linear dependencies don't go away by using a larger state. Large states also have trouble recovering from lots of zeros in the initial state and require more careful seeding.

Do you know the impact of the weaker bits

Well, empirically there seems to be no practical impact whatsoever. All popular RNGs (that is, the Mersenne Twister for most programming languages, and xorshift for browsers and the current rand crate) are an LFSR in all bits. AFAIK, the only known impact is on the calculation of the rank of random binary matrices. You are saying users should be careful about the linear dependencies. Why?

if you need a good RNG but don't need to worry about predictability (e.g. no adversaries), use Xorshift* or PCG.

Xorshift has linear dependencies in the lowest two bits as well. I don't see a reason to use it over xoroshiro+, unless you are talking about xorshift1024* vs. xoroshiro128+.

Of course performance is a thing. But I consider anything that is on average close to / better than our current Xorshift extremely fast and good enough.

On the other hand, performance is the only reason to use such a generator over a cryptographic one.

pitdicker commented 6 years ago

What kind of patterns? If I understand correctly, all linear generators have patterns, the question is whether they are observable.

Completely agree. And if they are observable depends on how the random numbers are used.

From the first paper by Marsaglia it was already noted that it is best to use Xorshift as a base generator and apply some output function, or combine it with a second generator of a different type. All the variants over the years, like XORWOW, XSadd, xorgens, RanQ1, Xorshift*, truncated Xorshift*, Xorshift+ and Xoroshiro+, differ in their choice of output function.

Xoroshiro+ just trades quality for a little bit of performance.

But the lowest two bits of xoroshiro* are the same LSFR, while for xoroshiro+ the second lowest bit is a perturbed LSFR, so in that regard xoroshiro+ is slightly better.

Ah, yes. I was imprecise. It is true that just multiplying affects the last two bits about as little as addition does. I once made a table of the avalanching effect of a single multiply. What makes a good output function is a multiply followed by truncating the result to only the high bits.

AFAIK, the only known impact is on the calculation of the rank of random binary matrices. You are saying users should be careful about the linear dependencies. Why?

I think we have listed a couple of the problematic cases already. Can you have a look at them?

On the other hand, performance is the only reason to use such a generator over a cryptographic one.

The performance is so good already that the difference between two PRNG algorithms is not much more than the difference caused by using one more register, or by having one instruction that cannot run in parallel with another on the processor. There are only 8 or 9 instructions per new random number (and a couple of mov's), several of which run in parallel.

At that point it seems to me almost any use of the random number will take quite a bit longer than generating one. Of course there may be uses where that last cycle makes the difference and quality is less important. Then Xoroshiro128+ is great!

Note: I will be the last to say good performance is not a good quality. Optimising RNG's is fun ;-)

pitdicker commented 6 years ago

@vks I (we) are guilty of derailing this issue 😄. If you want to reply, could you open a new issue or something and mention my name?

vks commented 6 years ago

Is this discussion off-topic? I thought the shootout was about performance and quality of the generators.

I think we have listed a couple of the problematic cases already. Can you have a look at them?

I did, and my point was that basically everyone uses RNGs in these cases where all bits have linear dependencies. I'm not aware of any problems with this. You keep saying linear dependencies are problematic, but you never say why.

dhardy commented 6 years ago

No, you're not really off-topic, but I do fear the thread will be long and not so easy to read later!

Great that you already did significant research on this @pitdicker; I didn't realise you'd done so much. Have you come across v3b? I noticed it mentioned in a comment thread earlier; can't find a source for it however.

It sounds like there may be reason to include multiple "small fast PRNGs" in rand, but I don't think anyone wants it to become a huge library.

pitdicker commented 6 years ago

You keep saying linear dependencies are problematic, but you never say why.

Okay, let me try to make a list of some of the problems in rand.

1) generating booleans

Current code for generating booleans:

impl Distribution<bool> for Uniform {
    #[inline]
    fn sample<R: Rng+?Sized>(&self, rng: &mut R) -> bool {
        rng.next_u32() & 1 == 1
    }
}

This should change to compare against some other bit than the last one to generate bools that really have a 50% chance.
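A minimal sketch of that kind of change (hypothetical; testing a high bit instead of the lowest one):

impl Distribution<bool> for Uniform {
    #[inline]
    fn sample<R: Rng+?Sized>(&self, rng: &mut R) -> bool {
        // Use a high bit rather than the least significant bit, which is the
        // weakest bit of several xorshift-family generators.
        (rng.next_u32() & (1 << 31)) != 0
    }
}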

2) the ziggurat table lookup

The 11 least significant bits of the random number are used to pick the 'layer' of the ziggurat, and the 53 most significant bits as the fraction for an f64. The f64 is then multiplied by some value that belongs to the layer.

If the few bits used for the layer are not actually random, this has relatively large effects on the shape of the distribution.

This has a mostly easy fix: only 8 bits are needed for the layer, so don't use the 8 least significant bits, but bits 3..10.
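For illustration, a hypothetical helper expressing that (bits 3..10 inclusive, i.e. 8 bits, instead of the lowest 8):

#[inline]
fn ziggurat_layer(bits: u64) -> usize {
    // Skip the weakest low bits; take bits 3..10 (8 bits) as the layer index.
    ((bits >> 3) & 0xFF) as usize
}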

3) sampling from a range

In distributions::Range we restrict the generated random numbers to a 'zone', and collapse that zone to the actual range using a modulus.

If we assume an RNG with weak least significant bits, those weaker bits will remain weak also in the much smaller range.

A solution would be to use division instead of a modulus (see the sketch below); at least I remember reading that that works.

Without changes we can't really keep the promise of a uniformly distributed range.
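A hypothetical sketch of the division idea for a u32 range, to make the difference concrete: with a modulus the low bits of x decide the result, with a division the high bits do. Rejection of values outside the accepted zone is still needed for exact uniformity, just as in the current modulus-based code.

/// Assumes `range > 0`.
fn map_to_range_div(x: u32, range: u32) -> Option<u32> {
    let scale = u32::max_value() / range;
    let zone = scale * range; // accepted zone; x >= zone must be rejected
    if x < zone {
        Some(x / scale) // result in 0..range, driven by the high bits of x
    } else {
        None // caller draws a new x and tries again
    }
}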

4) high precision conversions to double

It turns out this part is ok. We use the 53 least significant bits for the fraction, and the remaining 11 most significant bits for the exponent. This means only the last bits of the fraction are weak.

If it were reversed, some exponents would occur more often than others, and the effect of that would be huge for floats.

As you see, it takes some nontrivial thinking to see if your algorithm works well with a generator with some weaknesses. Now we could adapt this code to make using it with Xoroshiro+ safe. That seems like a good idea to me anyway.

But what to do when there is some other RNG that happens to have weaker most significant bits? Okay, I have not heard of such. But how much should generic code cater to the weaknesses of one RNG?

dhardy commented 6 years ago

Do linear dependencies in the low bits imply bias (P(1) ≠ 1/2)? I assumed not(?). Do they imply high predictability within these two bits over a very short sequence? As @vks says the weakness may not be so significant.

pitdicker commented 6 years ago

v3b comes from here: http://cipherdev.org/. src

I know nothing about it though.

pitdicker commented 6 years ago

Maybe of interest. An evaluation of Xoroshiro by the author of PractRand, and since I last looked at some replies from Sebastiano Vigna.

pitdicker commented 6 years ago

For fun I wrote a Xoroshiro64+. It is not meant to be used. I had to calculate new constants for a, b and c. Creating a smaller version is useful because something like PractRand can analyse it more easily.

use std::num::Wrapping as w; // wrapping (modular) arithmetic

#[derive(Clone, Debug)]
pub struct Xoroshiro64PlusRng {
    s0: u32,
    s1: u32,
}

impl SeedFromRng for Xoroshiro64PlusRng {
    fn from_rng<R: Rng>(mut other: R) -> Result<Self, Error> {
        let mut tuple: (u32, u32);
        loop {
            tuple = (other.next_u32(), other.next_u32());
            if tuple != (0, 0) {
                break;
            }
        }
        let (s0, s1) = tuple;
        Ok(Xoroshiro64PlusRng{ s0: s0, s1: s1 })
    }
}

impl Rng for Xoroshiro64PlusRng {
    #[inline]
    fn next_u32(&mut self) -> u32 {
        let s0 = w(self.s0);
        let mut s1 = w(self.s1);
        let result = (s0 + s1).0;

        s1 ^= s0;
        self.s0 = (w(s0.0.rotate_left(19)) ^ s1 ^ (s1 << 13)).0; // a, b
        self.s1 = s1.0.rotate_left(10); // c

        result
    }

    #[inline]
    fn next_u64(&mut self) -> u64 {
        ::rand_core::impls::next_u64_via_u32(self)
    }

    #[cfg(feature = "i128_support")]
    fn next_u128(&mut self) -> u128 {
        ::rand_core::impls::next_u128_via_u64(self)
    }

    fn fill_bytes(&mut self, dest: &mut [u8]) {
        ::rand_core::impls::fill_bytes_via_u32(self, dest)
    }

    fn try_fill(&mut self, dest: &mut [u8]) -> Result<(), Error> {
        Ok(self.fill_bytes(dest))
    }
}

Results (only the last results; all the other failures before are similar):

rng=RNG_stdin32, seed=0x6fb2c44d
length= 32 gigabytes (2^35 bytes), time= 281 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R= +18.5  p =  1.9e-12    FAIL           
  BRank(12):3K(1)                   R=+583.3  p~=  1.2e-176   FAIL !!!!!!    
  BRank(12):4K(2)                   R= +1799  p~=  1.2e-542   FAIL !!!!!!!   
  BRank(12):6K(1)                   R= +2650  p~=  9.8e-799   FAIL !!!!!!!   
  BRank(12):8K(1)                   R= +4028  p~=  1e-1213    FAIL !!!!!!!!  
  [Low8/32]BRank(12):768(1)         R=+583.3  p~=  1.2e-176   FAIL !!!!!!    
  [Low8/32]BRank(12):1K(4)          R= +2544  p~=  4e-1354    FAIL !!!!!!!!  
  [Low8/32]BRank(12):1536(1)        R= +2650  p~=  9.8e-799   FAIL !!!!!!!   
  [Low8/32]BRank(12):2K(2)          R= +5696  p~=  1e-1715    FAIL !!!!!!!!  
  [Low8/32]BRank(12):3K(1)          R= +6783  p~=  5e-2043    FAIL !!!!!!!!  
  [Low8/32]BRank(12):4K(2)          R=+13490  p~=  8e-4062    FAIL !!!!!!!!  
  [Low8/32]BRank(12):6K(1)          R=+15050  p~=  2e-4531    FAIL !!!!!!!!  
  [Low1/32]BRank(12):128(8)         R= +3598  p~=  2e-2310    FAIL !!!!!!!!  
  [Low1/32]BRank(12):256(4)         R= +8055  p~=  4e-4285    FAIL !!!!!!!!  
  [Low1/32]BRank(12):384(1)         R= +6783  p~=  5e-2043    FAIL !!!!!!!!  
  [Low1/32]BRank(12):512(4)         R=+19077  p~= 0           FAIL !!!!!!!!  
  [Low1/32]BRank(12):768(1)         R=+15050  p~=  2e-4531    FAIL !!!!!!!!  
  [Low1/32]BRank(12):1K(2)          R=+29077  p~=  4e-8754    FAIL !!!!!!!!  
  [Low1/32]BRank(12):1536(1)        R=+31582  p~=  2e-9508    FAIL !!!!!!!!  
  [Low1/32]BRank(12):2K(2)          R=+60252  p~= 0           FAIL !!!!!!!!  
  [Low1/32]BRank(12):3K(1)          R=+64648  p~= 0           FAIL !!!!!!!!  
  ...and 159 test result(s) without anomalies
dhardy commented 6 years ago

Another generator and test suite that may be worth looking into: http://gjrand.sourceforge.net/boast.html

nixpulvis commented 6 years ago

I like the idea of having multiple PRNGs, but personally would have an extension crate for them. I feel like Rust should have a canonical implementation for getting a random number.

P.S. Sorry for continuing to recommend more crates, maybe I just love making crates :P.

dhardy commented 6 years ago

As I said in the first post, that discussion is for another issue!

It may be that rand ends up with multiple similar RNGs anyway simply because ARng is included, then BRng is added because it's better, but ARng is not removed because some users are dependent on it (including on it reproducing results exactly).

vks commented 6 years ago

@pitdicker I think you are confused about the properties of LFSRs.

This should change to compare against some other bit than the last one to generate bools that really have a 50% chance.

What makes you think this is not the case for LFSRs? If you look at the distribution, it is fine. (This is not a very hard test; a generator alternately outputting true and false would pass it.)

If the few bits used for the layer are not actually random, this has relatively large effects on the shape of the distribution.

What do you mean by "not actually random"? Do you mean the bits don't have a uniform distribution? This is not the case for LFSRs. In fact, they are usually designed to be equidistributed. If there was a large effect on the distribution of the sampled values, no one would be using LFSR-based generators like the Mersenne Twister or xorshift.

If we assume an RNG with weak least significant bits, those weaker bits will remain weak also in the much smaller range.

Yes, but I don't think this "weakness" is a problem in practice.

pitdicker commented 6 years ago

Let me emphasise I have nothing against Xoroshiro128+. It is one of the fastest small PRNG's currently in use, and it does not perform terribly on most statistical tests.

But there are other RNG's that have up to 80% of the performance of Xoroshiro128+, and that don't have detectable statistical weaknesses. I think the question should be:

@vks You raise a good point by questioning how much statistical weaknesses matter for real applications. That depends on the algorithm it is used in, and requires evaluation for every case. I can't answer it. Do we want every user to ask himself that question?

Again I want to note that a 20~25% performance difference sounds like a lot, but it is a difference of only one or two clock cycles.

Do linear dependencies in the low bits imply bias (P(1) ≠ 1/2)? I assumed not

I think you are confused about the properties of LFSRs

You are right, my bad. It follows patterns, but the chance for one bit to be 1 or 0 should remain about 50%.

pitdicker commented 6 years ago

@dhardy Do you plan on collecting all small RNG's mentioned here in one repository/crate, for easy comparison? That would be an interesting collection. Or just mention them here?

dhardy commented 6 years ago

I'm considering that, but don't know if I will get around to it or not. It's not a priority anyway.

vks commented 6 years ago

You raise a good point by questioning how much statistical weaknesses matter for real applications. That depends on the algorithm it is used in, and requires evaluation for every case. I can't answer it. Do we want every user to ask himself that question?

It also depends on the statistical weakness in question. I looked up the paper where the binary rank test was introduced (Marsaglia, Tsay 1985):

Many Monte Carlo studies, particularly in combinatorics and graph theory, call for random incidence matrices, elements 0 or 1 to represent the absence or presence of some property. It is natural to let the rows of such a random matrix be formed by successive computer words, or portions of words, produced by a random number generator.

Shift-register or F(r, s, ⊕) generators are not suitable for such use. In order that the sequence of 1 x n binary vectors β, βT, βT²,... have long period, it is necessary that the first n vectors in the sequence be linearly independent. Thus a binary matrix with rows formed by n or fewer vectors produced by a shift-register generator will have rank m with probability (1 - 2^(-n))(1 - 2^(1-n))...(1 - 2^(m-1-n)), about 0.30 when m=n and n≥10 or so.

[...]

If the rows of the m x n binary matrix are successive computer words, or portions thereof, produced by a random number generator, then the rank of the matrix should have the distribution given by the probabilities above. Shift-register generators will fail such tests, as will F(r, s, ⊕) generators. Congruential and F(r, s, ±) usually pass.

So the lowest two bits in xoroshiro might give you some problems when generating random incidence matrices. I don't think this is a common use case, as those linear dependencies are the status quo of the currently used RNGs. It seems the authors of xoroshiro thought it was worth the tradeoff (xoroshiro website):

With more operations it could be possible to completely eliminate the linear artifacts, but I consider this a waste of time: as we've seen, some of the most used generators around have linear dependencies in every bit, and this is not considered a problem.

So the question is whether we want to improve on the status quo at the cost of some performance. In my opinion, it is not necessary, because it does not seem that users expect this property from an RNG.

Should we write something like an RFC? All this binary-rank discussion is a bit scattered at the moment...

dhardy commented 6 years ago

Write an RFC for what? If you want to create your own RNG/variant, you're better off doing so yourself, testing, and then getting feedback directly from other PRNG authors, IMO. I don't think there are many Rust people who are really knowledgeable on the subject.

vks commented 6 years ago

No, I don't want to create an RNG, I meant an RFC for choosing the recommended "fast" RNG in rand.

dhardy commented 6 years ago

Oh. Well, go ahead if you think you've already got enough information to write one. I assumed more research / testing was needed; I'm not so knowledgeable on the subject.

pitdicker commented 6 years ago

Another generator and test suite that may be worth looking into: http://gjrand.sourceforge.net/boast.html

Rust version of the generator:

pub struct GjRandRng {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
}

impl SeedFromRng for GjRandRng {
    fn from_rng<R: Rng>(mut other: R) -> Result<Self, Error> {
        Ok(GjRandRng{ a: other.next_u64(),
                      b: other.next_u64(),
                      c: other.next_u64(),
                      d: other.next_u64()})
    }
}

impl Rng for GjRandRng {
    #[inline]
    fn next_u32(&mut self) -> u32 {
        self.next_u64() as u32
    }

    #[inline]
    fn next_u64(&mut self) -> u64 {
        let mut a = self.a;
        let mut b = self.b;
        let mut c = self.c;
        let mut d = self.d;

        // Crank
        b = b.wrapping_add(c);
        a = a.rotate_left(32);
        c ^= b;

        d = d.wrapping_add(0x55aa96a5);

        a = a.wrapping_add(b);
        c = c.rotate_left(23);
        b ^= a;

        a = a.wrapping_add(c);
        b = b.rotate_left(19);
        c = c.wrapping_add(a); // wrapping add, to avoid overflow panics in debug builds

        b = b.wrapping_add(d);

        self.a = a;
        self.b = b;
        self.c = c;
        self.d = d;

        a
    }

    #[cfg(feature = "i128_support")]
    fn next_u128(&mut self) -> u128 {
        ::rand_core::impls::next_u128_via_u64(self)
    }

    fn fill_bytes(&mut self, dest: &mut [u8]) {
        ::rand_core::impls::fill_bytes_via_u32(self, dest)
    }

    fn try_fill(&mut self, dest: &mut [u8]) -> Result<(), Error> {
        Ok(self.fill_bytes(dest))
    }
}

Benchmark (with a few other rng's for comparison):

test gen_u32_gjrand                   ... bench:       3,332 ns/iter (+/- 17) = 1200 MB/s
test gen_u32_xoroshiro128plus         ... bench:       1,091 ns/iter (+/- 14) = 3666 MB/s
test gen_u32_xorshift128              ... bench:         977 ns/iter (+/- 6) = 4094 MB/s
test gen_u32_xorshift128plus          ... bench:       1,110 ns/iter (+/- 4) = 3603 MB/s
test gen_u64_gjrand                   ... bench:       3,340 ns/iter (+/- 59) = 2395 MB/s
test gen_u64_xoroshiro128plus         ... bench:       1,082 ns/iter (+/- 6) = 7393 MB/s
test gen_u64_xorshift128              ... bench:         979 ns/iter (+/- 9) = 8171 MB/s
test gen_u64_xorshift128plus          ... bench:       1,111 ns/iter (+/- 5) = 7200 MB/s

PractRand does not report any anomalies for the first 64 gigabytes. Edit: also clean after 16 terabytes

pitdicker commented 6 years ago

I have collected several of the small RNGs mentioned in this issue here: https://github.com/pitdicker/small-rngs. 15 at the moment. PR's welcome for more 😄 .

Lokathor commented 6 years ago

Two more PRNGs for folks to throw into their benchmark frameworks because people don't seem to have PCG examples yet (yeah I know that's a lot of labeled constants but it all optimizes down anyway):

/// 64 bits of natural output per step
pub fn pcg_rxs_m_xs_64_64_fixed(state: &mut u64) -> u64 {
    const lcg_magic_mult64: u64 = 6364136223846793005;
    const state_bits: u64 = 64;
    const output_bits: u64 = 64;
    const rxs_select_bits: u64 = 5;
    const used_bits: u64 = output_bits + rxs_select_bits;
    const half_used: u64 = used_bits / 2;
    const rxs_select_shift: u64 = state_bits - rxs_select_bits;
    const rxs_base_shift: u64 = rxs_select_bits + (32 / state_bits);
    const magic_middle_mult64: u64 = 12605985483714917081;
    let xor_64 = |n, bits| n ^ (n>>bits);
    let st = *state;
    *state = st.wrapping_mul(lcg_magic_mult64).wrapping_add(12345);
    let rxs_selection = st >> rxs_select_shift;
    let rxsd = xor_64(st, rxs_base_shift + rxs_selection);
    xor_64(rxsd.wrapping_mul(magic_middle_mult64), half_used)
}

and the slightly faster

/// 32 bits of natural output per step
pub fn pcg_xsh_rr_64_32_fixed(state: &mut u64) -> u32 {
    const lcg_magic_mult64: u64 = 6364136223846793005;
    const state_bits: u64 = 64;
    const output_bits: u64 = 32;
    const rotate_bits: u64 = 5;
    const used_bits: u64 = output_bits + rotate_bits;
    const half_used: u64 = used_bits / 2;
    const discard_shift: u64 = state_bits - used_bits;
    const rotation_shift: u64 = state_bits - rotate_bits;
    let xor_64 = |n, bits| n ^ (n>>bits);
    let st = *state;
    *state = st.wrapping_mul(lcg_magic_mult64).wrapping_add(12345);
    let xord: u64 = xor_64(st, half_used);
    let xord_shifted: u32 = (xord >> discard_shift) as u32;
    let rand_rotate_amount: u32 = (st >> rotation_shift) as u32;
    xord_shifted.rotate_right(rand_rotate_amount)
}
Ichoran commented 6 years ago

I would second the suggestion to take the PCG generators seriously: they have quite good quality, tunable amounts of state, and very good speed.

The one I use for my go-to "fast non-broken PRNG" is (in Scala--I haven't ported it to Rust yet):

// Algorithm taken from PCG generators by Melissa O'Neill (Apache 2 license); RXS M XS 64 variant (one sequence)
final class Pcg64(state0: Long = java.lang.System.nanoTime) extends PrngState64 with Copy[Pcg64] {
  private[this] var myState = state0
  def copy: Pcg64 = new Pcg64(myState)
  def state64 = myState
  def setFrom(l: Long): Boolean = { myState = l; true }
  final def L = {
    myState = (myState * 6364136223846793005L) + 1442695040888963407L
    val l = ((myState >>> ((myState >>> 59) + 5)) ^ myState) * 0xAEF17502108EF2D9L   // 12605985483714917081 base 10
    (l >>> 43) ^ l
  }
}

where L is the method that actually generates a new Long (i.e. u64).

I'm not sure all the constants in the Rust code @Lokathor posted actually help convey how compact the algorithm actually is.

Lokathor commented 6 years ago

Yeah, it's not the most compact rendition. It's from a version of the code that I had where I wanted to show exactly where each magic number is coming from. Any version included in an "official" PCG release would possibly want to clean it up (though the compiler doesn't actually care either way).

One PCG idea I've tinkered a bit with recently is that you can advance the generator the same for any output mode and then pick your permutation based on if next_u32 or next_u64 is being called. On my machine, the 32 bit output permutation runs somewhat faster than the 64 bit output permutation (~2.1ns compared to ~2.8ns), enough of a difference that a person who only needs 32 bits per call might choose to stick to next_u32.

Ichoran commented 6 years ago

A basic u64 -> u64 version might look more like so (straight port from Scala--the additive constant is different than the one you used; not sure why?):

fn pcg64(state: &mut u64) -> u64 {
    let x = (*state).
        wrapping_mul(6364136223846793005u64).
        wrapping_add(1442695040888963407u64);
    *state = x;
    let y = ((x >> ((x >> 59) + 5)) ^ x).
      wrapping_mul(0xAEF17502108EF2D9u64);
    (y >> 43) ^ y
}

Unfortunately, the (necessary) wrapping_mul stuff really obscures the math :(

pitdicker commented 6 years ago

PCG is for now my favourite also.

The RXS M variants are definitely the fastest of all the permutations, and they happen to pass the statistical tests. But using them is not recommended by O'Neill. The problem is that the period is very short (2^64 for PCG RXS M XS 64), and that the number of bits in the output is the same as the size of the state. This makes the generalized birthday problem visible after a tiny number of generated values, even fewer than the square root of the period, 2^32.

We are better off using the 'normal' PCG variants XSL and XSH.

Lokathor commented 6 years ago

@Ichoran If you want to cut it down for clarity, note that the u64 tags on the constants are not necessary. Rust will know what to do already because of the type of the wrapping_foo functions. It's not nearly as stupid as C compilers are.

Also note that you can have the state and inc values be stored as Wrapping values if you want to be able to use the operators and you just unwrap the number into a normal value as you return it.
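For illustration, @Ichoran's pcg64 with the state held as a Wrapping<u64> could look like this (just a sketch of the style, not a proposed API):

use std::num::Wrapping;

fn pcg64_wrapping(state: &mut Wrapping<u64>) -> u64 {
    // LCG advance using ordinary operators thanks to Wrapping.
    *state = *state * Wrapping(6364136223846793005) + Wrapping(1442695040888963407);
    let x = state.0;
    // RXS M XS output permutation, as in the version above.
    let y = ((x >> ((x >> 59) + 5)) ^ x).wrapping_mul(0xAEF1_7502_108E_F2D9);
    (y >> 43) ^ y
}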

The additive constant should actually be user-selectable during generator creation in the "full" version of a PCG generator. Using a fixed add is only if you really want to save on space (and you probably don't care that much). Otherwise you should take an inc value at creation, flip the low bit on so that it's odd, and then use that. The only requirement on the inc value is that it's odd, it has no other magic.

Example:

pub struct PCG {
    state: u64,
    inc: u64,
}

impl PCG {

    /// This can be set to any LCG constant. You can check [the Wikipedia
    /// entry](https://en.wikipedia.org/wiki/Linear_congruential_generator#Parameters_in_common_use)
    /// for some commonly used values. Note that not all constants are equally good,
    /// so read the rest of that page carefully if you're trying to come up with
    /// your own. Test whatever value you pick as well. Changing the LCG multiplier
    /// value effectively makes an entirely new generator, it's a much bigger effect
    /// than adjusting the number stream by changing the increment value.
    const lcg_magic_mult64: u64 = 6364136223846793005;

    pub fn new(state: u64, inc: u64) -> Self {
        let inc = inc | 1;
        Self { state, inc }
    }

    fn step_generator(&mut self) {
        self.state = self.state.wrapping_mul(PCG::lcg_magic_mult64).wrapping_add(self.inc);
    }

    // other methods for actual output would exist as well of course, yada yada
}

Also note that the point of running the permutation on the old state while you advance to the new state is that it theoretically allows for Instruction Level Parallelism.
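A hypothetical output method on the PCG struct above makes that concrete; the permutation is the same XSH RR 64/32 one as in pcg_xsh_rr_64_32_fixed:

impl PCG {
    pub fn next_u32(&mut self) -> u32 {
        let old = self.state;
        // Advance first; the permutation below only needs `old`, so the
        // multiply-add of the next state can overlap with it (ILP).
        self.step_generator();
        // XSH RR 64/32 output permutation.
        let xorshifted = ((old >> 18) ^ old) >> 27;
        let rot = (old >> 59) as u32;
        (xorshifted as u32).rotate_right(rot)
    }
}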

@pitdicker I suppose I'd love to not use the RXS M XS version and all, but if we want to support u64 output how do we do an XSH based version without u128 support in stable rust?

pitdicker commented 6 years ago

I suppose I'd love to not use the RXS M XS version and all, but if we want to support u64 output how do we do an XSH based version without u128 support in stable rust?

That is the reason why I have not yet made a PR 😄. Luckily, our use of u128 would be completely internal. We could for the moment depend on extprim, or implement u128 multiplication and addition ourselves until it is stable (using inline assembly or u64's).
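For example, the 64 x 64 -> 128 bit multiply needed for the LCG step can be built from u64 halves (a rough sketch; real code would also want the 128-bit add with carry):

/// Full 64 x 64 -> 128 bit multiply, returning (low, high) halves.
fn mul_64_to_128(a: u64, b: u64) -> (u64, u64) {
    let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
    let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);

    let lo_lo = a_lo * b_lo;
    let lo_hi = a_lo * b_hi;
    let hi_lo = a_hi * b_lo;
    let hi_hi = a_hi * b_hi;

    // Combine the partial products; `mid` collects the carries into bits 32..64.
    let mid = (lo_lo >> 32) + (lo_hi & 0xFFFF_FFFF) + (hi_lo & 0xFFFF_FFFF);
    let low = (lo_lo & 0xFFFF_FFFF) | (mid << 32);
    let high = hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
    (low, high)
}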

Ichoran commented 6 years ago

It really would be good to have PCG XSL RR 128/64 (MCG) as the default; it's very nearly as fast (according to O'Neill) as the 64-bit-state one, which gives it a splendid combination of speed and quality. (It has a 2^126 period.)

Regarding u64: I left that on for clarity, though maybe I'm the only one who thinks it's clearer. I hadn't noticed the Wrapping types, though! Should have done that.

Lokathor commented 6 years ago

@Ichoran I'm all for a 128/64 PCG if we can figure out how we want to handle our 128 bit types. We'd have to benchmark and such once we got things going if we build it ourselves, because if we spoof 128 bit operations then maybe that doesn't get sent to LLVM properly and it ends up way slower than actually using 128 bit CPU registers and all that. There are (occasional) downsides to using a language that's only 2 years old.

pitdicker commented 6 years ago

@Ichoran Maybe you'd like this post in the next issue: https://github.com/dhardy/rand/issues/60#issuecomment-346865917. PCG XSL RR 128/64 (MCG) is great on x86_64, but does not do well on x86 because that architecture can't do 128 bit multiplication well.

Lokathor commented 6 years ago

@pitdicker Hmm, hold on a minute. You say that "The RXS M variants are definitely the fastest of all the permutations", but in my (limited I admit) benchmarking the rxs_m_xs_64_64 version was around 30% slower than xsh_rr_64_32.

Ichoran commented 6 years ago

Slow 128-bit math on 32 bit platforms is an issue. If we can't get around that, we would need something that would be called PCG XSL RR 64/64 (EXT 2), which ought to be around 20% slower than the 128-bit version.

@Lokathor - Were you benchmarking random numbers / second or GB of random data per second? You get different results depending on which metric you use.

Lokathor commented 6 years ago

I was measuring in terms of calls / second.

Since we're offering next_u32 as part of the Rng trait, it's a concern to think about. If you've got next_u64 available and you call next_u32 anyway, then clearly you only need 32 bits to get whatever done, and so you'd want just that to be as fast as possible. Defaulting to the full next_u64 process and then throwing away half the bits is probably a bad choice if we can make next_u32 faster by using an alternate permutation. The 32-bit stream of an Rng already isn't going to match the 64-bit stream, so we might as well make it go as quick as we can.

Ichoran commented 6 years ago

@Lokathor - I think you're just saying that if you ask for a 32 bit output, you can use the 32 bit permutation function instead of the 64 bit one? Yes, absolutely! That's kind of the point of PCG generators: the random number stream is determined by the LCG (or MCG or whatever), but the output is hashed to take full advantage of the randomness delivered by the LCG.

Also, it has the nice property that if you call next_u32 and then next_u64, you get the same second random number as if you called next_u64 twice.

Lokathor commented 6 years ago

Exactly. I have a "full working version" of such a generator written all the way out if you want to take a look.

Ichoran commented 6 years ago

@Lokathor - That looks good except that the use of an extra 8 bytes just to store inc makes the generator considerably less attractive: the period is as short as a generator of half the size, but for cases where very many random number streams need to be kept track of, the needed state is doubled. I'd rather have more state for a default RNG.