dhardy / rand

http://doc.rust-lang.org/rand
Other
2 stars 2 forks source link

Requirements for a small fast RNG #60

Open pitdicker opened 6 years ago

pitdicker commented 6 years ago

This are my personal requirements for picking a default small PRNG. It seems like we should have some set of requirements to apply to https://github.com/dhardy/rand/issues/52, beyond benchmarks. @dhardy, @vks: What do you think?

A user should not need to worry about:

This translates to the following requirements, in RNG terms:

Other qualities to look for are:

Speed and memory

This seems like a clear requirement. What differentiates small fast RNGs from cryptographic ones, is that they are very fast and / or use much less memory. Between a dozen and up to twenty instructions seem to be the norm. Also a state of about 16 bytes is usually enough, and using more memory is wasting.

Not consuming to many results

What should be the minimum amount of results that can be consumed? On today's hardware, a fast RNG can generate about 2^32 results in just 4 seconds. In most cases it is not reasonable to expect a process to do little else than generate random numbers, at least not without also requiring them to be cryptographically secure. With the exception of simulations maybe. 2^47 results take about a day to be generated. It seems anything that can generate a usable range close to 2^50 seems good enough.

Independently initialized RNGs having a chance of producing similar streams

Many RNGs have only one period. When you pick some random initial state, there is a chance to end up in the same range as the results of another generator. Lets take a period of 2^128, and 2^48 as an upper limit for the expected amount of used results. It then takes about 2^32 random initializations to have a chance of 1 in a million to repeat part of an already used stream. See this comment, although some numbers are off by a factor 1000. So a period of 128 bits seems like a minimum requirement.

An alternative is RNGs with a small period and support for multiple streams, like PCG (when using LCG as a base generator) and SplitMix offer. A period of only 2^64 is just about enough to have a chance of 1 in 2000 for one new RNG to end up within the stream of one other RNG. If the RNG has 2^63 possible streams like PCG, 2^27 initializations are possible before the stream has a change of 1 in 1000 of being the same. Conclusion: support for ~2^64 streams is just enough to make a period of ~2^64 good enough for worry-free initialization.

All the above relies on the RNG taking a seed that has as many bits as the log2 of the period, 128. If the RNG can only take less bits, initialization may still be worry-free if the RNG has a proven method to map that seed to a state that can not be in the usable range of another seed. An example could be initializing PCG with a fixed state, and a random u64 for the stream. With just a u64 it is possible to guarantee 2^22 initializations and still having a change of only 1 in a million of ending up in the same stream. I am not aware of any RNG that offers this, besides it being quite pointless.

For small chaotic RNGs it is not really possible to talk about periods. The sizes calculated above for periods should map to state sizes if the RNG is perfect. So the state of chaotic RNGs should be at least 128 bits. I am not yet clear on what guarantees are available for this category of RNGs. Especially how cycles effect the birthday problem when initializing.

Whether statistical biases it contains may or may not matter

The standard way to evaluate the statistical quality of RNGs is through test suites, with PractRand and the BigCrush test from TestU01 being the current state of the art.

Many small chaotic RNGs pass without problems, and are quite fast.

For RNG's with a fixed period the recipe seems to be: take a fast, low quality RNG as a base generator, and apply a permutation function. Only the simple base generators like Xorshift, LCG or a Wehl sequence fail quickly.

Whether a statistical biases of an RNG matter, depends on how the generated values get used. For many uses a little bias of patterns does not matter. On the other hand, when it does matter but someone does not realize so, it is very hard to notice. The values appear random after all.

RNGs that have a clear bill from BigCrush and PractRand can be used without worry. At least for all situations where there is no adversary that tries to predict the RNG, use cryptographic RNGs in such a case.

Not being trivially predictable

On the PCG blog a post about trivial predictability is worth a read. For some algorithms cryptographically secure PRNG are unnecessary heavy, while some difficulty predicting the next results is enough to deter adversaries. One example is randomized quicksort. Not being trivially predictable can be seen as a plus, but not a requirement.

Reproducable results also on architectures with a different endianness

Even more then cryptographic RNGs, simple RNGs are used in many situations that need reproducible results. Then endianness should not be a worry.

Reasonably good theoretical basis

If the RNG has at least some theoretical background, requirements like the period size or cycle length can be proven. I would say we need those guarantees for the RNG do be made a default in Rand. While a published paper is nice, even comments in the source code with math that makes sense seem good enough for me.

License that can be used in Rand.

Our licenses are MIT and Apache 2.0. For RNGs that means they should have compatible licenses, like MIT or public domain. Anything higher like BSD or GPL would need permission from the author.

pitdicker commented 6 years ago

Evaluation of Bob Jenkins small fast RNG.

This falls in the category of chaotic RNGs. It is reported to pass PractRand and BigCrush. There exist three variants: a 32-bit version, a 64-bit version, and 32-bit version with three rotations. License is public domain.

Speed and memory: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_jsf32               ... bench:       2,229 ns/iter (+/- 1) = 1794 MB/s
test gen_u32_jsf64               ... bench:       2,312 ns/iter (+/- 4) = 1730 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_jsf32               ... bench:       3,078 ns/iter (+/- 3) = 2599 MB/s
test gen_u64_jsf64               ... bench:       2,311 ns/iter (+/- 5) = 3461 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_jsf32               ... bench:       2,680 ns/iter (+/- 34) = 1492 MB/s
test gen_u32_jsf64               ... bench:       7,595 ns/iter (+/- 104) = 526 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_jsf32               ... bench:       3,838 ns/iter (+/- 63) = 2084 MB/s
test gen_u64_jsf64               ... bench:       7,661 ns/iter (+/- 73) = 1044 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

The performance of jsf is not great. Note that I did not optimise it heavily. But the baseline Xorshift128/32 is not all that fast already, and jsf has in many situations only ~50% the performance of that.

The state size is 128 bits for the 32-bit variant, and 256 bits for the 64-bit variant.

Initialization and cycles. Officially a seed of only 32 bits is supported, or 64 bits for the 64-bit version. For this RNG short cycles are possible, even of length 1. For the 2^32 possible seeds of the 32-bit version, okay-ish tests are done to make ensure they all do not result in a short cycle (with short being 2^20 rounds).

PractRand includes a variant that can take 128 bits as seed. This opens up the implementation to short cycles, of which only 6 are known until know. But only cycles of length 0 are explored, there could be many more. So adjusting the initialization routine to take a larger seed does not seem wise.

There is some probabilistic guessing around 32-bit seeds and how long it takes for one seed to run into another seed, but it does not sound very convincing to me.

The shortest cycle is expected to be 2^94, but this is unproven.

Conclusion Based on performance, little theoretical background, and very small seeds jsf would not be a good choice for a default small RNG.

pitdicker commented 6 years ago

Evaluation of a small RNG by Geronimo Jones

I am not sure what the basis of this RNG is. It is reported to pass PractRand and BigCrush. The RNG is mainly designed for 64-bit architectures. License is GPL v2 or v3.

Speed and memory: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_gj                  ... bench:       3,333 ns/iter (+/- 3) = 1200 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_gj                  ... bench:       3,333 ns/iter (+/- 3) = 2400 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_gj                  ... bench:       8,787 ns/iter (+/- 118) = 455 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_gj                  ... bench:       8,812 ns/iter (+/- 125) = 907 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Performance is pretty bad, especially on 32-bit architectures. I am not sure why, because it does only a few operations, that should even be somewhat parallelizable. The assembly looks plausible to me, but LLVM tries to fit 4 u64 into three registers.

Initialization and period. gjrand provides various initialization routines, taking 32 bits, 64, 128, and 128 + a nonce as seeds. From the authors site: "For any seed the period is guaranteed a multiple of 2^64. The sequence for a seed will not 'run into' the sequence for another for at least 2^64 steps." "Both design and testing aim for different seeds getting different sequences with an extremely good degree of statistical independence." This should equal about 2^27 random initializations before the stream has a change of 1 in a million of being the same. This is a minimal guarantee, the actual number of initializations should be higher.

Conclusion GJ seems good, practical, and can 'tick most of boxes'. The GPL license is a problem, but we could ask for permission. But there are other RNGs with better performance than gj that might be better as a default small RNG.

pitdicker commented 6 years ago

Evaluation of a Small Fast Counting RNG

The sfc RNG is designed by Chris Doty-Humphrey, the author of PractRand. It is reported to pass PractRand and BigCrush. There are variants of the RNG for 16-, 32- and 64-bit architectures. License is public domain.

It is described as: "It's basically a good small chaotic RNG driven by a bad smaller linear RNG. The combination gives it the strengths of each — good chaotic behavior, but enough structure to avoid short cycles."

Speed and memory: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_sfc_32              ... bench:         980 ns/iter (+/- 4) = 4081 MB/s
test gen_u32_sfc_64              ... bench:       1,221 ns/iter (+/- 1) = 3276 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_sfc_32              ... bench:       3,218 ns/iter (+/- 9) = 2486 MB/s
test gen_u64_sfc_64              ... bench:         980 ns/iter (+/- 3) = 8163 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_sfc_32              ... bench:       2,434 ns/iter (+/- 7) = 1643 MB/s
test gen_u32_sfc_64              ... bench:       4,125 ns/iter (+/- 14) = 969 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_sfc_32              ... bench:       3,568 ns/iter (+/- 15) = 2242 MB/s
test gen_u64_sfc_64              ... bench:       4,914 ns/iter (+/- 17) = 1628 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Sfc has extremely good performance on 64-bit architectures. Of the RNG's I have collected, the 64-bit version is actually the fastest RNG to generate u64's, and the 32-bit version the fastest for u32. Only on 32-bit architectures there are Xorshift variants that can outperform it, but only one that also does not fail PractRand and BigCrush.

Also an advantage over some other PRNGs is that it does not do multiplication, which can be slow on some architectures.

The state includes 3 u32's (u64 for the 64-bit variant) and a counter.

Initialization and period. Initializing the 32-bit variant takes 96 bits. Thanks to an additional 32-bit counter it takes a lot of initializations before the stream has a change of 1 in a million of being the same. I suspect at least 2^38. The 64-bit variant takes 196 bits as seed.

The 32-bit variant has 2^128 states, the cycle length should on average be 2^127. But only a minimum of 2^32 is guaranteed, so consuming more than 2^32 results is maybe not advisable.

The 64-bit variant has 2^256 states, with a cycle length of on average 2^255. A minimum cycle length of 2^64 is guaranteed, thanks to the counter. That makes it good enough for any use.

Conclusion The 64-bit version of sfc is great!

The minimum guaranteed cycle length of the 32-bit version is to short, but I wonder if that guarantee is not overly conservative.

On 32-bit architectures there may be faster RNGs.

pitdicker commented 6 years ago

Evaluation of Xorshift

The Xorshift RNG is designed by George Marsaglia. It performs spectacularly bad on PractRand and BigCrush, and should really be used with an output function. There are variants for 32- and 64-bit architectures, but based on the design principles more are possible. License is public domain.

Speed: Benchmarks from my pc. x86_64:

test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s
test gen_u32_xorshift_128_32_opt ... bench:         975 ns/iter (+/- 2) = 4102 MB/s
test gen_u32_xorshift_128_64     ... bench:         977 ns/iter (+/- 3) = 4094 MB/s

test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s
test gen_u64_xorshift_128_32_opt ... bench:       2,727 ns/iter (+/- 3) = 2933 MB/s
test gen_u64_xorshift_128_64     ... bench:         979 ns/iter (+/- 3) = 8171 MB/s

x86:

test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s
test gen_u32_xorshift_128_32_opt ... bench:       1,471 ns/iter (+/- 39) = 2719 MB/s
test gen_u32_xorshift_128_64     ... bench:       2,567 ns/iter (+/- 59) = 1558 MB/s

test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s
test gen_u64_xorshift_128_32_opt ... bench:       4,575 ns/iter (+/- 44) = 1748 MB/s
test gen_u64_xorshift_128_64     ... bench:       2,638 ns/iter (+/- 91) = 3032 MB/s

Performance is extremely good, and can even be improved a little by not doing the three xorshifts in order. They can be parallelised by doing the first xorshift of the next round while doing the second two xorshifts of the current round.

Memory and period Xorshift comes with an extension mechanism, which makes it possible to use all kinds of state sizes, and corresponding periods. A state size of 128 bits, with a corresponding period of 2^128-1 is a good choice. I don't get the use of 1024 and 4096-bit state sizes.

Conclusion This RNG is not included for serious consideration, the variants that apply various output functions are more interesting.

pitdicker commented 6 years ago

Evaluation of Xorshift128+ and Xoroshiro128+

Xorshift+ is an improvement over Xorshift and XSadd by Sebastiano Vigna. Xoroshiro+ is again an improvement over Xorshift+ by David Blackman and Sebastiano Vigna. Although there are claims they pass BigCrush, it actually fails that and PractRand. The author of PCG visualizes Xoroshiro, and it seems they are not such a big improvement over Xorshift as they are sold. The RNG's are optimised for 64-bit architectures. License is public domain.

Speed: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_xorshift_128_plus   ... bench:       1,186 ns/iter (+/- 3) = 3372 MB/s
test gen_u32_xoroshiro_128_plus  ... bench:       1,196 ns/iter (+/- 1) = 3344 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_xorshift_128_plus   ... bench:       1,186 ns/iter (+/- 3) = 6745 MB/s
test gen_u64_xoroshiro_128_plus  ... bench:       1,101 ns/iter (+/- 3) = 7266 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_xoroshiro_128_plus  ... bench:       2,442 ns/iter (+/- 21) = 1638 MB/s
test gen_u32_xorshift_128_plus   ... bench:       2,444 ns/iter (+/- 17) = 1636 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_xorshift_128_plus   ... bench:       3,014 ns/iter (+/- 5) = 2654 MB/s
test gen_u64_xoroshiro_128_plus  ... bench:       2,933 ns/iter (+/- 11) = 2727 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Performance is very good on x86_64. On x86 it is ok, but native 32-bit RNGs can be faster.

Memory and period Xorshift128+ and Xoroshiro128+ have a state size of 128 bits, with a corresponding period of 2^128-1.

Conclusion The very high performance of Xoroshiro128+ make it interesting. The quality is good enough for some purposes, but I would argue to look for other RNGs.

pitdicker commented 6 years ago

Evaluation of PCG Permuted Congruential Generators

PCG is a big improvement over the LCG class of RNGs by Melissa O'Neill. There are a large number of variants, because different base generators and different permutation functions can be combined. PCG is designed to be statistically very good, and passes PractRand and BigCrush. Variants suitable for 32 and 64-bit architectures are available. License is public domain.

Probably the most interesting combinations are:

PCG also has the samewhat rare feature of easy seeking in the output stream.

Speed: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_pcg_xsh_64_lcg      ... bench:       1,222 ns/iter (+/- 13) = 3273 MB/s
test gen_u32_pcg_xsl_64_lcg      ... bench:       1,197 ns/iter (+/- 2) = 3341 MB/s
test gen_u32_pcg_xsl_128_mcg     ... bench:       1,461 ns/iter (+/- 1) = 2737 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_pcg_xsh_64_lcg      ... bench:       3,603 ns/iter (+/- 6) = 2220 MB/s
test gen_u64_pcg_xsl_64_lcg      ... bench:       3,439 ns/iter (+/- 19) = 2326 MB/s
test gen_u64_pcg_xsl_128_mcg     ... bench:       1,468 ns/iter (+/- 6) = 5449 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_pcg_xsh_64_lcg      ... bench:       2,953 ns/iter (+/- 75) = 1354 MB/s
test gen_u32_pcg_xsl_64_lcg      ... bench:       2,948 ns/iter (+/- 29) = 1356 MB/s
test gen_u32_pcg_xsl_128_mcg     ... bench:      12,736 ns/iter (+/- 121) = 314 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_pcg_xsh_64_lcg      ... bench:       6,335 ns/iter (+/- 111) = 1262 MB/s
test gen_u64_pcg_xsl_64_lcg      ... bench:       5,769 ns/iter (+/- 68) = 1386 MB/s
test gen_u64_pcg_xsl_128_mcg     ... bench:      13,802 ns/iter (+/- 254) = 579 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Performance is pretty good on x86_64. It is best to choose between a 64/32 and 128/64-variant depending on whether you need to generate u32's or u64's. On x86 it is ok-ish, but there are not all that many RNG's of good quality that are faster. PCG-XSL-RR 128/64 (MCG) performs exceptionally bad on x86, because it needs to do 128-bit multiplication.

Memory and period The 64/32 variants use 64 bits of state, and can be combined with a 64-bit value to select a stream. This brings the memory to 128 bits. As noted in my first comment, the number of initializations that can be done before there is a chance of 1 in a million of partly duplicated streams is 2^27. Using streams is faster and uses less memory than the extension mechanism. The period is 2^64.

The 128/64 variant uses 128 bits of state. In my opinion, support for streams adds little to this variant, and performance can be improved by more than 30% by removing it. That means using a multiplicative congruential generator as basis. That brings the state to 128 bits, and the period to 2^126. For initialization 2^48 RNGs can be created before the chance of duplication gets high.

Conclusion A very interesting family of RNGs that is worth considering. Using multiplications, and especially 128-bit multiplications (which are unstable) is a disadvantage.

pitdicker commented 6 years ago

Evaluation of Velox 3b

A small generator by Elias Yarrkov. It is designed for 32-bit architectures. License is public domain.

Speed and memory: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_velox               ... bench:       2,007 ns/iter (+/- 35) = 1993 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_velox               ... bench:       3,667 ns/iter (+/- 35) = 2181 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_velox               ... bench:       2,934 ns/iter (+/- 7) = 1363 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_velox               ... bench:       5,282 ns/iter (+/- 82) = 1514 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

I don't really understand how this RNG works. It uses two arrays of 4 u32's internally, so 256 bits of state. It does not generate one value at a time, but in blocks of 4. This is uncommon for small RNGs, and more like a cryptographic RNG.

Initialization and period The RNG takes 32 bits as a seed, but I suspect it can work just as well using 128 bits. "Seeds can't begin to overlap before generating at least 2^128 blocks." This seems to indicate every seed gets its own period. Then 2^55 initializations can be done before there is a high chance of partially duplicated streams.

The minimum period for 128-bit blocks is guaranteed to be 2^128, with an expected average period of 2^255.

Conclusion The guarantees this RNG offers make it interesting. Its performance is comparable to the cryptographic RNGs like ISAAC and HC-128 (that I am working on). The performance is less than can be expected of small RNGs though. And operating in blocks of 4 output values at a time is unexpected for a simple RNG.

dhardy commented 6 years ago

Looks to me like V3b has the simplest state-advance possible (increment by 1), and a complex permutation function, hence the period is exactly 2^128 steps (each with 4 outputs, so 2^130).

Edit: no, that's not it at all. The 128-bit state v is permuted each step. A 128-bit counter is incremented each step and added to the state v. Then v is output. This means that after every four outputs the entirety of v is known, the permutations on v are also known (if one can guess the algorithm), so the difference between the next step and the permutated v are the counter. In other words, it's trivially predictable after 8 outputs.

pitdicker commented 6 years ago

I had trouble understanding that line, but that is exactly it 👍

pitdicker commented 6 years ago

Evaluation of XSM Xor-Shift and Multiply

XSM is designed by Chris Doty-Humphrey (the author of PractRand) with the goal to support random access aka seeking. It passes PractRand and BigCrush. Variants suitable for 32 and 64-bit architectures are available. License is public domain.

XSM is revised several times already, and another revision is coming up for PractRand 0.94.

The idea behind XSM is that it uses an LCG as base generator, and combines that with a relatively complex non-linear output function. I am not sure whether the LCG part is still true, it seems the base generator is more a combination of "creative" additions. I am not sure there is enough structure to it to be able to talk about periods. The Xor-Shift part is also gone after some revisions.

Speed: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_xsm32               ... bench:       2,511 ns/iter (+/- 8) = 1592 MB/s
test gen_u32_xsm64               ... bench:       2,423 ns/iter (+/- 8) = 1650 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_xsm32               ... bench:       3,901 ns/iter (+/- 17) = 2050 MB/s
test gen_u64_xsm64               ... bench:       2,423 ns/iter (+/- 8) = 3301 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_xsm32               ... bench:       3,189 ns/iter (+/- 157) = 1254 MB/s
test gen_u32_xsm64               ... bench:       6,007 ns/iter (+/- 283) = 665 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_xsm32               ... bench:       4,346 ns/iter (+/- 190) = 1840 MB/s
test gen_u64_xsm64               ... bench:       6,282 ns/iter (+/- 311) = 1273 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Conclusion This RNG could be interesting for its ability to go both forward and backwards. It is still under development. Without a good study of it or better documentation I don't trust it much.

pitdicker commented 6 years ago

Evaluation of Middle Square Weyl Sequence RNG

The Middle Square Weyl Sequence RNG is designed by Bernard Widynski. It reportedly passes BigCrush, with get suspected in PractRand after 4gb, and fails after 8gb. It operates on 64-bit words. License is GPL for the software package, but I am not sure that applies to the paper.

It combines the middle square method from John von Neumann with a Weyl sequence.

Speed and memory: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_msws                ... bench:         971 ns/iter (+/- 2) = 4119 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_msws                ... bench:         971 ns/iter (+/- 1) = 8238 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_msws                ... bench:       1,919 ns/iter (+/- 67) = 2084 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_msws                ... bench:       1,995 ns/iter (+/- 72) = 4010 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

MSWC has extremely good performance on both 32- and 64-bit architectures, outperforming al others tested so far (Also Xorshift and Xoroshiro).

The state includes 64 bits for the Middle Square state, and 64 bits for the Weyl sequence. It also needs to store one u64 for the current stream.

Initialization and period. Similar to 64-bit PCG with streams, the number of initialization that can be done before there is a chance of 1 in a million of partly duplicated streams is 2^27. The period is 2^64.

Conclusion The performance and quality of this RNG put it in a similar position as Xoroshiro128+. It is between 13 and 47% faster. Both do not pass PractRand, and BigCrush sometimes. The GPL license could be a problem.

pitdicker commented 6 years ago

SFC cycle length

According to the paper Chaotic Random Number Generators with Random Cycle Lengths for chaotic RNGs it is a good idea to talk about the chance to get a short cycle, because guarantees about minimal cycle lengths can often not be given.

The 32 bit version of the Small Fast Counting RNG (SFC) provides a minimum cycle guarantee of 2^32, thanks to the combination with a 32-bit counter. Every cycle length of the chaotic part is multiplied by this number. To get a minimum worry-free amount of 2^50 results from the RNG, we need to know the chance to get a cycle from the chaotic part of 2^50 / 2^32 = 2^18.

SFC has the good property that every state change is invertible. This implies two different states can never have the same next state. Once the state becomes the initial state, the cycle is closed.

For an invertible chaotic RNG the chance to get a cycle of 2^18 with a state of 96 bits is simply: 2^18 / 2^96 = 1 in 10^23.

So while we can't give an absolute guarantee Sfc32 provides a minimum cycle length of 2^50, the chance to get a smaller cycle is unbelievably small.

dhardy commented 6 years ago

the chance to get a cycle of 2^18 with a state of 96 bits is simply: 2^18 / 2^96

You're assuming exactly one cycle of length 2^18 exists. Why?

Without knowing more about the distribution of cycle lengths you can't say anything. There could be many fixed points (cycles of length 2^32) or it could be that most points are part of cycles with length less than 2^18, given only the information above.

pitdicker commented 6 years ago

Oops I means to say a cycle less than 2^18, and should have added much more disclaimer text 😄 .

The linked paper says we can assume the length of the cycles to follow a logarithmic distribution (formula 14). We can assume there are about ln(2^96)=67 cycles (formula 12, 13).

If there is a cycle of say 2^95, there is a 50% chance for a random initialization to end up in that one. The chance to end up in a cycle of length 1 is only 1 in 2^96.

I was making a big assumption that this RNG is well designed, and the cycles roughly follow the same distribution. Thanks to the logarithmic distribution the chance to end up in a short cycle if very small.

Sadly I see no real way to test this, even the 16-bit version has a state that is hard to test. Edit: testing the 16-bit variant should be possible, I expect to be able to measure the cycle length of 2 seeds per day on average (4 running in parallel). Might just do that...

dhardy commented 6 years ago

Sounds like SFC32 is "okay" then. Great as a member of a supporting library, not sure about making it default.

pitdicker commented 6 years ago

I have seen some comments saying that chaotic RNGs are to "random" and theoretically unpredictable to work with. And others saying that is just what you would want from an RNG :smile:.

Periodic RNGs like LCG and Xorshift are theoretically simple. Combining them with an output function, like PCG and Xorshift* do, makes them statistically strong.

You could argue both have a place in the world. For a default something periodic seems like a more conservative choice. I am just collecting and trying to be objective for now, and have a few more to go...

pitdicker commented 6 years ago

Evaluation of Xorshift / Xoroshiro with multiply and truncation

This are a variants on Xorshift* noted by Melissa O'Neil, but they may be more common. I put some effort into playing with them. A requirement to pass statistical tests is that the multiplier must be 64 bit, which has the consequence that the word size should also be 64 bit. With this they pass PractRand and BigCrush. License is public domain.

Instead of Xorshift as a base generator, Xoroshiro could be a better choice as is does one instruction less.

For 32-bit results the Xorshift/Xoroshiro result is multiplied by a constant, and truncated to the 32 most significant bits. For 64-bit results it is possible to use a widening multiply, and truncate to the middle 64 bits. On x86_64 the widening multiply is faster, on x86 combining two 32-bit results is faster.

Speed: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_xoroshiro_mt_32of128... bench:       1,060 ns/iter (+/- 9) = 3773 MB/s
test gen_u32_xoroshiro_mt_64of128... bench:       1,082 ns/iter (+/- 109) = 3696 MB/s
test gen_u32_xorshift_mt_64      ... bench:       1,280 ns/iter (+/- 4) = 3125 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_xoroshiro_mt_32of128... bench:       3,565 ns/iter (+/- 16) = 2244 MB/s
test gen_u64_xoroshiro_mt_64of128... bench:       1,094 ns/iter (+/- 12) = 7312 MB/s
test gen_u64_xorshift_mt_64      ... bench:       1,271 ns/iter (+/- 30) = 6294 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_xoroshiro_mt_32of128... bench:       3,658 ns/iter (+/- 52) = 1093 MB/s
test gen_u32_xoroshiro_mt_64of128... bench:       3,661 ns/iter (+/- 44) = 1092 MB/s
test gen_u32_xorshift_mt_64      ... bench:       3,482 ns/iter (+/- 13) = 1148 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_xoroshiro_mt_32of128... bench:       6,174 ns/iter (+/- 16) = 1295 MB/s
test gen_u64_xoroshiro_mt_64of128... bench:      11,884 ns/iter (+/- 83) = 673 MB/s
test gen_u64_xorshift_mt_64      ... bench:      11,624 ns/iter (+/- 25) = 688 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Xoroshiro with multiply and truncation is always faster than Xorshift with those output functions, so it makes sense to go with Xoroshiro as a base generator. The only difference in these tests between xoroshiro_mt_32of128 and xoroshiro_mt_64of128 is using a widening multiply on x86_64.

On x86_64 the performance is equal to Xoroshiro+, or even 12% better for 32-bit results. On x86 it performs between 33% and 53% worse than Xoroshiro+.

Memory and period As state size I choose 128 bits, two u64 words. The period is the same as the Xorshift and Xoroshiro base generators: 2^128 - 1.

Conclusion To my knowledge these are the fastest periodic RNGs that pass PractRand and BigCrush on x86_64. PCG could be a better choice on x86.

vks commented 6 years ago

What is the advantage of using xoroshiro* over xoroshiro+? The widening multiply to get rid of the linear dependencies instead of truncating?

I still don't think it's worth it to get rid of the linear dependencies.

On Wed, Nov 29, 2017, 23:50 Paul Dicker notifications@github.com wrote:

Evaluation of Xorshift / Xoroshiro with multiply and truncation

This are a variants on Xorshift* noted by Melissa O'Neil http://www.pcg-random.org/posts/other-good-options.html, but they may be more common. I put some effort into playing with them. A requirement to pass statistical tests is that the multiplier must be 64 bit, which has the consequence that the word size should also be 64 bit. With this they pass PractRand and BigCrush. License is public domain.

Instead of Xorshift as a base generator, Xoroshiro could be a better choice as is does one instruction less.

For 32-bit results the Xorshift/Xoroshiro result is multiplied by a constant, and truncated to the 32 most significant bits. For 64-bit results it is possible to use a widening multiply, and truncate to the middle 64 bits. On x86_64 the widening multiply is faster, on x86 combining two 32-bit results is faster.

Speed:

Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_xoroshiro_mt_32of128... bench: 1,060 ns/iter (+/- 9) = 3773 MB/s test gen_u32_xoroshiro_mt_64of128... bench: 1,082 ns/iter (+/- 109) = 3696 MB/s test gen_u32_xorshift_mt_64 ... bench: 1,280 ns/iter (+/- 4) = 3125 MB/s test gen_u32_xorshift_128_32 ... bench: 1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_xoroshiro_mt_32of128... bench: 3,565 ns/iter (+/- 16) = 2244 MB/s test gen_u64_xoroshiro_mt_64of128... bench: 1,094 ns/iter (+/- 12) = 7312 MB/s test gen_u64_xorshift_mt_64 ... bench: 1,271 ns/iter (+/- 30) = 6294 MB/s test gen_u64_xorshift_128_32 ... bench: 2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_xoroshiro_mt_32of128... bench: 3,658 ns/iter (+/- 52) = 1093 MB/s test gen_u32_xoroshiro_mt_64of128... bench: 3,661 ns/iter (+/- 44) = 1092 MB/s test gen_u32_xorshift_mt_64 ... bench: 3,482 ns/iter (+/- 13) = 1148 MB/s test gen_u32_xorshift_128_32 ... bench: 1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_xoroshiro_mt_32of128... bench: 6,174 ns/iter (+/- 16) = 1295 MB/s test gen_u64_xoroshiro_mt_64of128... bench: 11,884 ns/iter (+/- 83) = 673 MB/s test gen_u64_xorshift_mt_64 ... bench: 11,624 ns/iter (+/- 25) = 688 MB/s test gen_u64_xorshift_128_32 ... bench: 4,591 ns/iter (+/- 59) = 1742 MB/s

Xoroshiro with multiply and truncation is always faster than Xorshift with those output functions, so it makes sense to go with Xoroshiro as a base generator. The only difference in these tests between xoroshiro_mt_32of128 and xoroshiro_mt_64of128 is using a widening multiply on x86_64.

On x86_64 the performance is equal to Xoroshiro+, or even 12% better for 32-bit results. On x86 it performs between 33% and 53% worse than Xoroshiro+.

Memory and period As state size I choose 128 bits, two u64 words. The period is the same as the Xorshift and Xoroshiro base generators: 2^128 - 1.

Conclusion To my knowledge these are the fastest periodic RNGs that pass PractRand and BigCrush on x86_64. PCG could be a better choice on x86.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dhardy/rand/issues/60#issuecomment-347903039, or mute the thread https://github.com/notifications/unsubscribe-auth/AACCtGcLgRr9sjLyrLVc7PI59tzr8F23ks5s7X1SgaJpZM4Qpoy4 .

pitdicker commented 6 years ago

Evaluation of KISS generators

The KISS generators are a family of RNGs by George Marsaglia where he combines several small RNGs to create one statistically good RNG.

The 32 bit variant was originally designed in 1993, and it has been improved in 1999. It combines two Multiply-With-Carry generators, a LCG and an Xorshift RNG. In 2009 he designed a similar 64-bit variant. They are reported to pass BigCrush, and I believe they pass PractRand also. License is public domain.

Speed: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_kiss32              ... bench:       3,015 ns/iter (+/- 1) = 1326 MB/s
test gen_u32_kiss64              ... bench:       3,023 ns/iter (+/- 4) = 1323 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_kiss32              ... bench:       5,168 ns/iter (+/- 15) = 1547 MB/s
test gen_u64_kiss64              ... bench:       3,019 ns/iter (+/- 6) = 2649 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_kiss32              ... bench:       3,668 ns/iter (+/- 66) = 1090 MB/s
test gen_u32_kiss64              ... bench:       9,084 ns/iter (+/- 124) = 440 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_kiss32              ... bench:       8,763 ns/iter (+/- 98) = 912 MB/s
test gen_u64_kiss64              ... bench:       8,918 ns/iter (+/- 135) = 897 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Unsurprisingly combining 3 RNGs into one is does not create something very fast, although it performs better than I expected. LLVM is not very good at reordering steps from the 3 independent RNG's to take advantage of instruction level parallelism, I expect performance can be improved by maybe 10% yet.

Memory, period and seed The memory usage of the 32-bit variant is 128 bits. The two MWC generators combine to have a period of 2^60. The Xorshift component has a period of 2^32 - 1, and the LCG a period of 2^32. The period of all three combined is about 2^123.

The 64-bit variant combines three generators, and requires 256 bits of state. The MWC generator has a period of 2^121 + 2^63 - 1. The Xorshift component has a period of 2^64 - 1, and the LCG of 2^64. This gives a combined period of about 2^247.

The 32-bit variant takes 128 bits as seed, the 64-bit variant 250 (6 bits of the MWC carry-value are ignored).

Conclusion Combining three or four RNGs to make one that is statistically good makes these interesting, but their relatively low performance less so. KISS is one of the very few 32-bit RNGs that can pass BigCrush.

pitdicker commented 6 years ago

Edit: Removed this comment because it was completely wrong (If it sounds to good to be true, it probably is :smile:)

pitdicker commented 6 years ago

Evaluation of Choatic Iterations PRNG

Two ways to improve the statistical quality of simple RNG's is to add an output function, or to combine different types of RNG's together. A relatively new alternative is chaotic iterations. This technique uses 3 bits of one RNG to pick 1 to 4 iterations of another RNG. The authors published many papers about this technique, and it is is now at version 3.

In theory this technique can be used with any kind of small RNG. One of the papers on version 2 evaluated the results of using different base generators. While their statistical properties improved, it takes a large range of iterations to even pass DieHARD.

The variant in the paper describing version 3 uses Xorshift as base generator, and Blum Blum Shub to pick the chaotic iterations. This seems to pass PractRand (so far), and the authors even claim some cryptographic security. That may be a bit optimistic though.

Speed and memory: Benchmarks from my pc, with Xorshift128/32 (current RNG in rand) as a baseline. x86_64:

test gen_u32_ci                  ... bench:       5,712 ns/iter (+/- 6) = 700 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_ci                  ... bench:      10,498 ns/iter (+/- 10) = 762 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_ci                  ... bench:       5,725 ns/iter (+/- 135) = 698 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_ci                  ... bench:      10,543 ns/iter (+/- 49) = 758 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

To improve performance I have replaced the three branches with bit masking tricks (improves it 3x). The performance on x86 is the same as on x86_64, which I have not seen before.

The memory used for this version if 196 bits.

Period and initialization I expect the period of this RNG to be the period of one Xorshift times that of one BBS, and 2^32 because of the CI state. So about 2^64 2^32 2^32 = 2^128.

The seed to initialize both Xorshifts in this variant and the Blum Blum Shub generator may not be 0. A seed of 192 bits is enough for worry-free initialization.

Conclusion Chaotic Iterations to improve the quality of RNG's is an interesting technique. They certainly give non-trivial to predict output, whether it is cryptographically secure remains to be seen. Performance is not great.

TheIronBorn commented 6 years ago

I am also curious why xoroshiro is used. Melissa O'Neil noted that merely taking xoroshiro128+ and "discarding the low-order bits, to return only 32 bits ... passes PractRand and TestU01". For generating next_u64 though, xoroshiro seems like a faster alternative to concatenating two next_u32 calls.

pitdicker commented 6 years ago

It is just a variant I explored. Good link though, thank you. In another blog post she commented on the structure that becomes visible in Xoroshiro+. The author of PractRand also commented on the linearity of Xoroshiro+ output.

So while it passes PractRand and TestU01, that is only because the state space it too large for the test to see the pattern. With a multiply and truncation just about all structure is gone.

vks commented 6 years ago

It is not clear how the pattern in low dimensions generalizes to higher dimensions. I don't buy this "state space" argument.

On Fri, Dec 8, 2017, 01:49 Paul Dicker notifications@github.com wrote:

It is just a variant I explored. Good link though, thank you. In another blog post she commented on the structure that becomes visible in Xoroshiro+ http://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html#xoroshiro-with-additive-output-function. The author of PractRand also commented on the linearity of Xoroshiro+ output.

So while it passes PractRand and TestU01, that is only because the state space it to large for the test to see the pattern. With a multiply and truncation just about all structure is gone.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dhardy/rand/issues/60#issuecomment-350082588, or mute the thread https://github.com/notifications/unsubscribe-auth/AACCtF8nq2D3v1giN_NmODRO1gENvNllks5s-Eg_gaJpZM4Qpoy4 .

pitdicker commented 6 years ago

@vks What is it in Xoroshiro+ that makes you want to defend it so much? There is a very large body of work around RNG's, and Xoroshiro+ is certainly not the only one. Have you compared it objectively against others like I am trying to do here?

I like it for it's speed, and Xoroshiro (without the +) can serve as a good and somewhat novel base algorithm.

The PCG paper expressed nicely that a lot of this stuff is just engineering trade-offs between quality and performance. The last couple of decennia there has been a huge push for better quality, and Xoroshiro+ goes a bit against the flow here and aims for performance above quality. Note Vigna has published a paper about the much better Xorshift* first.

Why would we not choose something different if the performance is somewhat comparable, and the quality is better? And if quality is not an issue, where do you draw the line, why not use an even faster RNG?

dhardy commented 6 years ago

I agree with @pitdicker, we should pick a fast generator, but one with very good statistical quality. However, we can also make other generators like Xoroshiro+ available in a companion crate.

I also agree with @vks that the PCG author is making a jump claiming that the cut-down versions of generators are representative of larger versions, but lacking better analysis tools seems worth some attention.

Actually, it should be possible to do something similar with 32-bit generators. Most have too long periods to fully explore, but if took 2^32 samples (feasible based on the performance numbers), choose some fixed 16-bit number and disregarded all samples whose high 16 bits does not match this fixed number, we could perform similar analysis with the low 16-bits of the remaining numbers. (We could also do the same for the high 16 bits.)

Okay, that's quite a bit of work and I'm not convinced we need to do it, especially since all we are trying to do here is choose a default fast generator which can be changed any time (and may well not be the same on all architectures either).

pitdicker commented 6 years ago

Where to go from here?

Definitely not all RNG's ever thought up are tested here. There are just to many. I think we have a good overview of the landscape now though, and don't expect to find something much better lurking somewhere. Yet, including an LCG with a modulus that is not a power of two (which can still be efficient) and RANROT (although it is not considered relevant any more by its author) would be interesting.

We ended up with several families of RNG's that have excellent quality, and that are based on different principles:

The performance of the last two are much worse than the others, so we can cross them off. I have not included other choatic RNG's in this list because SFC is much faster and offers better quarantees.

Small Fast Counting RNG

On x86_64 this is the fastest good quality RNG. There are two variants, where the 32-bit version is the better choice for everything except generating u64's on x86_64.

SFC has two negative points:

This means SFC does not at the moment pass the 'Reasonably good theoretical basis' requirement. I think it is a good idea to contact the author and see if he want to provide some background.

The other negative point, that it is a choatic RNG, can be both an advantage and a disadvantage. I can see situations where people would not be confident in it, and would prefer a periodic RNG. So SFC is great as an alternative RNG, but I would choose a conservative periodic RNG as default.

PCG

This is probably the RNG most liked by people in the Rust community, it is mentioned often in the discussions around rand. And I think that is well deserved. It has an excellent design, and the author can clearly communicate the theory and trade-offs behind it.

The large number of possible variants and choices can be overwhelming, it is possible to pick a base RNG, permutation function, streams and extension mechanism. But we can narrow it down to two best choices:

The performance of PCG-XSL-RR 128/64 (MCG) on x86 is currently very bad. This has to do with doing 128-bit multiplications on a 32-bit architecture. What makes it worse is that the multiplication function at the moment does not get inlined, approximately halving the performance.

Xoroshiro with multiply and truncation

This is the fastest good-quality periodic RNG in this comparison. The big disadvantage is that it is put together by me. It combines well-known techniques: a Xoroshiro base generator by Vigna, multiply as permutation function as noted in "Numerical Recipes" for Xorshift*, truncation as recommended by O'Neil, and widening as a simple improvement step by me.

The problem is the choice of multiplier. For that I used a list of multipliers that perform well on spectral tests by L'Ecuyer. To generate 32-bit results the best multiplier is the one for m=2^48 modulus 2^32, because the truncation removes the last 16 bits. Similar for 64-bit results the best multiplier is the one for m=2^96 modulus 2^64.

A possible area of improvement could be to search for a good multiplier ourselves. There probably exist multipliers for just the 64 bits we use that are better than the best multiplier for 96 bits.

With inlining the widening multiply on x86, and also using widening multiply for next_u32 thanks to the improved multipliers, the performance of this RNG is now reasonably good on x86.

Comparative benchmarks

Benchmarks from my pc, and again with the current RNG in rand (Xorshift128/32) as a baseline: x86_64:

test gen_u32_sfc_32              ... bench:         980 ns/iter (+/- 4) = 4081 MB/s
test gen_u32_sfc_64              ... bench:       1,221 ns/iter (+/- 1) = 3276 MB/s
test gen_u32_pcg_xsh_64_lcg      ... bench:       1,222 ns/iter (+/- 13) = 3273 MB/s
test gen_u32_pcg_xsl_128_mcg     ... bench:       1,461 ns/iter (+/- 1) = 2737 MB/s
test gen_u32_xoroshiro_mt_64of128... bench:       1,081 ns/iter (+/- 5) = 3700 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_sfc_32              ... bench:       3,218 ns/iter (+/- 9) = 2486 MB/s
test gen_u64_sfc_64              ... bench:         980 ns/iter (+/- 3) = 8163 MB/s
test gen_u64_pcg_xsh_64_lcg      ... bench:       3,603 ns/iter (+/- 6) = 2220 MB/s
test gen_u64_pcg_xsl_128_mcg     ... bench:       1,468 ns/iter (+/- 6) = 5449 MB/s
test gen_u64_xoroshiro_mt_64of128... bench:       1,089 ns/iter (+/- 4) = 7346 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_sfc_32              ... bench:       2,434 ns/iter (+/- 7) = 1643 MB/s
test gen_u32_sfc_64              ... bench:       4,125 ns/iter (+/- 14) = 969 MB/s
test gen_u32_pcg_xsh_64_lcg      ... bench:       2,953 ns/iter (+/- 75) = 1354 MB/s
test gen_u32_pcg_xsl_128_mcg     ... bench:      12,736 ns/iter (+/- 121) = 314 MB/s
test gen_u32_xoroshiro_mt_64of128... bench:       2,441 ns/iter (+/- 12) = 1638 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_sfc_32              ... bench:       3,568 ns/iter (+/- 15) = 2242 MB/s
test gen_u64_sfc_64              ... bench:       4,914 ns/iter (+/- 17) = 1628 MB/s
test gen_u64_pcg_xsh_64_lcg      ... bench:       6,335 ns/iter (+/- 111) = 1262 MB/s
test gen_u64_pcg_xsl_128_mcg     ... bench:      13,802 ns/iter (+/- 254) = 579 MB/s
test gen_u64_xoroshiro_mt_64of128... bench:       5,881 ns/iter (+/- 75) = 1360 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

Interesting is that nothing can match the performance of generating 32-bit integers with Xorshift128/32 on x86. I think this is a necessary consequence of the thorough mixing of the state with every round that the RNG's with good statistical quality do. Xorshift128/32 only updates 1/4 of the state every round.

Picking a default

PCG has the big advantage that is of good quality, with good presentation, and liked by the community. A disadvantage is that even with the choice narrowed down to just two variants, there is still a choice to make.

Xoroshiro with multiply and truncation outperforms PCG-XSH-RR 64/32 (LCG) by 7~13%, and for u64 on x86_64 by a factor 3.3. And in a fairer comparison to PCG-XSL-RR 128/64 (MCG) by 35%. A disadvantage is that I may yet want to search for a better multiplier.

PCG also has a feature XoroshiroMT can not have: relatively fast seeking through the period.

The choice between XoroshiroMT and PCG is difficult. And I don't want to suggest XoroshiroMT because I had a hand in it. But because their quality is similar, XoroshiroMT is always faster, and for PCG you have to pick between two variants for the best choice, XoroshiroMT seems the best default.

If it was up to me I would include all three RNG's discussed here (5 variants) in rand or an external crate. They are all good, and have their own advantages and disadvantages.

@dhardy, what do you think?

dhardy commented 6 years ago

@pitdicker I think you've done some relatively thorough and quite careful research here; it would seem good material for a publication (or even two: analysis of existing generators and the design of Xoroshiro MT). Nevertheless PRNGs are a tricky subject and it may be worth getting in contact with others who've been working in the field.

On choosing one as a default, in a way it doesn't matter too much since we reserve the right to change it (though should probably require at least bumping the minor version number in this case). So far Xoroshiro MT does sound very good all round; I'm not able to say any more than you are about its quality. If unsure we can start with a conservative choice. It may be however that a companion crate with a selection of PRNGs is available before long anyway, making it easy for users to choose themselves.

It would also seem worth assembling a set of generators in an external crate (assuming licences are compatible) along with benchmark code, and then ask people to benchmark on several different platforms. @pitdicker would you be happy to start such a crate?

pitdicker commented 6 years ago

@pitdicker would you be happy to start such a crate?

I have :smile: https://github.com/pitdicker/small-rngs

pitdicker commented 6 years ago

There is something wrong with benchmarks on the latest Rust nightlies if you want to run the benchmarks. It adds a couple of extra branches for some reason, halving the results. I will have to open a bug report.

vks commented 6 years ago

@pitdicker I'm a bit worried about some claims about xoroshiro (orginating from O'Neill) that are repeated here. I had a private discussion with Vigna about them and agree with him that they are questionable. In the interest of an evenhanded discussion, I tried to represent this point of view.

The only statistical weakness of xoroshiro128+ I consider established are the linear dependencies in the lowest two bits. Are you referring to something else when talking about xoroshiro's "lower statistical quality"? I don't think the linear dependencies are relevant in practice, since the commonly used RNGs have them in every bit. Your truncation gets rid of them at the cost of some performance. I think this is fair, if you really want to get rid of them for some reason. You are choosing xoroshiro* over the faster xoroshiro+ presumably based on the claims generalizing low to high dimensions, which I think is questionable.

Personally, I draw the line for statistical quality at "not worse than the Mersenne Twister", because I'm not aware of real-life usecases where its quality was not good enough. (In theory, generating binary vectors might be a problem, but I don't know of any real-world example.) The secondary criterion for me is performance. This is why I favor xoroshiro128* over the others.

On Thu, Dec 7, 2017, 22:06 Paul Dicker notifications@github.com wrote:

@vks https://github.com/vks What is it in Xoroshiro+ that makes you want to defend it so much? There is a very large body of work around RNG's, and Xoroshiro+ is certainly not the only one. Have you compared it objectively against others like I am trying to do here?

I like it for it's speed, and Xoroshiro (without the +) can serve as a good and somewhat novel base algorithm.

The PCG paper expressed nicely that a lot of this stuff is just engineering trade-offs between quality and performance. The last couple of decennia there has been a huge push for better quality, and Xoroshiro+ goes a bit against the flow here and aims for performance above quality. Note Vigna has published a paper about the much better Xorshift* first.

Why would we not choose something different if the performance is somewhat comparable, and the quality is better? And if quality is not an issue, where do you draw the line, why not use an even faster RNG?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dhardy/rand/issues/60#issuecomment-350094269, or mute the thread https://github.com/notifications/unsubscribe-auth/AACCtOAO3jf9NqFQOPfpiqsu4ttynczvks5s-FNOgaJpZM4Qpoy4 .

pitdicker commented 6 years ago

The problem is that I agree with the claims :smile:. But let's not discuss it again.

In https://github.com/dhardy/rand/issues/58 we are discussing to make a new rand-rngs crate to collect good implementations of good RNG's. I would like to see Xoroshiro128+ in it.

I think four categories would be good, with an example in https://github.com/dhardy/rand/issues/58#issuecomment-354435536. What I named low(er) quality but very fast or very small should be worded a little better. Basically the idea is the RNG may be excellent for many uses, but we should spell them out together with a little background. Xoroshiro128+ and Mersenne Twister can for example be fine for generating floating point numbers, and I am sure you can name more situations they are excellent for. I just would not recommend it as default, as, well, who knows how the standard PRNG from rand is going to be misused...

If I and/or @dhardy extract the PRNG's from rand, and I add some from small-rngs, would you work together on adding Xoroshiro+ and maybe SplitMix?

vks commented 6 years ago

There has been a new PRNG (xoshiro) by Vigna and Blackman. They suggest xoshiro256** for all purposes. It gets rid of the linear dependencies of xoroshiro and has a period suitable for all parallel applications. For floating point numbers, they suggest xoshiro256+, which is faster but has linear dependencies in the 3 lowest bits, which can be ignored when generating floating-point numbers.

The also suggest analogous 32-bit generators.

dhardy commented 6 years ago

We should also consider large MCGs and LCGs, at least on 64-bit platforms: http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html

vigna commented 4 years ago

Wow. That's a long and thorough discussion (just got a reference to this thread from the current interaction on the library).

Let me just make some easy points: