Better default RNG in the future?

PallHaraldsson commented 6 years ago

EDIT: "August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG." I previously quoted: ~~"The clear winner is xoroshiro128+"~~

https://nullprogram.com/blog/2017/09/21/

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. ..

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

StefanKarpinski commented 6 years ago

Too late for this release but the default RNG stream is explicitly documented to not be stable so we can change it in 1.x versions. Another contender is the PCG family of RNGs.

PallHaraldsson commented 6 years ago

Ok, I thought last change (people assume stable). It's 7 lines of C code.

PallHaraldsson commented 6 years ago

I'm on smartphone, I at least can't locate "(un)stable" at

https://docs.julialang.org/en/latest/stdlib/Random/

PallHaraldsson commented 6 years ago

MT, Julia's current default, is "between 1/4 and 1/5 the speed of xoroshiro128+." and MT has "one statistical failure (dab_filltree2)."

"PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software."

http://xoshiro.di.unimi.it

64-bit Generators xoshiro256** (XOR/shift/rotate) is our all-purpose, rock-solid generator (not a cryptographically secure generator, though, like all PRNGs in these pages). It has excellent (sub-ns) speed, a state space (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of.

If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits).

If you are tight on space, xoroshiro128** (XOR/rotate/shift/rotate) and xoroshiro128+ have the same speed and use half of the space; the same comments apply. They are suitable only for low-scale parallel applications; moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 8 TB of output in our test. We believe this slight bias cannot affect any application.

Finally, if for any reason (which reason?) you need more state, we provide in the same vein xoshiro512 / xoshiro512+ and xoroshiro1024 / xoroshiro1024* (see the paper).

andreasnoack commented 6 years ago

It looks like the timings are based on his own implementation of MT. The one we use is probably quite a bit faster.

Update: The Dieharder test suite is also considered outdated

PallHaraldsson commented 6 years ago

Also intriguing-just use AES:

Speed The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns.

JeffreySarnoff commented 6 years ago

Extensive testing by others (Daniel Lemire, John D. Cook) favors PCG and notes a few problems with the xor__s. One or the other or another -- it will require Julia-centric testing and benchmarking.

Nosferican commented 6 years ago

Random is a stdlib so it could have breaking changes and a new release before the next release of the language, no?

StefanKarpinski commented 6 years ago

Yes, it could have breaking changes before Julia 2.0.

staticfloat commented 6 years ago

Just for fun, I coded up a prototype implementation:

# A quick test to test a new RNG for Julia
using BenchmarkTools

function xorshift1024(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    for i = 1:N
        # One-based indexing strikes again
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # Make sure you use xor() instead of ^
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # I love magic constants
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end

function xoroshiro128plus(N, s)
    x = Array{UInt64}(undef, N)
    @inbounds for i = 1:N
        # copy state to temp variables
        s1, s2 = (s[1], s[2])

        # Calculate result immediately
        x[i] = s1 + s2

        # Mash up state
        s2 = xor(s2, s1)
        s1 = xor(((s1 << 55) | (s1 >> 9)), s2, (s2 << 14))
        s2 = (s2 << 36) | (s2 >> 28)

        # Save new state
        s[1], s[2] = (s1, s2)
    end
    return x
end

# Collect some performance data
for N in round.(Int64, 2 .^ range(12, stop=20, length=3))
    println("Testing generation of $N 64-bit numbers")
    local s = rand(UInt64, 16)

    # compare xorshift against rand()
    @btime xorshift1024($N, $s)
    @btime xoroshiro128plus($N, $s)
    @btime rand(UInt64, $N)
end

This gives surprisingly good timings, considering MT has been optimized for many a year:

Testing generation of 4096 64-bit numbers
  13.889 μs (2 allocations: 32.08 KiB)
  8.136 μs (2 allocations: 32.08 KiB)
  6.733 μs (2 allocations: 32.08 KiB)
Testing generation of 65536 64-bit numbers
  219.095 μs (2 allocations: 512.08 KiB)
  130.786 μs (2 allocations: 512.08 KiB)
  100.082 μs (2 allocations: 512.08 KiB)
Testing generation of 1048576 64-bit numbers
  4.658 ms (2 allocations: 8.00 MiB)
  3.245 ms (2 allocations: 8.00 MiB)
  3.118 ms (2 allocations: 8.00 MiB)

I note that @inbounds actually makes xorshift1024 slower, which is why that's not included in the code above.

StefanKarpinski commented 6 years ago

I don't think it's all that surprising—MT is pretty complex. One of the huge advantages of xoroshiro and PCG is that they're much simpler and easier to generate really complex code for.

JeffreySarnoff commented 6 years ago

they put the suit in pseutdorandom

Godisemo commented 6 years ago

The package RandomNumbers.jl implements many different RNGs, including xoroshiro128+ which is quite a bit faster than the standard RNG from base according to the benchmarks in the documentation.

PallHaraldsson commented 6 years ago

Is Float rand more important than Int rand? It seems it may matter:

@JeffreySarnoff thanks for the links; still "favors PCG and notes a few problems with the xor__s." seems not conclusive nor comparing to latest variants.

Java's shiftmix64 is also interesting.

Also Google's https://github.com/google/randen as per AES comment above.

PallHaraldsson commented 6 years ago

https://github.com/sunoru/RandomNumbers.jl/issues/1#issuecomment-233533191

@sunoru "Yes, I have considered accelerating some RNGs with GPU"

I guess AES is not available on GPUs so that may be a factor to NOT choose AES as default on CPUs? I.e. would we want same default on GPU and CPU?

StefanKarpinski commented 6 years ago

In general, it's fine (and often good) to drop some extra output bits. In PCG, IIRC, you split the internal output into external output bits and permutation bits in a way that can be adjusted, raising the possibility that we could use different output steps based on the same state to generate 64 bits of integer output versus 53 bits of floating-point output. Another approach would be to use different RNG streams for different bit sizes, which is appealing for being able to generate random 128-bit values, for example. (A pair of 128-bit streams may not be as good as a single 128-bit stream, for example, unless the two streams have coprime periods which is hard to arrange, so it can be quite helpful to have different sized streams that take advantage of their increased size to produce better streams.)

PallHaraldsson commented 6 years ago

"The clear winner is xoroshiro128+" (seems outdated) @staticfloat thanks for coding it. Actually "xoshiro256**" seems to be their best; note dropped letters, I first thought a typo.

sunoru commented 6 years ago

I'm sorry that I haven't updated that package for long. But maybe this is worth a look: http://sunoru.github.io/RandomNumbers.jl/latest/man/benchmark/

My personal choice is xoroshiro128+. Since some RNGs in PCG don't pass the big crush as its paper says, I don't know if PCG is as reliable as it sounds.

StefanKarpinski commented 6 years ago

PCG not passing big crush is indeed an issue. That seems like a strange disconnect.

staticfloat commented 6 years ago

@sunoru I was just looking at your RandomNumbers.jl page; that is some really nice work you've put together. IMO, discussion about a better RNG should be informed by your work as you've put in the time and effort to actually compare against quite a number of different PRNGs. And yes, it does appear true that xoroshiro128+ is the winner in all aspects so far; however I am not sure what the test for quality is nowadays; I've heard of BigCrush, DieHarder, TestU01, etc..... I'm sure Andreas can answer which is the test to pay attention to these days. :)

JeffreySarnoff commented 6 years ago

The real issue with PCG is that it is not being well-tended. I don't know anything about why it got crushed -- if there were more effort behind that family, a remedy modification may have been made but there is none. So, going with new information: I still like PCG .. to use it for what it does best .. and with the failure, that's not this.

chethega commented 6 years ago

If AES-CTR turns out to be fast enough on most architectures, then I think this would be absolutely fantastic as the default RNG.

This eradicates an entire class of security bugs and relieves people from the mental burden of considering the quality of the random stream and how to use it. And people who need something faster can still switch this out.

The only big isssue is architectures without hardware support for AES (then it would almost surely be too slow). So the bitter pill to swallow is that the random stream would become architecture dependent (e.g. older raspberry pi don't have aes acceleration and should fall back on something else).

ViralBShah commented 6 years ago

I would be ok with having a different one on different architectures.

chethega commented 6 years ago

After looking at http://www.cs.tufts.edu/comp/150FP/archive/john-salmon/parallel-random.pdf, I am more convinced that AES would be a really good default on supporting archs: Fast enough (faster than MT on a lot of CPUs, and does not eat as much L1), practically perfect quality, and no possibility of people abusing rand(UInt64) to generate session cookies and getting pwned for it.

PallHaraldsson commented 5 years ago

@ViralBShah (on xoroshiro128** or you mean on their older RNG?) "I thought those RNG don't pass the RNG testsuite." I googled, and assume you mean:

https://github.com/andreasnoackjensen/RNGTest.jl/ that I found here: https://github.com/JuliaLang/julia/issues/1391

I think the discussion should continue here, not the startup time issue, while I'm not sure the default RNG needs to pass all tests (I'm not sure it does NOT do that), if you can easily opt into a better RNG.

At http://xoshiro.di.unimi.it/

there's both e.g. xoroshiro128** and xoroshiro128+ to look into.

and I only see a minor flaw on the latter:

Quality

This is probably the more elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007). [..] Beside BigCrush, we analyzed our generators using a test for Hamming-weight dependencies described in our paper. As we already remarked, our only generator failing the test (but only after 5 TB of output) is xoroshiro128+. [..]

jdchn commented 5 years ago

An architecture-dependent default PRNG might be a little awkward since one of the primary benefits of PRNG is reproducibility.

StefanKarpinski commented 5 years ago

An architecture-dependent default PRNG might be a little awkward since one of the primary benefits of PRNG is reproducibility.

Yes, but if you want reproducibility, you should be using an explicit RNG object instead of the default.

ViralBShah commented 5 years ago

It would be nice if @rfourquet can chime in here.

rfourquet commented 5 years ago

It would be nice if @rfourquet can chime in here.

I have not much knowledge about RNGs, but I will be happy to review and support an initiative changing the default RNG.

One possibly relevant question for a new RNG: can it generate separate streams (for different threads) that we are confident are sufficiently independant. For MT, the idea was to use the "jump" functionality (as the period is so huge that it can be easily shared among different sub-streams), and there was an issue showing that we didn't really know how independant are streams generated with given seeds (e.g. in 1:n). IIRC, for example, it's quite easy to generate independent streams with PCG.

I just have a couple other comments:

and no possibility of people abusing rand(UInt64) to generate session cookies and getting pwned for it.

The problem is that if it's architecture-dependant, some people will naively rely on the default RNG's security guaranties, while these will not hold with code deployed on a different architecture.

Yes, but if you want reproducibility, you should be using an explicit RNG object instead of the default.

I don't totally disagree, but I believe it would be a little nicer to keep reproducibility cross-platform if there are no high costs for that. An example where it could matter is the Primes module: random numbers are involved in factorization methods, but the numbers themselves are not relevant for the user, hence are not exposed to her and there is no way (currently) to specify an RNG. But a user may report a bug for a specific seed, and it would be unfortunate for the maintainers to not be able to reproduce it easily.

chethega commented 5 years ago

The problem is that if it's architecture-dependant, some people will naively rely on the default RNG's security guaranties, while these will not hold with code deployed on a different architecture.

True, if we take a cryptographically secure PRNG then we should have it on all platforms. I'd guess aes-ctr when available, and e.g. chacha20 otherwise. Chacha20 has a reduced-round variant for very weak CPUs, and we might consider following the random123 paper in taking a reduced-round simplified key-schedule aes variant (I'm personally against, but if people complain a lot about speed, well, still more secure and simpler than MT).

Regardless, it is probably better to provide both secure and insecure PRNG, even if both coincide (such that breakage is reduced in case of future changes).

See eg here for an example using aes intrinsics on x86, AArch64 and power8. Are there any other supported archs than ARMv7 and ancient x86 that lack aes acceleration?

Since the entire internal state fits into 3 cache lines (aes, round keys + counter) or one cache line (chacha20), we can simply give each thread an independent PRNG. For reproducibility, we could simply use something like xor(seed, thread_id()) or encrypt(key=seed, block=thread_id()) as aes keys for explicit Random.seed!(seed::Integer), and pull 16 bytes per thread from /dev/random on startup / Random.seed!().

For birthday-related reasons, we need to occasionally rekey. Most protocols conservatively rekey after 2^(L/4) blocks (L=128 for aes). I would be OK with never rekeying by default (simpler code means less opportunity for bugs with respect to reproducibility; after 2^48 blocks we get 1e-10 collision chance for true random stream, 0 collision chance for our flawed approximation of a random stream).

PallHaraldsson commented 5 years ago

TL;DR Despite what I said earlier on AES, Idon't think we should use it (nor other [hardware] cryptographically secure PRNG), but rather xoshiro256** (possibly only/rather the slightly faster xoshiro256+ is needed if rand for Float64 is most important).

AES/Randen (with hardware support; or ChaCha8 without or PCG) seems simply slower, not coming close to 0.36 cycles per byte that you can get, see the table at xoshiro.di.unimi.it

Also if we go with cryptographically secure PRNG with or without using special hardware, we can't really go back when we've made that "guarantee". I believe it's better to opt into that for those few(?) that need it. We can always reconsider changing again later, if it's actually faster.

If we do later consider cryptographically secure, it seems Google's randen based on AES should be at the top of the list:

https://github.com/google/randen

What if we could default to attack-resistant random generators without excessive CPU cost? We introduce 'Randen', a new generator with security guarantees; it outperforms MT19937, pcg64_c32, Philox, ISAAC and ChaCha8 in real-world benchmarks. This is made possible by AES hardware acceleration and a large Feistel permutation.

Related work

AES-CTR (encrypting a counter) is a well-known and easy to implement generator. It has two known weaknesses:

See also tables there on cycles per byte.

The problem is that if it's architecture-dependant, some people will naively rely on the default RNG's security guaranties, while these will not hold with code deployed on a different architecture.
True, if we take a cryptographically secure PRNG then we should have it on all platforms. I'd guess aes-ctr when available, and e.g. chacha20 otherwise. Chacha20 has a reduced-round variant for very weak CPUs, and we might consider following the random123 paper

I assume Chacha8 is is the reduced-round variant. It's just much slower than Randen (and Chacha20 5 times slower additionally), while

Having it on all platforms would mean very slow for all the cryptically secure I know of when there's no hardware support.

I repeat the interesting links (and some additional):

https://nullprogram.com/blog/2017/09/21/

August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG. [..] My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.

xoshiro.di.unimi.it

xoshiro256+ is ≈20% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), has a much smaller footprint, and does not fail any test.

See however http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/ is you want to consider that MT variant.

xoshiro.di.unimi.it/splitmix64.c

/* This is a fixed-increment version of Java 8's SplittableRandom generator See http://dx.doi.org/10.1145/2714064.2660195 and http://docs.oracle.com/javase/8/docs/api/java/util/SplittableRandom.html

It is a very fast generator passing BigCrush, and it can be useful if for some reason you absolutely want 64 bits of state; otherwise, we rather suggest to use a xoroshiro128+ (for moderately parallel computations) or xorshift1024 (for massively parallel computations) generator. /

https://springframework.guru/random-number-generation-in-java/

The java.util.Random class implements what is generally called a linear congruential generator (LCG). It is designed to be fast but does not meet requirements for real-time use

See also: https://en.wikipedia.org/wiki/Intel_Cascade_Cipher (based on AES)

https://github.com/sunoru/RandomNumbers.jl/issues/52 (related one is implemented there, and if it matters also Random123).

PallHaraldsson commented 5 years ago

Thanks to @sunoru he has implemented my preferred choice[s]: https://github.com/sunoru/RandomNumbers.jl/issues/52#issuecomment-460133618

xoshiro256** and xoshiro256+, as well as the other recommended PRNGs in their page (see c91f948)

I also noticed at:

https://en.wikipedia.org/wiki/Xorshift#Variations

xorwow [..] This performs well, but fails a few tests in BigCrush.[5] This generator is the default in Nvidia's CUDA toolkit.[6]

I'm not bringing it up because I think it should be our default. I just hadn't thought of GPU use (nor did I know of a GPU default). If we want to generate random numbers on GPUs the code better be short, and not based on AES (I assume they don't have nor any other instructions relating to [real] random).

I can maybe see a point in implementing xorwow if we need to have the same RNG on CPU and GPU. I suppose we want that but could then substitute our default RNG for Nvidia's?!

I took a quick look at what other (well one) interesting language, Rust, have chosen to implement.

https://crates.io/crates/rand

some of the dependencies are interesting, e.g. https://crates.io/crates/rand_jitter and the one implementing HC-128 (was new to me, fastest non-hardware secure one?):

http://www.ecrypt.eu.org/stream/p3ciphers/hc/hc128_p3.pdf

Statement 8. Encryption speed is 3.05 cycles/byte on Pentium M processor. [..] Stream cipher HC-128 is the simplified version of HC-256 [15] for 128-bit security. HC-128 is a simple, secure, software-efficient cipher and it is freely-available. [..]

Our analysis shows that recovering the key of HC-128 is as difficult as exhaustive key search

simonbyrne commented 5 years ago

NumPy has adopted PCG64 (https://github.com/numpy/numpy/issues/13635)

PallHaraldsson commented 4 years ago

TensorFlow chose these, or strictly TF for Swift language: https://github.com/tensorflow/swift/blob/master/RELEASES.md (link to both from that link, saying e.g. "10-round Philox4x32 PRNG" and "20-round Threefry2x32 PRNG"): "Philox and Threefry random number generators and generic random distributions."

What to make of it, I guess they liked the speed:

www.thesalmons.org/john/random123/papers/random123sc11.pdf "Threefry is the fastest Crush-resistant PRNG, requiring only common bitwise operators and in-teger addition. Threefry can be expected to perform well across a wide range of instruction set architectures."

They are interesting as of this type: https://en.wikipedia.org/wiki/Counter-based_random_number_generator_(CBRNG)

[also described there is ARS-5.]

The performance claim above may be outdated as SplitMix is newer, and so is Permuted Congruential Generator (PCG) also and Xoroshiro128+ and related even newer.

According to https://github.com/numpy/numpy/issues/13635#issuecomment-506614353 PCG64, NumPy chose, is almost twice as fast as Philox, while SFC64 is a bit faster, and I do not see them mention Threefry.

chethega commented 4 years ago

Let's not have the perfect be the enemy of the good. I agree that xoshiro variants are preferable to mersenne, in all circumstances.

With regards to cryptographically secure default RNG, I am totally fine with google's cool new alg. I think Random stdlib should export and initialize both a fast and a secure RNG. Exporting a secure and "relatively fast" RNG is important in order to not give people the excuse of minimizing dependencies.

The only question on which we differ is what should be the default RNG: Secure or fast. On most hardware platforms, secure is fast enough for code that is not super random-heavy. The only exception is old ARMv7 that lack acceleration and would need a fallback.

I am arguing that users who don't explicitly specify an RNG should get a secure one, instead of a fast one. Users who know what they care about can choose an explicit RNG. The only people who would be really hurt by this are people who do semi-heavy randomized compute on ARMv7 (not heavy enough to choose an explicit RNG, but heavy enough for random speed to matter). I am arguing that it is OK to throw these users under the bus (they may be an empty set, anyway).

On the other side, I am arguing that security-unsophisticated users who mistakenly use default-rand in security relevant contexts should be saved from themselves.

Case in point: Literally the first thing I looked at, genie.jl session key generation uses an insecure rand() call. There probably are mitigating factors (Genie.SECRET_TOKEN, hash function application might make sequence prediction impractical). But the fact that I need to even think about mitigating factors and scenarios where the config file is leaked is ridiculous.

JeffreySarnoff commented 4 years ago

If rand() becomes crypto, we should introduce fastrand() imo.

chethega commented 4 years ago

We already have a wonderful system for choosing random number generators. This is rand(rng::Random.AbstractRNG, args...), with a default rand(args...) = rand(Random.GLOBAL_RNG, args...). The words "default RNG" refer to this global constant in the Random module.

In the rand-becomes-crypto-by-default world I'm advocating for, we would have additional constants Random.FAST, Random.MERSENNE and Random.SECURE that are always initialized; and we would set Random.GLOBAL_RNG = Random.SECURE instead of the current Random.MERSENNE.

In other words, the natural definition would be fastrand(args...) = rand(Random.FAST, args...). As such, I don't see any reason for a dedicated fastrand function to exist, just like a dedicated securerand function makes imo no sense today.

JeffreySarnoff commented 4 years ago

I did not know that. Nevermind.

On Sun, Aug 18, 2019 at 1:34 PM chethega notifications@github.com wrote:

We already have a wonderful system for choosing random number generators. This is rand(rng::Random.AbstractRNG, args...), with a default rand(args...) = rand(Random.GLOBAL_RNG). The words "default RNG" refer to this global constant in the Random module.

In the rand-becomes-crypto-by-default world I'm advocating for, we would have additional constants Random.FAST, Random.MERSENNE and Random.SECURE that are always initialized; and we would set Random.GLOBAL_RNG = Random.SECURE instead of the current Random.MERSENNE.

In other words, the natural definition would be fastrand(args...) = rand(Random.FAST, args...). As such, I don't see any reason for a dedicated fastrand function to exist, just like a dedicated securerand function makes imo no sense today.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaLang/julia/issues/27614?email_source=notifications&email_token=AAM2VRQWJLP2QBAEHWQXWHDQFGB4FA5CNFSM4FFKFZNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4RERFI#issuecomment-522340501, or mute the thread https://github.com/notifications/unsubscribe-auth/AAM2VRRZ75HS73RDCNRXIALQFGB4FANCNFSM4FFKFZNA .

vigna commented 4 years ago

Sorry to jump in the conversation, but there's some talk about stuff I did. So my 2€¢ on the topic are:

Please benchmark xoshiro256++, too. In some platform/compiler is faster (no multiplications).
I wrote a detailed report of why we should stop using the Mersenne Twister—in case you're interested, you can find it on arXiv (https://arxiv.org/abs/1910.06437).
"xoroshiro128+ fails PractRand very badly", in all its vagueness, doesn't really seem to be true: there's an obvious and well known failure in binary rank, which involves the lower bits, and won't happen if you don't use, say, the lowest 8 bits or so; and there's Hamming-weight dependencies, which happen after 5TB in our test and after >32TB in PractRand, and frankly for that speed I think it's OK—the Mersenne Twister at 607 bits fails at 800MB, for example. You can use xoroshiro128++ or xoroshiro128** if you don't want those, losing ≈0.20ns (in artificial benchmarks), which seems fine.

ViralBShah commented 4 years ago

@vigna If up to you, what do you suggest as the replacement for MersenneTwister?

rfourquet commented 4 years ago

I just implemented xoshiro256++ in pure Julia, it's dead simple.

For scalar generation, compared to MersenneTwister, it's 40% faster for UInt64 generation, and 10% faster for Float64 generation (using the same technique as MT: generating in [1, 2) and substracting 1.0). It looks like a no-brainer... but:
for array generations: for filling up a vector of 10000 UInt64, it's 5% faster, but 40% slower for Float64. But this is using the default array versions, the performance may be improvable with a dedicated array version (but maybe not enough to be on par with MT).

Is one of xoshiro256++ and xoshiro256** more recommended?

vigna commented 4 years ago

This is a separate concern, but did you consider the fact that using the [1..2) technique you get only 52 instead of 53 significant bits ? That's a tradeoff that probably the user should be aware of. Also, in my tests on recent hardware I could measure no real difference (provided you precompute 1 / 2^53 and use exactly 53 bits).

BTW, for floats xoshiro256+ is perfectly fine (using the upper 53 bits).

I'm a bit surprised by your second benchmark—in principle, the conversion to float should have the same cost for everybody, so the speed comparison in an integer test should give approximately the same results of the comparison on doubles. I wouldn't mind having a look at the code if that's fine for you.

Presently I'm in favor of xoshiro256++.

rfourquet commented 4 years ago

This is a separate concern, but did you consider the fact that using the [1..2) technique you get only 52 instead of 53 significant bits

Indeed, but this was to benchmark equivalent things. There has been quite a number of discussions in Julia's dicourse / github to change that.

BTW, for floats xoshiro256+ is perfectly fine (using the upper 53 bits).

Yes, I would personally not mind including different variants, as they are so simple to implement.

in principle, the conversion to float should have the same cost for everybody

In general yes, but in this case there is a native array generation method for MT, i.e. the "kernel" (dSFMT) writes directly into the array, instead of writing into a cache array within MersenneTwister struct, and then fetching one value at a time from this cache into the destination array (this is the implementation detail of MersenneTwister in Julia!).

But there is good news, by writing naïve array generation for xoshiro256++, the array versions becomes 25% faster for UInt64, and about 15% slower for Float64 (down from 40%). This would be enough for me to adopt xoshiro256++ as the default generator in Julia, and even more so as xoshiro+ might give even better results for Float64.

BTW, thanks for having put you C code for the xoshiro family in the public domain, that makes trying it out very easy!

vigna commented 4 years ago

Ahhh OK you didn't tell me you were using the dSFMT. Yes, I acknowledge that in my shootout page:

xoshiro256+ is ≈20% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), has a much smaller footprint, and does not fail any test.

If you're using extended instructions and you don't mind being able to generate just half of the possible numbers, that's OK. But remember that the dSFMT fails linearity tests, and that bias can filter through computations (see my note above) in very unexpected ways.

rfourquet commented 4 years ago

Ahhh OK you didn't tell me you were using the dSFMT.

Ah sorry! I'm used to just writing MersenneTwister in this github to mean the Julia's builtin version, which uses dSFMT.

So yes we are aware that it's not ideal to use the technique used be dSFMT to produce Float64. And regularly a user on discourse will point at this fact and be surprised about it. I don't think it's a big problem for most purposes, but better to improve that if possible.

So I tested xoshiro+, and for the array version, it becomes only about 7% slower than MT (again, with using the same technique, with half-range).

I personally would prefer the "full-range" float generation at the cost of a bit of performance, but I don't know if this is a shared feeling.

rfourquet commented 4 years ago

Anyway, I vote for making xoshiro256++ the new Julia default RNG :)

And by the way it's really nice to have you chime in @vigna !

rfourquet commented 4 years ago

Plus, I forgot to mention earlier, the code to implement the jump functionality is also available, and is quite short (should be about 20 lines of code), when you compare with the equivalent for MT (which is about 150 lines of code, including simple implementation of polynomials over GF(2) !)

ViralBShah commented 4 years ago

Does our RNG policy allows us to replace the default generator? I think so, but just checking. If so, this sounds like a good idea.

vigna commented 4 years ago

To be true, I believe the Mersenne Twister comes with a generic jump implementation that makes it possible to jump ahead any number of steps: for that, you need computation of polinomials over Z/2Z modulo the characteristic polynomial, which needs code.

For speed and convenience, I provide a method for long jumps and one for large jumps. The idea is that you use large jumps to get different streams for multiple computational units, but then you can short jump and get a large number of streams in each unit (e.g., threads). Polynomials are precomputed and stored as bit arrays, but the technique is the same.

vigna commented 4 years ago

So I tested xoshiro+, and for the array version, it becomes only about 7% slower than MT (again, with using the same technique, with half-range).

It would be interesting to see whether interleaving carefully computation and conversion can help there. Something like, you first fill the array and then you do a pass to convert to double (but you'll need C-style unions to do that). Or, even, having a temporary variable for the last result and at each iteration convert to double the previous result and store it while you compute the next state, and then store it in the temporary variable. Breaking dependency chains can have very surprising results sometimes.

JuliaLang / julia

Better default RNG in the future? #27614