JuliaLang/julia

Better default RNG in the future? #27614

Closed: PallHaraldsson closed this issue 3 years ago

PallHaraldsson commented 6 years ago

EDIT: "August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG." I previously quoted: "The clear winner is xoroshiro128+"

https://nullprogram.com/blog/2017/09/21/

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. ..

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

rfourquet commented 4 years ago

To tell the truth, I believe the Mersenne Twister comes with a generic jump implementation that makes it possible to jump ahead any number of steps

Ah right, I forgot about that.

It would be interesting to see whether carefully interleaving computation and conversion can help there.

Actually, there is no cost in Julia to convert a UInt64 to Float64. I still tried your suggestion to first do the bit-twiddling work for the whole array in a loop, and then subtract 1.0 from all positions, but the performance is quite a bit worse. I did also try a bit to see how to "break dependency chains", but without success in this case: basically, at each iteration, the bit-twiddling and one store into the array must happen.

But surprisingly for me, it's not slower to use the "full-range" generation, i.e., given a random UInt64 value u, using (u >> 11) * 2.0^-53 instead of reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0.
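For concreteness, a minimal sketch of the two conversions being compared (both map a random UInt64 to a Float64 in [0, 1)):

u = rand(UInt64)

# "Full-range" variant: keep the top 53 bits and scale down.
f1 = (u >> 11) * 2.0^-53

# Bit-twiddling variant: splice 52 random mantissa bits under the exponent
# of 1.0 to get a float in [1, 2), then subtract 1.0.
f2 = reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0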

For me, it's totally fine as is and potential improvements could be addressed later, but let's see what other people think.

rfourquet commented 4 years ago

Does our RNG policy allow us to replace the default generator?

I would believe so. We explicitly documented that the generated streams of numbers can change, and we say nothing about the stability of the type of the default RNG. The only semi-official thing was to use Random.GLOBAL_RNG when implementing new functions that need to take an optional RNG, but even that got its type changed in 1.3 (it's now of type Random._GLOBAL_RNG instead of MersenneTwister).
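The semi-official pattern referred to, as a minimal sketch (my example, not from the docs):

using Random

# A user-defined function taking an optional RNG, defaulting to the global one:
myrand(rng::AbstractRNG=Random.GLOBAL_RNG) = rand(rng, Float64)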

vigna commented 4 years ago

But surprisingly for me, it's not slower to use the "full-range" generation, i.e., given a random UInt64 value u, using (u >> 11) * 2.0^-53 instead of reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0.

Yes, that's what I was telling you. I think FPUs have a very, very fast execution path if you just have the significand bits and shift them in. It might be slower, for example, if you don't do the >> 11 shift and instead multiply by 2.0^-64, because I think in that case there's some rounding going on.
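A quick illustration of that rounding effect (my example): 2^53 - 1 is exactly representable as a Float64, but 2^64 - 1 is not, so scaling all 64 bits can round up to exactly 1.0:

(typemax(UInt64) >> 11) * 2.0^-53  # == 0.9999999999999999, exact and < 1.0
typemax(UInt64) * 2.0^-64          # == 1.0: the Float64 conversion rounds up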

StefanKarpinski commented 4 years ago

Actually, there is no cost in Julia to convert a UInt64 to Float64.

This is kind of true: the reinterpret itself is a no-op, but in order to do most integer operations on a floating-point value in hardware, it's necessary to move the value from a floating-point register to an integer register (and back, if you want to keep using it as floating-point), which takes a cycle in each direction, so two cycles for the round trip. So it's not free in practice.

chethega commented 4 years ago

I don't get how you people found the speeds of xoshiro256++ and dSFMT even comparable. With proper SIMD/AVX2, xoshiro256++ is almost 4x faster than dSFMT for integer bulk generation on my setup. No way in hell is the int->float conversion going to turn that into a loss.

On my machine I get down to ~0.14 cpb:

julia dsfmt 1024 x UInt64
  2.089 μs (0 allocations: 0 bytes)
ref impl, 1 x interleaved, 1024 x UInt64
  1.579 μs (0 allocations: 0 bytes)
ref impl, 4 x interleaved, 1024 x UInt64
  2.150 μs (0 allocations: 0 bytes)
julia 1 x interleaved 1 x SIMD, 1024 x UInt64
  4.336 μs (0 allocations: 0 bytes)
  1.838 μs (0 allocations: 0 bytes)
julia 1 x interleaved 2 x SIMD, 1024 x UInt64
  2.709 μs (0 allocations: 0 bytes)
  1.384 μs (0 allocations: 0 bytes)
julia 1 x interleaved 4 x SIMD, 1024 x UInt64
  1.487 μs (0 allocations: 0 bytes)
  709.241 ns (0 allocations: 0 bytes)
julia 1 x interleaved 8 x SIMD, 1024 x UInt64
  927.750 ns (0 allocations: 0 bytes)
  604.148 ns (0 allocations: 0 bytes)
julia 1 x interleaved 16 x SIMD, 1024 x UInt64

with beautiful code (almost no branches, everything 256 bits wide; if I had an AVX512 machine, this would be even better).

Cf. https://gist.github.com/chethega/344a8fe464a4c1cade20ae101c49360a

Can anyone link the xoshiro code you benchmarked? Maybe I just suck at C?

vigna commented 4 years ago

I don't get how you people found the speeds of xoshiro256++ and dSFMT even comparable. With proper SIMD/AVX2, xoshiro256++ is almost 4x faster than dSFMT for integer bulk generation on my setup. No way in hell is the int->float conversion going to turn that into a loss.

Wow, now you're making me shiver. 0.14 cycles/B? Was the vectorization performed by the compiler, or did you handcraft it (sorry, I'm totally not proficient with Julia)?

chethega commented 4 years ago

I'm running an Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz, i.e. a pretty anemic Broadwell laptop.

The vectorization was done by hand. The code is in the gist, but it's small enough to post here:

using SIMD, BenchmarkTools
mutable struct xstate{N}
    s0::Vec{N, UInt64}
    s1::Vec{N, UInt64}
    s2::Vec{N, UInt64}
    s3::Vec{N, UInt64}
end
# Construct an uninitialized instance of T (no inner constructor needed).
macro new(T)
    return Expr(:new, esc(T))
end

# Seed the raw state by copying random bytes over the whole struct.
function init(xs::xstate)
    r = rand(UInt8, sizeof(xs))
    GC.@preserve xs r ccall(:memcpy, Cvoid, (Ptr{Nothing}, Ptr{Nothing}, Csize_t),
                            pointer_from_objref(xs), pointer(r), sizeof(xs))
    nothing
end

@inline rotl(a, k) = (a<<k) | (a>>(64-k))

# One xoshiro256++ step across the interleaved states: returns the output
# vector and the advanced state, without storing anything back.
@inline function next_nostore((s0, s1, s2, s3))
    res = rotl(s0 + s3, 23) + s0
    t = s1 << 17
    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return (res, (s0, s1, s2, s3))
end

# Fill dst with num UInt64s, advancing N interleaved generators per iteration.
function XfillSimd_nostore(dst, xs::xstate{N}, num) where N
    lane = VecRange{N}(0)
    s = xs.s0, xs.s1, xs.s2, xs.s3
    @inbounds for i=1:N:num
        dst[lane+i], s = next_nostore(s)
    end
    xs.s0, xs.s1, xs.s2, xs.s3 = s
    nothing
end
N=1024; dst=zeros(UInt64, N);
xs = @new xstate{8}; init(xs)
@btime XfillSimd_nostore($dst, $xs, $N)

Note that this uses 8 independent RNGs for bulk generation: advance 8 states, generate 8 outputs, write 8 outputs to destination, repeat.

This is no issue, it just needs some extra seeding logic (seed in bulk; a sketch follows below). In the Julia code, this looks like one RNG whose state consists of four <8 x UInt64> vectors.
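A hypothetical bulk-seeding sketch (my code, not from the gist), expanding a single seed into the 32 state words of an xstate{8} via SplitMix64, which is what the xoshiro authors recommend for seeding:

# SplitMix64 finalizer: mixes one 64-bit word into a well-distributed output.
function mix64(z::UInt64)
    z = (z ⊻ (z >> 30)) * 0xbf58476d1ce4e5b9
    z = (z ⊻ (z >> 27)) * 0x94d049bb133111eb
    z ⊻ (z >> 31)
end

# i-th SplitMix64 output for a given seed; the increment is SplitMix64's
# golden-ratio constant, and UInt64 arithmetic wraps as intended.
splitmix64(seed::UInt64, i) = mix64(seed + UInt64(i) * 0x9e3779b97f4a7c15)

# Fill all four <8 x UInt64> state vectors from one seed.
function seed!(xs::xstate{8}, seed::UInt64)
    w(k) = Vec{8,UInt64}(ntuple(j -> splitmix64(seed, 8k + j), Val(8)))
    xs.s0, xs.s1, xs.s2, xs.s3 = w(0), w(1), w(2), w(3)
    nothing
end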

PS. I also implemented the 5 lines of xoshiro256** in the gist. It is much slower, presumably due to the giant latency of the multiplication. The generated inner loop for 4 x xoshiro256++ is a beauty:

L144:
    vpsrlq  $41, %ymm4, %ymm6
    vpsllq  $23, %ymm4, %ymm4
    vpor    %ymm4, %ymm6, %ymm4
    vpaddq  %ymm0, %ymm4, %ymm4
    vmovdqa %ymm5, %ymm0
    movq    (%rdi), %rcx
    vmovdqu %ymm4, -8(%rcx,%rdx,8)
    cmpq    %rdx, %rax
    je  L266
    addq    $4, %rdx
    vpaddq  %ymm0, %ymm2, %ymm4
    vpsllq  $17, %ymm1, %ymm6
    vpxor   %ymm0, %ymm3, %ymm3
    vpxor   %ymm1, %ymm2, %ymm2
    vpxor   %ymm1, %ymm3, %ymm1
    vpxor   %ymm0, %ymm2, %ymm5
    vpxor   %ymm6, %ymm3, %ymm3
    vpsllq  $45, %ymm2, %ymm6
    vpsrlq  $19, %ymm2, %ymm2
    vpor    %ymm6, %ymm2, %ymm2
    jns L144
chethega commented 4 years ago

So, after consideration:

The proposed setup, if we wanted to switch to xoshiro256++, would be to reserve 5 cache lines of RNG state (each thread has its own completely independent generator). One cache line is for sequential generation, laid out like xstate{2}, with an implementation that only advances the first of the two RNGs for 1-8 byte outputs and advances both for 16-byte outputs. This is used in code that calls rand(some_small_bitstype).

Then, there are 4 cache-lines for bulk generations, laid out like xstate{8}.

These 4 cache lines aren't hot for code that doesn't use bulk generation, so nobody cares about the state size.

The sequential generator uses the xstate{2} layout because I have no better idea what to do with the remaining half of the cache line, so we may as well use it to speed up 16-byte-wide generation; and there is no reason to compactify the layout of the primary generator (sequential generation of <= 8-byte numbers), since we can't use wide SIMD loads anyway.

The sequential generator only exists because: (1) sequential generation should only fill in a single cache-line of state, and (2) bulk generation wants to use wide vector loads/stores for the state. Code that uses both sequential and bulk generation will therefore fill 5 instead of 4 lines; doesn't matter, who cares.

The sequential generator should be properly inlined (verify that, and possibly add @inline). Unfortunately, I see no good way of getting the compiler to optimize out the stores to the xstate in tight loops; I assume this is to ensure correct behaviour when a store traps? But we don't deal with segfaults anyway, so maybe there is some magic LLVM annotation that tells LLVM not to care about a correct xstate after trapping? I am not opposed to writing the whole RNG in a giant llvmcall.

There is no reason to implement jump. We only need seed! and output generation.

How do we tell the allocator to give us a 64-byte-aligned mutable struct?

In total, I think this would be a good move, both for speed and to cut out the dependency. One would need to port a variant of my code over to Base Julia (i.e. without SIMD.jl) and verify that the resulting @code_native is OK on all supported archs, and that the 8 x wide SIMD doesn't hurt performance on weaker machines (i.e. check that the compiler doesn't spill registers). Someone with access to AVX512, POWER and various ARM machines should chime in at some point. Worst case, we stick with 4 x wide SIMD and save 2 cache lines; the downside is slightly lower performance with AVX2 and presumably massive performance degradation with AVX512. SIMD widths are increasing all the time, so being future-proof without breaking the random stream would be nice.

On the other hand, if we change the default RNG at all, I would still prefer a cryptographic choice; if that is rejected and if scientific consensus says that xoshiro gives "good enough" streams, then that's good enough for me.

rfourquet commented 4 years ago

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable.

Hey, I just naïvely ported @vigna's code to Julia; I wanted an updated performance comparison relative to last year's, and with xoshiro256++ specifically. It was good enough for me, and I currently don't have the skills to write SIMD-optimized code! :D

What you did is amazing: with your benchmark above, it comes out 2.6 times faster than my array generation routine. If this is what you asked for, here is my code:

using Random

mutable struct Xoropp <: AbstractRNG
    s0::UInt64
    s1::UInt64
    s2::UInt64
    s3::UInt64
end

rotl(x::UInt64, k::UInt) = (x << k) | (x >> ((64 % UInt) - k))

function Base.rand(rng::Xoropp, ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    result = rotl(s0 + s3, UInt(23)) + s0

    t = s1 << 17

    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3

    s2 ⊻= t

    s3 = rotl(s3, UInt(45))

    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3

    result
end

# array generation code: nothing magic, but somehow gets quite faster than the default
# code for array, which does `rand(rng, UInt64)` in a loop
function Random.rand!(rng::Xoropp, A::AbstractArray{UInt64},
                      ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    @inbounds for i = eachindex(A)
        A[i] = rotl(s0 + s3, UInt(23)) + s0

        t = s1 << 17

        s2 ⊻= s0
        s3 ⊻= s1
        s1 ⊻= s2
        s0 ⊻= s3

        s2 ⊻= t

        s3 = rotl(s3, UInt(45))
    end
    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3
    A
end

# for Float64 generation
Random.rng_native_52(::Xoropp) = UInt64
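For reference, a minimal usage sketch of the above (the seed words are arbitrary nonzero constants; an all-zero state is invalid for xoshiro):

rng = Xoropp(0x0123456789abcdef, 0x243f6a8885a308d3,
             0x13198a2e03707344, 0xa4093822299f31d0)

rand(rng, UInt64)               # hits the SamplerType{UInt64} method above
rand(rng, Float64)              # derived by Random from rng_native_52
A = Vector{UInt64}(undef, 1024)
rand!(rng, A)                   # hits the specialized array method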

I can't really comment on your "proposed setup" above, as I don't have many ideas about cache lines and such.

StefanKarpinski commented 4 years ago

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

chethega commented 4 years ago

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

That is correct!

In even more detail, each thread would have 10 independent xoshiro instances. 1-64 bit sequential user requests would advance rng_1, 65+ bit sequential user requests would advance rng_1 and rng_2, and all bulk user requests would advance rng_3-rng_10. rng_1 and rng_2 would be laid out together in an xstate{2}, and the bulk generators would be laid out in an xstate{8}.

Hey, I just ported naïvely

Sorry, I didn't want to impugn your code. You do awesome work in keeping julia's random infrastructure together, and your APIs rock!

What you did is amazing

That's too much praise ;) I just read the assembly, noticed that it sucked, had the audacity to propose 10 interleaved xoshiro instances instead of a single one, and passed a Vec{8, UInt64} instead of a single UInt64 into the RNG function. Praise belongs more to @vigna for xoshiro and @eschnett for SIMD.jl. I guess we should also port the bulk generator to C, so that @vigna can use it as well. It would be a fun exercise in Intel intrinsics (alternative: copy-paste the generated Julia code and make it inline assembly. While at it, we could also clean it up: I see no reason the hot loop has two conditional branches when we could do with a single one; probably some weird size minimization).

PS:

array generation code: nothing magic, but somehow gets quite faster than the default code for array, which does rand(rng, UInt64) in a loop

That is because you hoisted the reads and writes of the state out of the loop. This is a super important optimization, and is the main property of my nostore variant. I feel fine doing this for things with a pointer, but am unsure about AbstractArray: if the setindex! call invokes rand again, then we would generate duplicate random numbers. We don't know the setindex! implementation: maybe the AbstractArray is really backed by a very complicated sparse array that internally uses randomized algorithms?

rfourquet commented 4 years ago

That is because you hoisted writing and reading the state out of the loop

I assumed so, but was surprised that the compiler doesn't do this seemingly easy optimization by itself. But your point about setindex! being allowed to interleave a call to the same generator helps me understand why it is not trivial. (The faster version, which assumes no weird setindex!, should probably be made available to array implementations in some way.)

Also, I realized that in my benchmarks above I compared xoshiro++ against a local instance of MersenneTwister, not against the global RNG. In 1.3, with threading, the global RNG's performance took a serious hit (more than 2x), so it would be interesting to test xoshiro++ as the default global RNG.

vigna commented 4 years ago

So, please help me: why won't this be vectorized by gcc or clang? Here XOSHIRO256_UNROLL=8 and I'm using -O3 -mavx2...

/* s0..s3 and result are global uint64_t[XOSHIRO256_UNROLL] arrays here:
   the state deliberately lives in memory rather than in locals. */
static inline uint64_t next(void) {
    uint64_t t[XOSHIRO256_UNROLL];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s0[i] + s3[i], R) + s0[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s1[i] >> A; /* sic: should be << A, see the correction below */

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= s0[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] ^= s1[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s1[i] ^= s2[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s0[i] ^= s3[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= t[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] = rotl(s3[i], B);
    // This is just to avoid dead-code elimination
    return t[0] ^ t[XOSHIRO256_UNROLL-1];
}
staticfloat commented 4 years ago

Try running clang with -Rpass-missed=loop-vectorize or -Rpass-analysis=loop-vectorize. That should print out debugging information showing which loops were not vectorized, which statements stopped vectorization, etc...

Reference: https://llvm.org/docs/Vectorizers.html#diagnostics

vigna commented 4 years ago

I got this from clang:

./harness.c:83:11: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize] uint64_t e = -1;

But this is the external benchmarking loop. There's never a warning saying it's not optimizing the loops above. :(

From what I've read, the problem would be that I'm using memory and not registers (i.e., the variables aren't local).

staticfloat commented 4 years ago

Have you tried with -Rpass=loop-vectorize to see if it actually does think it's vectorizing the loop?

chethega commented 4 years ago

@vigna I don't know why your code didn't vectorize, but I am quite happy with clang's output for the following. The generated assembly is even nicer to read than julia's. Feel free to add this to your reference implementation, no attribution required.

#include <stdint.h>

#define XOSHIRO256_UNROLL 4

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

typedef struct {uint64_t s[4][XOSHIRO256_UNROLL];} xstate;

xstate global_state;

#define next_macro(result, s) { \
    uint64_t t[XOSHIRO256_UNROLL]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s[0][i] + s[3][i], 23) + s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s[1][i] << 17; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] ^= s[1][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[1][i] ^= s[2][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[0][i] ^= s[3][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= t[i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] = rotl(s[3][i], 45); \
}

void rngbulk(uint64_t* restrict dst, xstate* restrict state, unsigned int len){
    xstate slocal = *state;
    uint64_t result[XOSHIRO256_UNROLL];
    unsigned int len_trunc = len - (len % XOSHIRO256_UNROLL);
    for(int j = 0; j < len_trunc; j += XOSHIRO256_UNROLL){
        next_macro( (result) , (slocal.s) );
        for(int k=0; k<XOSHIRO256_UNROLL; k++) { dst[j+k] = result[k]; }
    }

    if(len_trunc != len){
        next_macro( (result) , (slocal.s) );
        for(int k = 0; k < len-len_trunc; k++){ dst[len_trunc + k] = result[k]; }
    }
    *state = slocal;
}

void rngbulk_globstate(uint64_t* restrict dst, unsigned int len){ rngbulk(dst, &global_state, len);}

edit: removed memory corruption when len is not divisible by XOSHIRO256_UNROLL.
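To drive this C generator from Julia for a head-to-head comparison, a hypothetical sketch (the library name and build line are my assumptions, not part of the thread):

# Assumes the C file was built as, e.g.:
#   cc -O2 -ftree-vectorize -mavx2 -shared -fPIC -o libxoshiro.so xoshiro.c
# An all-zero xoshiro state only ever outputs zeros, so global_state
# must be overwritten with random bytes before calling this.
dst = Vector{UInt64}(undef, 1024)
ccall((:rngbulk_globstate, "libxoshiro"), Cvoid, (Ptr{UInt64}, Cuint),
      dst, length(dst))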

vigna commented 4 years ago

Ah OK! Just switching to s[4][XOSHIRO256_UNROLL] instead of s0[XOSHIRO256_UNROLL], etc. made my code vectorize with gcc 💪🏻.

Now I have to understand why clang does yours and not mine, but we're getting there...

vigna commented 4 years ago

Please note that the >> A should be << A. 🤦🏻‍♂️

vigna commented 4 years ago

And, most fundamentally, it will not happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that kicks in and prevents vectorization. Weird.

etale-cohomology commented 4 years ago

For me, GCC 8.3 generates fast(er) code with -O2 -ftree-vectorize -mavx2 and bad code with -O3 -ftree-vectorize -mavx2 for @vigna's XOSHIRO256_UNROLL (the bad code is about 33% slower).


With -O2 -ftree-vectorize -mavx2 I get:

4010ea: c5 fd 6f 2d 8e 2f 00    vmovdqa 0x2f8e(%rip),%ymm5        # 404080 <s>
  4010f1:   00 
  4010f2:   48 89 d8                mov    %rbx,%rax
  4010f5:   c5 fd 6f 0d 43 30 00    vmovdqa 0x3043(%rip),%ymm1        # 404140 <s+0xc0>
  4010fc:   00 
  4010fd:   c5 fd 6f 25 9b 2f 00    vmovdqa 0x2f9b(%rip),%ymm4        # 4040a0 <s+0x20>
  401104:   00 
  401105:   c5 fd 6f 05 53 30 00    vmovdqa 0x3053(%rip),%ymm0        # 404160 <s+0xe0>
  40110c:   00 
  40110d:   48 8d 93 00 00 01 00    lea    0x10000(%rbx),%rdx
  401114:   c5 fd 6f 1d a4 2f 00    vmovdqa 0x2fa4(%rip),%ymm3        # 4040c0 <s+0x40>
  40111b:   00 
  40111c:   c5 fd 6f 15 bc 2f 00    vmovdqa 0x2fbc(%rip),%ymm2        # 4040e0 <s+0x60>
  401123:   00 
  401124:   0f 1f 40 00             nopl   0x0(%rax)
  401128:   c5 d5 d4 f1             vpaddq %ymm1,%ymm5,%ymm6
  40112c:   c5 bd 73 f2 11          vpsllq $0x11,%ymm2,%ymm8
  401131:   c5 e5 ef c9             vpxor  %ymm1,%ymm3,%ymm1
  401135:   48 83 c0 40             add    $0x40,%rax
  401139:   c5 fd 7f 70 c0          vmovdqa %ymm6,-0x40(%rax)
  40113e:   c5 dd d4 f0             vpaddq %ymm0,%ymm4,%ymm6
  401142:   c5 ed ef c0             vpxor  %ymm0,%ymm2,%ymm0
  401146:   c5 d5 ef 3d b2 2f 00    vpxor  0x2fb2(%rip),%ymm5,%ymm7        # 404100 <s+0x80>
  40114d:   00 
  40114e:   c5 fd 7f 70 e0          vmovdqa %ymm6,-0x20(%rax)
  401153:   c5 f5 ef ed             vpxor  %ymm5,%ymm1,%ymm5
  401157:   c5 b5 73 f3 11          vpsllq $0x11,%ymm3,%ymm9
  40115c:   c5 dd ef 35 bc 2f 00    vpxor  0x2fbc(%rip),%ymm4,%ymm6        # 404120 <s+0xa0>
  401163:   00 
  401164:   c5 fd ef e4             vpxor  %ymm4,%ymm0,%ymm4
  401168:   c5 c5 ef db             vpxor  %ymm3,%ymm7,%ymm3
  40116c:   c4 c1 45 ef f9          vpxor  %ymm9,%ymm7,%ymm7
  401171:   c5 7d 7f 0c 24          vmovdqa %ymm9,(%rsp)
  401176:   c5 cd ef d2             vpxor  %ymm2,%ymm6,%ymm2
  40117a:   c4 c1 4d ef f0          vpxor  %ymm8,%ymm6,%ymm6
  40117f:   c5 7d 7f 44 24 20       vmovdqa %ymm8,0x20(%rsp)
  401185:   c5 fd 7f 35 93 2f 00    vmovdqa %ymm6,0x2f93(%rip)        # 404120 <s+0xa0>
  40118c:   00 
  40118d:   c5 cd 73 f1 2d          vpsllq $0x2d,%ymm1,%ymm6
  401192:   c5 f5 73 d1 13          vpsrlq $0x13,%ymm1,%ymm1
  401197:   c5 cd eb c9             vpor   %ymm1,%ymm6,%ymm1
  40119b:   c5 cd 73 f0 2d          vpsllq $0x2d,%ymm0,%ymm6
  4011a0:   c5 fd 7f 1d 18 2f 00    vmovdqa %ymm3,0x2f18(%rip)        # 4040c0 <s+0x40>
  4011a7:   00 
  4011a8:   c5 fd 73 d0 13          vpsrlq $0x13,%ymm0,%ymm0
  4011ad:   c5 fd 7f 15 2b 2f 00    vmovdqa %ymm2,0x2f2b(%rip)        # 4040e0 <s+0x60>
  4011b4:   00 
  4011b5:   c5 cd eb c0             vpor   %ymm0,%ymm6,%ymm0
  4011b9:   c5 fd 7f 2d bf 2e 00    vmovdqa %ymm5,0x2ebf(%rip)        # 404080 <s>
  4011c0:   00 
  4011c1:   c5 fd 7f 25 d7 2e 00    vmovdqa %ymm4,0x2ed7(%rip)        # 4040a0 <s+0x20>
  4011c8:   00 
  4011c9:   c5 fd 7f 3d 2f 2f 00    vmovdqa %ymm7,0x2f2f(%rip)        # 404100 <s+0x80>
  4011d0:   00 
  4011d1:   c5 fd 7f 0d 67 2f 00    vmovdqa %ymm1,0x2f67(%rip)        # 404140 <s+0xc0>
  4011d8:   00 
  4011d9:   c5 fd 7f 05 7f 2f 00    vmovdqa %ymm0,0x2f7f(%rip)        # 404160 <s+0xe0>

But with -O3 -ftree-vectorize -mavx2 I get:

 4010bf:    48 8b 1d ea 2f 00 00    mov    0x2fea(%rip),%rbx        # 4040b0 <s+0x30>
  4010c6:   48 8b 05 b3 2f 00 00    mov    0x2fb3(%rip),%rax        # 404080 <s>
  4010cd:   4c 8b 05 6c 30 00 00    mov    0x306c(%rip),%r8        # 404140 <s+0xc0>
  4010d4:   48 8b 3d 6d 30 00 00    mov    0x306d(%rip),%rdi        # 404148 <s+0xc8>
  4010db:   48 89 9c 24 b0 00 00    mov    %rbx,0xb0(%rsp)
  4010e2:   00 
  4010e3:   48 8b 1d 86 30 00 00    mov    0x3086(%rip),%rbx        # 404170 <s+0xf0>
  4010ea:   48 89 84 24 b8 00 00    mov    %rax,0xb8(%rsp)
  4010f1:   00 
  4010f2:   48 8b 05 8f 2f 00 00    mov    0x2f8f(%rip),%rax        # 404088 <s+0x8>
  4010f9:   48 89 9c 24 a8 00 00    mov    %rbx,0xa8(%rsp)
  401100:   00 
  401101:   48 8b 1d b0 2f 00 00    mov    0x2fb0(%rip),%rbx        # 4040b8 <s+0x38>
  401108:   48 89 84 24 80 00 00    mov    %rax,0x80(%rsp)
  40110f:   00 
  401110:   48 8b 05 79 2f 00 00    mov    0x2f79(%rip),%rax        # 404090 <s+0x10>
  401117:   48 89 9c 24 a0 00 00    mov    %rbx,0xa0(%rsp)
  40111e:   00 
  40111f:   48 8b 1d 52 30 00 00    mov    0x3052(%rip),%rbx        # 404178 <s+0xf8>
  401126:   48 89 44 24 78          mov    %rax,0x78(%rsp)
  40112b:   48 8b 05 66 2f 00 00    mov    0x2f66(%rip),%rax        # 404098 <s+0x18>
  401132:   48 89 9c 24 98 00 00    mov    %rbx,0x98(%rsp)
  401139:   00 
  40113a:   48 8b 1d 7f 2f 00 00    mov    0x2f7f(%rip),%rbx        # 4040c0 <s+0x40>
  401141:   48 89 44 24 70          mov    %rax,0x70(%rsp)
  401146:   48 8b 05 53 2f 00 00    mov    0x2f53(%rip),%rax        # 4040a0 <s+0x20>
  40114d:   48 89 9c 24 f8 00 00    mov    %rbx,0xf8(%rsp)
  401154:   00 
  401155:   48 8b 1d 6c 2f 00 00    mov    0x2f6c(%rip),%rbx        # 4040c8 <s+0x48>
  40115c:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
  401163:   00 
  401164:   48 8b 05 3d 2f 00 00    mov    0x2f3d(%rip),%rax        # 4040a8 <s+0x28>
  40116b:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  401172:   00 
  401173:   48 8b 1d 56 2f 00 00    mov    0x2f56(%rip),%rbx        # 4040d0 <s+0x50>
  40117a:   48 89 84 24 90 00 00    mov    %rax,0x90(%rsp)
  401181:   00 
  401182:   48 8b 35 c7 2f 00 00    mov    0x2fc7(%rip),%rsi        # 404150 <s+0xd0>
  401189:   48 8b 05 d8 2f 00 00    mov    0x2fd8(%rip),%rax        # 404168 <s+0xe8>
  401190:   48 8b 0d c1 2f 00 00    mov    0x2fc1(%rip),%rcx        # 404158 <s+0xd8>
  401197:   48 89 9c 24 c8 00 00    mov    %rbx,0xc8(%rsp)
  40119e:   00 
  40119f:   48 8b 15 ba 2f 00 00    mov    0x2fba(%rip),%rdx        # 404160 <s+0xe0>
  4011a6:   48 8b 1d 2b 2f 00 00    mov    0x2f2b(%rip),%rbx        # 4040d8 <s+0x58>
  4011ad:   c7 44 24 1c 00 04 00    movl   $0x400,0x1c(%rsp)
  4011b4:   00 
  4011b5:   48 89 44 24 68          mov    %rax,0x68(%rsp)
  4011ba:   4c 8b 3d 3f 2f 00 00    mov    0x2f3f(%rip),%r15        # 404100 <s+0x80>
  4011c1:   48 89 9c 24 e8 00 00    mov    %rbx,0xe8(%rsp)
  4011c8:   00 
  4011c9:   48 8b 1d 10 2f 00 00    mov    0x2f10(%rip),%rbx        # 4040e0 <s+0x60>
  4011d0:   4c 8b 35 31 2f 00 00    mov    0x2f31(%rip),%r14        # 404108 <s+0x88>
  4011d7:   4c 8b 2d 32 2f 00 00    mov    0x2f32(%rip),%r13        # 404110 <s+0x90>
  4011de:   48 89 9c 24 d0 00 00    mov    %rbx,0xd0(%rsp)
  4011e5:   00 
  4011e6:   48 8b 1d fb 2e 00 00    mov    0x2efb(%rip),%rbx        # 4040e8 <s+0x68>
  4011ed:   4c 8b 25 24 2f 00 00    mov    0x2f24(%rip),%r12        # 404118 <s+0x98>
  4011f4:   4c 8b 1d 2d 2f 00 00    mov    0x2f2d(%rip),%r11        # 404128 <s+0xa8>
  4011fb:   48 89 9c 24 d8 00 00    mov    %rbx,0xd8(%rsp)
  401202:   00 
  401203:   48 8b 1d e6 2e 00 00    mov    0x2ee6(%rip),%rbx        # 4040f0 <s+0x70>
  40120a:   4c 8b 15 1f 2f 00 00    mov    0x2f1f(%rip),%r10        # 404130 <s+0xb0>
  401211:   4c 8b 0d 20 2f 00 00    mov    0x2f20(%rip),%r9        # 404138 <s+0xb8>
  401218:   48 89 9c 24 e0 00 00    mov    %rbx,0xe0(%rsp)
  40121f:   00 
  401220:   48 8b 1d d1 2e 00 00    mov    0x2ed1(%rip),%rbx        # 4040f8 <s+0x78>
  401227:   48 89 9c 24 f0 00 00    mov    %rbx,0xf0(%rsp)
  40122e:   00 
  40122f:   48 8b 1d ea 2e 00 00    mov    0x2eea(%rip),%rbx        # 404120 <s+0xa0>
  401236:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40123d:   00 00 00 
  401240:   48 8b 84 24 f8 00 00    mov    0xf8(%rsp),%rax
  401247:   00 
  401248:   4c 33 bc 24 b8 00 00    xor    0xb8(%rsp),%r15
  40124f:   00 
  401250:   4c 33 b4 24 80 00 00    xor    0x80(%rsp),%r14
  401257:   00 
  401258:   4c 33 6c 24 78          xor    0x78(%rsp),%r13
  40125d:   48 c1 e0 11             shl    $0x11,%rax
  401261:   4c 33 64 24 70          xor    0x70(%rsp),%r12
  401266:   48 33 9c 24 88 00 00    xor    0x88(%rsp),%rbx
  40126d:   00 
  40126e:   48 89 44 24 58          mov    %rax,0x58(%rsp)
  401273:   48 8b 84 24 c0 00 00    mov    0xc0(%rsp),%rax
  40127a:   00 
  40127b:   4c 33 9c 24 90 00 00    xor    0x90(%rsp),%r11
  401282:   00 
  401283:   4c 33 94 24 b0 00 00    xor    0xb0(%rsp),%r10
  40128a:   00 
  40128b:   48 c1 e0 11             shl    $0x11,%rax
  40128f:   4c 33 8c 24 a0 00 00    xor    0xa0(%rsp),%r9
  401296:   00 
  401297:   48 33 bc 24 c0 00 00    xor    0xc0(%rsp),%rdi
  40129e:   00 
  40129f:   48 89 44 24 50          mov    %rax,0x50(%rsp)
  4012a4:   48 8b 84 24 c8 00 00    mov    0xc8(%rsp),%rax
  4012ab:   00 
  4012ac:   48 33 b4 24 c8 00 00    xor    0xc8(%rsp),%rsi
  4012b3:   00 
  4012b4:   48 33 8c 24 e8 00 00    xor    0xe8(%rsp),%rcx
  4012bb:   00 
  4012bc:   48 c1 e0 11             shl    $0x11,%rax
  4012c0:   4c 33 84 24 f8 00 00    xor    0xf8(%rsp),%r8
  4012c7:   00 
  4012c8:   48 89 44 24 48          mov    %rax,0x48(%rsp)
  4012cd:   48 8b 84 24 e8 00 00    mov    0xe8(%rsp),%rax
  4012d4:   00 
  4012d5:   48 c1 e0 11             shl    $0x11,%rax
  4012d9:   48 89 44 24 40          mov    %rax,0x40(%rsp)
  4012de:   48 8b 84 24 d0 00 00    mov    0xd0(%rsp),%rax
  4012e5:   00 
  4012e6:   48 c1 e0 11             shl    $0x11,%rax
  4012ea:   48 89 44 24 38          mov    %rax,0x38(%rsp)
  4012ef:   48 8b 84 24 d8 00 00    mov    0xd8(%rsp),%rax
  4012f6:   00 
  4012f7:   48 c1 e0 11             shl    $0x11,%rax
  4012fb:   48 89 44 24 30          mov    %rax,0x30(%rsp)
  401300:   48 8b 84 24 e0 00 00    mov    0xe0(%rsp),%rax
  401307:   00 
  401308:   48 c1 e0 11             shl    $0x11,%rax
  40130c:   48 89 44 24 28          mov    %rax,0x28(%rsp)
  401311:   48 8b 84 24 f0 00 00    mov    0xf0(%rsp),%rax
  401318:   00 
  401319:   48 c1 e0 11             shl    $0x11,%rax
  40131d:   48 89 44 24 20          mov    %rax,0x20(%rsp)
  401322:   48 8b 84 24 d0 00 00    mov    0xd0(%rsp),%rax
  401329:   00 
  40132a:   48 31 d0                xor    %rdx,%rax
  40132d:   48 89 44 24 60          mov    %rax,0x60(%rsp)
  401332:   48 8b 44 24 68          mov    0x68(%rsp),%rax
  401337:   48 33 84 24 d8 00 00    xor    0xd8(%rsp),%rax
  40133e:   00 
  40133f:   48 8b 94 24 a8 00 00    mov    0xa8(%rsp),%rdx
  401346:   00 
  401347:   48 33 94 24 e0 00 00    xor    0xe0(%rsp),%rdx
  40134e:   00 
  40134f:   4c 31 bc 24 f8 00 00    xor    %r15,0xf8(%rsp)
  401356:   00 
  401357:   48 89 94 24 a8 00 00    mov    %rdx,0xa8(%rsp)
  40135e:   00 
  40135f:   48 8b 94 24 98 00 00    mov    0x98(%rsp),%rdx
  401366:   00 
  401367:   48 33 94 24 f0 00 00    xor    0xf0(%rsp),%rdx
  40136e:   00 
  40136f:   4c 31 b4 24 c0 00 00    xor    %r14,0xc0(%rsp)
  401376:   00 
  401377:   4c 31 ac 24 c8 00 00    xor    %r13,0xc8(%rsp)
  40137e:   00 
  40137f:   4c 31 a4 24 e8 00 00    xor    %r12,0xe8(%rsp)
  401386:   00 
  401387:   48 31 9c 24 d0 00 00    xor    %rbx,0xd0(%rsp)
  40138e:   00 
  40138f:   4c 31 9c 24 d8 00 00    xor    %r11,0xd8(%rsp)
  401396:   00 
  401397:   4c 31 94 24 e0 00 00    xor    %r10,0xe0(%rsp)
  40139e:   00 
  40139f:   4c 31 8c 24 f0 00 00    xor    %r9,0xf0(%rsp)
  4013a6:   00 
  4013a7:   48 89 94 24 98 00 00    mov    %rdx,0x98(%rsp)
  4013ae:   00 
  4013af:   48 8b 54 24 60          mov    0x60(%rsp),%rdx
  4013b4:   4c 31 84 24 b8 00 00    xor    %r8,0xb8(%rsp)
  4013bb:   00 
  4013bc:   49 c1 c8 13             ror    $0x13,%r8
  4013c0:   48 31 bc 24 80 00 00    xor    %rdi,0x80(%rsp)
  4013c7:   00 
  4013c8:   48 c1 cf 13             ror    $0x13,%rdi
  4013cc:   48 31 74 24 78          xor    %rsi,0x78(%rsp)
  4013d1:   48 c1 ce 13             ror    $0x13,%rsi
  4013d5:   48 31 4c 24 70          xor    %rcx,0x70(%rsp)
  4013da:   48 c1 c9 13             ror    $0x13,%rcx
  4013de:   48 31 94 24 88 00 00    xor    %rdx,0x88(%rsp)
  4013e5:   00 
  4013e6:   48 8b 94 24 a8 00 00    mov    0xa8(%rsp),%rdx
  4013ed:   00 
  4013ee:   48 31 84 24 90 00 00    xor    %rax,0x90(%rsp)
  4013f5:   00 
  4013f6:   48 c1 c8 13             ror    $0x13,%rax
  4013fa:   4c 33 7c 24 58          xor    0x58(%rsp),%r15
  4013ff:   4c 33 74 24 50          xor    0x50(%rsp),%r14
  401404:   4c 33 6c 24 48          xor    0x48(%rsp),%r13
  401409:   48 89 44 24 68          mov    %rax,0x68(%rsp)
  40140e:   48 8b 84 24 a8 00 00    mov    0xa8(%rsp),%rax
  401415:   00 
  401416:   48 31 94 24 b0 00 00    xor    %rdx,0xb0(%rsp)
  40141d:   00 
  40141e:   48 8b 94 24 98 00 00    mov    0x98(%rsp),%rdx
  401425:   00 
  401426:   48 31 94 24 a0 00 00    xor    %rdx,0xa0(%rsp)
  40142d:   00 
  40142e:   48 c1 c8 13             ror    $0x13,%rax
  401432:   48 8b 54 24 60          mov    0x60(%rsp),%rdx
  401437:   4c 33 64 24 40          xor    0x40(%rsp),%r12
  40143c:   48 89 84 24 a8 00 00    mov    %rax,0xa8(%rsp)
  401443:   00 
  401444:   48 8b 84 24 98 00 00    mov    0x98(%rsp),%rax
  40144b:   00 
  40144c:   48 33 5c 24 38          xor    0x38(%rsp),%rbx
  401451:   4c 33 5c 24 30          xor    0x30(%rsp),%r11
  401456:   48 c1 ca 13             ror    $0x13,%rdx
  40145a:   48 c1 c8 13             ror    $0x13,%rax
  40145e:   4c 33 54 24 28          xor    0x28(%rsp),%r10
  401463:   4c 33 4c 24 20          xor    0x20(%rsp),%r9
  401468:   ff 4c 24 1c             decl   0x1c(%rsp)
  40146c:   48 89 84 24 98 00 00    mov    %rax,0x98(%rsp)
  401473:   00 
  401474:   0f 85 c6 fd ff ff       jne    401240 <main+0x1d0>
  40147a:   c5 fa 7e 54 24 78       vmovq  0x78(%rsp),%xmm2
  401480:   48 8b 44 24 68          mov    0x68(%rsp),%rax
  401485:   c4 e3 e9 22 4c 24 70    vpinsrq $0x1,0x70(%rsp),%xmm2,%xmm1
  40148c:   01 
  40148d:   c5 fa 7e 9c 24 b8 00    vmovq  0xb8(%rsp),%xmm3
  401494:   00 00 
  401496:   c4 e3 e1 22 84 24 80    vpinsrq $0x1,0x80(%rsp),%xmm3,%xmm0
  40149d:   00 00 00 01 
  4014a1:   c5 fa 7e a4 24 b0 00    vmovq  0xb0(%rsp),%xmm4
  4014a8:   00 00 
  4014aa:   c5 fa 7e ac 24 88 00    vmovq  0x88(%rsp),%xmm5
  4014b1:   00 00 
  4014b3:   c5 fa 7e b4 24 c8 00    vmovq  0xc8(%rsp),%xmm6
  4014ba:   00 00 
  4014bc:   c5 fa 7e bc 24 f8 00    vmovq  0xf8(%rsp),%xmm7
  4014c3:   00 00 
  4014c5:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4014cb:   c4 e3 d9 22 8c 24 a0    vpinsrq $0x1,0xa0(%rsp),%xmm4,%xmm1
  4014d2:   00 00 00 01 
  4014d6:   c4 c1 f9 6e e5          vmovq  %r13,%xmm4
  4014db:   c5 fd 7f 05 9d 2b 00    vmovdqa %ymm0,0x2b9d(%rip)        # 404080 <s>
  4014e2:   00 
  4014e3:   c4 e3 d1 22 84 24 90    vpinsrq $0x1,0x90(%rsp),%xmm5,%xmm0
  4014ea:   00 00 00 01 
  4014ee:   c4 c1 f9 6e ef          vmovq  %r15,%xmm5
  4014f3:   c5 fa 7e 94 24 e0 00    vmovq  0xe0(%rsp),%xmm2
  4014fa:   00 00 
  4014fc:   c5 fa 7e 9c 24 d0 00    vmovq  0xd0(%rsp),%xmm3
  401503:   00 00 
  401505:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  40150b:   c4 e3 c9 22 8c 24 e8    vpinsrq $0x1,0xe8(%rsp),%xmm6,%xmm1
  401512:   00 00 00 01 
  401516:   c4 c1 f9 6e f2          vmovq  %r10,%xmm6
  40151b:   c5 fd 7f 05 7d 2b 00    vmovdqa %ymm0,0x2b7d(%rip)        # 4040a0 <s+0x20>
  401522:   00 
  401523:   c4 e3 c1 22 84 24 c0    vpinsrq $0x1,0xc0(%rsp),%xmm7,%xmm0
  40152a:   00 00 00 01 
  40152e:   c4 e1 f9 6e fb          vmovq  %rbx,%xmm7
  401533:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  401539:   c4 e3 e9 22 8c 24 f0    vpinsrq $0x1,0xf0(%rsp),%xmm2,%xmm1
  401540:   00 00 00 01 
  401544:   c4 e1 f9 6e d6          vmovq  %rsi,%xmm2
  401549:   c5 fd 7f 05 6f 2b 00    vmovdqa %ymm0,0x2b6f(%rip)        # 4040c0 <s+0x40>
  401550:   00 
  401551:   c4 e3 e1 22 84 24 d8    vpinsrq $0x1,0xd8(%rsp),%xmm3,%xmm0
  401558:   00 00 00 01 
  40155c:   c4 c1 f9 6e d8          vmovq  %r8,%xmm3
  401561:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  401567:   c4 c3 d9 22 cc 01       vpinsrq $0x1,%r12,%xmm4,%xmm1
  40156d:   c5 fa 7e a4 24 a8 00    vmovq  0xa8(%rsp),%xmm4
  401574:   00 00 
  401576:   c5 fd 7f 05 62 2b 00    vmovdqa %ymm0,0x2b62(%rip)        # 4040e0 <s+0x60>
  40157d:   00 
  40157e:   c4 c3 d1 22 c6 01       vpinsrq $0x1,%r14,%xmm5,%xmm0
  401584:   c4 e1 f9 6e ea          vmovq  %rdx,%xmm5
  401589:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  40158f:   c4 c3 c9 22 c9 01       vpinsrq $0x1,%r9,%xmm6,%xmm1
  401595:   c5 fd 7f 05 63 2b 00    vmovdqa %ymm0,0x2b63(%rip)        # 404100 <s+0x80>
  40159c:   00 
  40159d:   c4 c3 c1 22 c3 01       vpinsrq $0x1,%r11,%xmm7,%xmm0
  4015a3:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015a9:   c4 e3 e9 22 c9 01       vpinsrq $0x1,%rcx,%xmm2,%xmm1
  4015af:   c5 fd 7f 05 69 2b 00    vmovdqa %ymm0,0x2b69(%rip)        # 404120 <s+0xa0>
  4015b6:   00 
  4015b7:   c4 e3 e1 22 c7 01       vpinsrq $0x1,%rdi,%xmm3,%xmm0
  4015bd:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015c3:   c4 e3 d9 22 8c 24 98    vpinsrq $0x1,0x98(%rsp),%xmm4,%xmm1
  4015ca:   00 00 00 01 
  4015ce:   c5 fd 7f 05 6a 2b 00    vmovdqa %ymm0,0x2b6a(%rip)        # 404140 <s+0xc0>
  4015d5:   00 
  4015d6:   c4 e3 d1 22 c0 01       vpinsrq $0x1,%rax,%xmm5,%xmm0
  4015dc:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015e2:   c5 fd 7f 05 76 2b 00    vmovdqa %ymm0,0x2b76(%rip)        # 404160 <s+0xe0>
chriselrod commented 4 years ago

And, most fundamental, it will not happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that jumps in and prevents vectorization. Weird.

Are you using gcc? Most often when I get vectorization with -O2 -ftree-vectorize but not -O3, I do get it with -O3 -fdisable-tree-cunrolli. This happens with manually unrolled code.


In VectorizedRNG.jl I have a vectorized PCG implementation. It is a lot slower than xoshiro. With AVX2, it is also about 50% slower than the default MersenneTwister, but it is faster with AVX512. Perhaps also of interest is that I implemented a SIMD random normal generator using the Box-Muller algorithm. It is close to 1.5x and 4x faster than the default with AVX2 and AVX512, respectively. I suspect it can be improved further.
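As a reference point for what Box-Muller does, a minimal scalar sketch (my code; the SIMD version in VectorizedRNG.jl is more involved): it turns two uniforms in (0, 1] into two independent standard normals.

# Box-Muller transform: u1, u2 uniform in (0, 1] -> two standard normals.
function boxmuller(u1::Float64, u2::Float64)
    r = sqrt(-2 * log(u1))   # radius; requires u1 > 0
    θ = 2π * u2              # uniform angle
    return r * cos(θ), r * sin(θ)
end

z1, z2 = boxmuller(1 - rand(), rand())  # 1 - rand() maps [0, 1) to (0, 1]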

PallHaraldsson commented 3 years ago

@chriselrod, despite my proposing a new PRNG: for at least some applications, e.g. neural networks (where QRNG seems to give the best results in some cases, and strangely worse ones in others), NO PRNG may be good enough (if accuracy, not just speed, is your concern). It's at least good to keep in mind.

On the effects of pseudorandom and quantum-random number generators in soft computing https://link.springer.com/article/10.1007%2Fs00500-019-04450-0

We use a CPU and a QPU to generate random numbers for multiple machine learning techniques. Random numbers are employed in the random initial weight distributions of dense and convolutional neural networks, in which results show a profound difference in learning patterns for the two. [..] In 50 dense neural networks (25 PRNG/25 QRNG), QRNG increases over PRNG [e.g.] QRNG exceeded PRNG for mental state EEG classification by + 2.82%. [..] In CIFAR-10, the QRNG outperforms PRNG by + 0.92%. The n-random split of a Random Tree is enhanced towards and new Quantum Random Tree (QRT) model, which has differing classification abilities to its classical counterpart [..] Finally, the QRT is ensembled into a Quantum Random Forest (QRF), which also has a noticeable effect when compared to the standard Random Forest (RF). [..] Although quantum, quantum-inspired and hybrid classical/quantum algorithms are explored, as well as the likewise methods for computing, the use of a Quantum Random Number Generator is rarely explored within a classical machine learning approach in which an RNG is required [..] Major issues arise with the possibility of backdoors, notably, for example, Intel’s pseudorandom generator which, after hijacking, allowed for complete control of a computer system for malicious intent Degabriele et al. (2016), Schneier (2007). The Intel issue was far from a lone incident, the RANDU system was cracked by the NSA for unprecedented access to the RSA BSAFE cryptographic library, as well as in 2006 when Debian OpenSSL’s random number generator was also cracked, leading to Debian being compromised for 2 years Markowsky (2014).

The paper is about QRNG (which could be a drop-in replacement in Julia), but it is also an interesting overview of quantum machine learning in general, with e.g.:

This section outlines some of the state-of-the-art applications of quantum theory in computing. [..] Quantum Perceptrons are a theoretical approach to deriving a quantum equivalent of a perceptron unit (neuron) within an Artificial Neural Network Schuld et al. (2014). [..] A very limited extent of research suggests quantum effects in a network to be the possible source of consciousness Hameroff and Penrose (1996)

Off-topic, but intriguing (and maybe related, if a QRNG is used?): Quantum Image Processing and Its Application to Edge Detection: Theory and Experiment https://arxiv.org/abs/1801.01465

vks commented 3 years ago

at least for some applications, e.g. for neural networks (there QRNG seems to have the best case, and for others strangely making worse?), NO PRNG may be good enough

I find this hard to believe: a cryptographically secure PRNG has to be indistinguishable from true randomness. It would be quite a statement to claim that no such PRNG exists.

JeffBezanson commented 3 years ago

Fixed by #40546.