JuliaSIMD / VectorizedRNG.jl

Vectorized uniform and normal random samplers.
MIT License

Vectorized considered harmful? #4

Open PallHaraldsson opened 4 years ago

PallHaraldsson commented 4 years ago

A. First I want to start on a positive note.

Your RNG implementation is 2.5 times faster (probably just because the RNG itself is faster, rather than because it's vectorized).

julia> x = zeros(UInt64, 1); @btime rand!(local_rng(), $x);
  18.234 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 1); @btime rand!($x);
  46.249 ns (0 allocations: 0 bytes)

While you also beat Base on Float64 (by a smaller factor), it's interesting that your Int generation is still slower than your Float generation (you do not generate a Float first, which I think Base does, so it's understandable in their case).

julia> x = zeros(Float64, 1); @btime rand!(local_rng(), $x);
  12.197 ns (0 allocations: 0 bytes)

julia> x = zeros(Float64, 1); @btime rand!($x);
  29.039 ns (0 allocations: 0 bytes)

While I remember: int generation doesn't work with PCG (I don't mind), so should your tagline, "Vectorized PCG uniform, normal, and exponential random samplers.", be changed? (Maybe you just started with PCG, and perhaps I influenced you to add the other RNG.)

B. What you really want to show is e.g. the possible 3.4x gain, which you can see in this microbenchmark, but which I think may be harmful:

julia> x = zeros(UInt64, 2048); @btime rand!(local_rng(), $x);
  1.023 μs (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 2048); @btime rand!($x);
  3.498 μs (0 allocations: 0 bytes)

Now, as you know, I've pointed to/used your package on Discourse, and we could discuss this there in the open, but I want to make sure I'm not misunderstanding before I post a warning about my program, which is 3x faster mostly thanks to your package.

Someone else used this trick of precalculating the random numbers first, and I just improved that program while keeping the idea. I've seen the trick before, maybe first in Debian's Benchmarks Game.

And if others do it, you are kind of forced to do so too in a microbenchmark, to match other languages, since it works.

I just think this may be a very deceptive idea, and I think your vectorized package depends on it. At some point, for me around 2048 precalculated random numbers, it stops being faster and gets slower with larger arrays; I guess around the point where the generated numbers fill the L1 cache. But that's just the best case.

In all real-world programs you're going to do something with those values, and using the L1 cache for something else is valuable, so in those programs the right size might be much lower. Is there, then, a way to have your cake and eat it too?

I'm not sure what a safe amount of generated random numbers is. With just a few numbers generated you get almost all of the gain (maybe keep the next few generated values in registers only?). But in programs where you use this idea, I'm not sure you can do away with a small buffer in memory.

julia> x = zeros(UInt64, 1); @btime rand!(local_rng(), $x);
  18.276 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 2); @btime rand!(local_rng(), $x);
  18.259 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 4); @btime rand!(local_rng(), $x);
  18.209 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 8); @btime rand!(local_rng(), $x);
  21.283 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 16); @btime rand!(local_rng(), $x);
  25.688 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 32); @btime rand!(local_rng(), $x); 
  32.816 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 64); @btime rand!(local_rng(), $x);
  50.996 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 128); @btime rand!(local_rng(), $x);
  85.410 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 256); @btime rand!(local_rng(), $x);
  144.156 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 512); @btime rand!(local_rng(), $x);
  266.631 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 1024); @btime rand!(local_rng(), $x);
  517.667 ns (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 2048); @btime rand!(local_rng(), $x);
  1.023 μs (0 allocations: 0 bytes)

# somewhere around here it becomes a problem, or in real-world programs much sooner

julia> x = zeros(UInt64, 4096); @btime rand!(local_rng(), $x);
  2.224 μs (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 8192); @btime rand!(local_rng(), $x);
  6.119 μs (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 1000000); @btime rand!(local_rng(), $x);
  847.457 μs (0 allocations: 0 bytes)

julia> x = zeros(UInt64, 2048000); @btime rand!(local_rng(), $x);
  2.147 ms (0 allocations: 0 bytes)
PallHaraldsson commented 4 years ago

This issue kept me up last night, I had this nagging feeling I was giving bad advice, at https://discourse.julialang.org/t/best-way-to-optimize-code/37980/26 (I've followed up with the OP).

FYI, this is potentially a real issue for the code in that thread: my program, using your package, which used to be 3x faster, becomes 80% slower, then 2x slower, and I guess slower still beyond that (without bound? I just didn't have the patience to try larger values), with a one-line change:

@time begin
       nsim = 10^3;
       omega = 1000;  # was omega = 100, which was faster; I tried other values too
[..]
 67.861310 seconds (2.49 G allocations: 56.125 GiB, 3.85% gc time)

for omega=2000
276.214053 seconds (10.98 G allocations: 238.843 GiB, 5.59% gc time)

still more allocations for the original code:
139.043652 seconds (19.97 M allocations: 62.674 GiB, 2.09% gc time)

I can still use your package (e.g. in that program) for a speedup, without the scalability issue.

chriselrod commented 4 years ago

Benchmarking very short vectors (e.g., length == 1)

Why is Float64 faster than UInt? There is overhead in using the random number generator, and this overhead is a function of how much it is unrolled. This is because, for uniform Float64 generation, it unrolls by 2, and for random UInt or random normal generation, it unrolls by 4. I'll call this the UF, for unroll-factor.

What this means is that, when you call rand!(x), it will load 4 * UF * VectorizedRNG.W64 numbers (the RNG state) from memory, and then store them again. W64 is how many 64-bit integers fit in a SIMD register, assuming your computer has the AVX2 instruction set. This package will be slow if it does not have AVX2. I assume yours does, and that VectorizedRNG.W64 == 4, based on your timings: no speed difference between vectors of length 1 and 4.

The W64 represents how many numbers your computer is able to operate on with a single instruction. However, your CPU can also issue multiple instructions in parallel. This is where the UF helps: a single core can work on updating multiple sets of W64 state variables in parallel to generate random numbers.

Of course, the higher the unroll factor, the more sets of state that have to be loaded from memory and then stored again when it is done. This loading takes longer than turning an integer to a floating point number, hence the floating point version will be faster until your vectors are long enough to amortize that cost.
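
If you want to sanity-check the width assumption on your own machine, something like this should work (a quick sketch; VectorizedRNG.W64 is an internal constant, as referenced above, so treat it as an implementation detail):

using Random, VectorizedRNG, BenchmarkTools

VectorizedRNG.W64  # e.g. 4 with AVX2: one 256-bit register holds four UInt64s

# Timings should plateau up to W64 elements, since a full SIMD batch
# is generated either way:
for n in (1, 2, 4, 8)
    x = zeros(UInt64, n)
    @btime rand!(local_rng(), $x)
end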

chriselrod commented 4 years ago

Deceptive buffers?

Now, as you know, I've pointed to/used your package on Discourse, and we could discuss this there in the open, but I want to make sure I'm not misunderstanding before I post a warning about my program, which is 3x faster mostly thanks to your package.

Someone else used this trick of precalculating the random numbers first, and I just improved that program while keeping the idea. I've seen the trick before, maybe first in Debian's Benchmarks Game.

And if others do it, you are kind of forced to do so too in a microbenchmark, to match other languages, since it works.

I just think this may be a very deceptive idea, and I think your vectorized package depends on it.

I'm not sure what you mean here. If I understand you correctly, that is what Julia's default random number generator does: fill_array! (defined above) calculates random numbers in huge batches (at least 384 at a time). Otherwise, it'll copy from a saved buffer, updating it if necessary.
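
For concreteness, here is a minimal sketch of that buffering idea (a hypothetical illustration with made-up names, not the actual fill_array! code): generate in vectorized batches, hand values out one at a time, and refill only when the buffer is exhausted.

using Random, VectorizedRNG

mutable struct BufferedUniform
    buf::Vector{Float64}
    pos::Int
end
function BufferedUniform(n::Integer=384)   # 384 echoes the batch size above
    buf = Vector{Float64}(undef, n)
    rand!(local_rng(), buf)
    BufferedUniform(buf, 0)
end

@inline function nextval!(b::BufferedUniform)
    if b.pos == length(b.buf)   # exhausted: one vectorized refill
        rand!(local_rng(), b.buf)
        b.pos = 0
    end
    b.pos += 1
    return @inbounds b.buf[b.pos]
end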

chriselrod commented 4 years ago

Memory bandwidth

At some point, for me around 2048 precalculated random numbers, it stops being faster and gets slower with larger arrays; I guess around the point where the generated numbers fill the L1 cache. But that's just the best case.

In all real-world programs you're going to do something with those values, and using the L1 cache for something else is valuable, so in those programs the right size might be much lower.

Memory bandwidth is much slower than the CPU. You need to do a huge amount of computation relative to the amount of memory, otherwise every program streaming through memory will be bottlenecked by memory bandwidth, and computation speed becomes irrelevant. Hence MT, a vectorized Xoshiro, or a scalar Xoshiro will all yield equal performance.

I could help this by checking the array size, and using nontemporal stores if the array is large. Nontemporal stores prevent the store from being cached, improving performance in these types of scenarios. You can see how they improve performance for vmapnt! over vmap! here, for example.
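
You can see the effect yourself, since LoopVectorization exports both vmap! and vmapnt!. A minimal comparison, assuming the difference only shows up once the arrays are much larger than cache:

using LoopVectorization, BenchmarkTools

a = rand(2^24); b = rand(2^24); y = similar(a);  # ~128 MiB per array

f(x, z) = x * z + x  # a stand-in kernel
@btime vmap!($f, $y, $a, $b)    # stores go through the cache hierarchy
@btime vmapnt!($f, $y, $a, $b)  # nontemporal stores bypass the cache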

chriselrod commented 4 years ago

How many numbers to generate?

In all real-world programs you're going to do something with those values, and using the L1 cache for something else is valuable, so in those programs the right size might be much lower. Is there, then, a way to have your cake and eat it too?

Good question. When it comes to VectorizedRNG, memory bandwidth should be less of a problem than with the MersenneTwister, because it can generate smaller batches of numbers more efficiently, and Xoshiro's state is itself much smaller, so it should contend with memory less than the twister.

Although, as always, benchmark to compare if possible.
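
A minimal harness for that comparison might look like this (pick the batch sizes your program actually uses):

using Random, VectorizedRNG, BenchmarkTools

for n in (16, 256, 4096, 1 << 20)
    x = Vector{Float64}(undef, n)
    println("n = ", n)
    @btime rand!(local_rng(), $x)            # VectorizedRNG
    @btime rand!(Random.default_rng(), $x)   # the default RNG
end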

chriselrod commented 4 years ago

Clock speed vs Vectorization

The README of this library shows RNGTest.bigcrush was faster with Random.MersenneTwister() than it was with VectorizedRNG.local_rng(). I should actually monitor clock speeds, but I suspect it was a combination of the famed AVX CPU downclocking and penalized performance on transitions.

For example, my computer runs at 4.1 GHz for AVX512, 4.3 GHz for AVX(2), and 4.6 GHz for non-AVX code. dSFMT is non-AVX, letting it run at 4.6 GHz.

If a CPU is in non-AVX mode and encounters AVX code, it will run at the same clock speed but at something like 1/4 of the instructions per clock (IPC) for around 9 microseconds (that is 4x slower!), and then not execute anything at all for 11 microseconds, before switching modes and resuming at full IPC at the reduced clock speed.

20 microseconds is a long time! Your vector of length 2048 took about 2.2 microseconds. If it ran 4x slower because of the 1/4 IPC, it might almost hit the point of transitioning and pausing for another 11 microseconds. Then maybe it'll hit non-AVX code, and run at the reduced clock speed for long enough to clock back up before returning to the AVX code...

What is the rest of the code doing? I honestly have no idea. But speculation is easy. Performance is complicated, and things like this (and the cache contention) can make micro-benchmarks non-representative of "whole program" performance.

Just as it'd likely be better for the RNG not to run in AVX mode if the rest of the program doesn't, if the rest of the program is running AVX instructions, it'd be better for the RNG to run AVX too. I'd expect -- but it'll be a while before I can benchmark in larger applications -- that this plays friendlier with well-optimized programs than dSFMT for this reason (and because it pressures the cache less).

This is all just speculation on my part, based on articles like this one.

PallHaraldsson commented 3 years ago

Hi again,

I can't believe I didn't write back at the time. I meant to, if only to say I like your very good responses. I recall I also had some comments in mind, which may be why I delayed.

Anyway, I was looking this up since I was in contact with Vigna, after his work finally landed in Julia. In short, is this package now more or less outdated (while still working), since it's the same RNG? And while there's SIMD in Julia too, and I believe you did some of the work, is it mostly copy-paste from this code or substantially different?

chriselrod commented 3 years ago

Anyway, I was looking this up since I was in contact with Vigna, after his work finally landed in Julia. In short, is this package now more or less outdated (while still working), since it's the same RNG?

Hi @PallHaraldsson, not necessarily. VectorizedRNG.jl is still much faster for me:

julia> using Random, VectorizedRNG

julia> x = Vector{Float64}(undef, 1024);

julia> lrng = local_rng(); drng = Random.default_rng(); typeof(drng)
TaskLocalRNG

julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 728; memory estimate: 0 bytes; allocs estimate: 0
ns

 (173.73 - 174.15]  ██████████████████████████████ 8728
 (174.15 - 174.56]   0
 (174.56 - 174.98]   0
 (174.98 - 175.4 ]  ███▊1085
 (175.4  - 175.81]  ▍99
 (175.81 - 176.23]  ▎69
 (176.23 - 176.64]  ▏4
 (176.64 - 177.06]  ▏1
 (177.06 - 177.48]   0
 (177.48 - 177.89]   0
 (177.89 - 178.31]  ▏2
 (178.31 - 178.73]   0
 (178.73 - 179.14]   0
 (179.14 - 179.56]  ▏2
 (179.56 - 224.13]  ▏10

                  Counts

min: 173.729 ns (0.00% GC); mean: 174.005 ns (0.00% GC); median: 173.788 ns (0.00% GC); max: 224.135 ns (0.00% GC).

julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 369; memory estimate: 0 bytes; allocs estimate: 0
ns

 (252.0 - 253.0]  ██████████████████████████████ 8317
 (253.0 - 253.9]  ██▊729
 (253.9 - 254.9]  ▏1
 (254.9 - 255.8]  ██▏586
 (255.8 - 256.8]  █▏278
 (256.8 - 257.7]  ▎60
 (257.7 - 258.7]  ▏11
 (258.7 - 259.7]  ▏1
 (259.7 - 260.6]   0
 (260.6 - 261.6]  ▏1
 (261.6 - 262.5]  ▏2
 (262.5 - 263.5]   0
 (263.5 - 264.4]  ▏1
 (264.4 - 265.4]  ▏3
 (265.4 - 384.4]  ▏10

                  Counts

min: 252.008 ns (0.00% GC); mean: 253.063 ns (0.00% GC); median: 252.713 ns (0.00% GC); max: 384.358 ns (0.00% GC).

julia> @benchmark randn!($lrng, $x)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1058.0 - 1078.0]  ██████████████████████████████ 7161
 (1078.0 - 1098.0]  ███████████▌2723
 (1098.0 - 1118.0]   0
 (1118.0 - 1138.0]   0
 (1138.0 - 1158.0]   0
 (1158.0 - 1178.0]  ▏25
 (1178.0 - 1198.0]  ▍64
 (1198.0 - 1218.0]  ▏7
 (1218.0 - 1238.0]  ▏4
 (1238.0 - 1258.0]  ▏3
 (1258.0 - 1278.0]  ▏1
 (1278.0 - 1298.0]  ▏1
 (1298.0 - 1318.0]   0
 (1318.0 - 1338.0]  ▏1
 (1338.0 - 2830.0]  ▏10

                  Counts

min: 1.058 μs (0.00% GC); mean: 1.076 μs (0.00% GC); median: 1.076 μs (0.00% GC); max: 2.829 μs (0.00% GC).

julia> @benchmark randn!($drng, $x)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2234.0 - 2285.0]  ▊115
 (2285.0 - 2335.0]  █████████████▊2384
 (2335.0 - 2386.0]  ██████████████████████████████ 5208
 (2386.0 - 2437.0]  ███████████▎1940
 (2437.0 - 2488.0]  █▌240
 (2488.0 - 2539.0]  ▌75
 (2539.0 - 2589.0]  ▏19
 (2589.0 - 2640.0]  ▏2
 (2640.0 - 2691.0]  ▏1
 (2691.0 - 2742.0]  ▏3
 (2742.0 - 2793.0]   0
 (2793.0 - 2843.0]   0
 (2843.0 - 2894.0]  ▏1
 (2894.0 - 2945.0]  ▏2
 (2945.0 - 5269.0]  ▏10

                  Counts

min: 2.234 μs (0.00% GC); mean: 2.363 μs (0.00% GC); median: 2.359 μs (0.00% GC); max: 5.269 μs (0.00% GC).

julia> versioninfo()
Julia Version 1.7.0-DEV.1258
Commit 97f446baf3* (2021-06-06 20:55 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cascadelake)

This computer has AVX512. Results look very different on an M1-Mac (running natively):

julia> using Random, VectorizedRNG

julia> x = Vector{Float64}(undef, 1024);

julia> lrng = local_rng(); drng = Random.default_rng(); typeof(drng)
TaskLocalRNG

julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 129; memory estimate: 0 bytes; allocs estimate: 0
ns

 (736.4 - 743.5]  ██████████████████████████████ 9571
 (743.5 - 750.5]  ▏5
 (750.5 - 757.5]  ▏2
 (757.5 - 764.6]  ▉257
 (764.6 - 771.6]  ▎72
 (771.6 - 778.7]  ▏18
 (778.7 - 785.7]  ▏1
 (785.7 - 792.7]  ▏10
 (792.7 - 799.8]  ▏29
 (799.8 - 806.8]  ▏2
 (806.8 - 813.8]  ▏1
 (813.8 - 820.9]   0
 (820.9 - 827.9]  ▏8
 (827.9 - 835.0]  ▏14
 (835.0 - 937.0]  ▏10

                  Counts

min: 736.434 ns (0.00% GC); mean: 739.276 ns (0.00% GC); median: 738.047 ns (0.00% GC); max: 937.016 ns (0.00% GC).

julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 155; memory estimate: 0 bytes; allocs estimate: 0
ns

 (673.9 - 679.8]  ██████████████████████████████ 9530
 (679.8 - 685.7]  ▏12
 (685.7 - 691.6]  ▏15
 (691.6 - 697.5]  ▉262
 (697.5 - 703.4]  ▍84
 (703.4 - 709.3]  ▏17
 (709.3 - 715.2]  ▏2
 (715.2 - 721.1]  ▏16
 (721.1 - 727.0]  ▏20
 (727.0 - 732.9]  ▏7
 (732.9 - 738.8]  ▏2
 (738.8 - 744.7]  ▏1
 (744.7 - 750.6]  ▏3
 (750.6 - 756.5]  ▏19
 (756.5 - 797.6]  ▏10

                  Counts

min: 673.923 ns (0.00% GC); mean: 676.059 ns (0.00% GC); median: 674.729 ns (0.00% GC); max: 797.581 ns (0.00% GC).

julia> @benchmark randn!($lrng, $x)
samples: 10000; evals/sample: 8; memory estimate: 0 bytes; allocs estimate: 0
ns

 (3000.0 - 3100.0]  ██████████████████████████████ 9923
 (3100.0 - 3200.0]  ▏2
 (3200.0 - 3300.0]  ▏2
 (3300.0 - 3390.0]  ▏18
 (3390.0 - 3490.0]  ▏14
 (3490.0 - 3590.0]  ▏12
 (3590.0 - 3690.0]  ▏1
 (3690.0 - 3790.0]  ▏2
 (3790.0 - 3890.0]  ▏5
 (3890.0 - 3990.0]  ▏2
 (3990.0 - 4080.0]  ▏1
 (4080.0 - 4180.0]  ▏1
 (4180.0 - 4280.0]  ▏1
 (4280.0 - 4380.0]  ▏6
 (4380.0 - 4990.0]  ▏10

                  Counts

min: 3.000 μs (0.00% GC); mean: 3.023 μs (0.00% GC); median: 3.016 μs (0.00% GC); max: 4.989 μs (0.00% GC).

julia> @benchmark randn!($drng, $x)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1617.0 - 1667.0]  ▎18
 (1667.0 - 1717.0]  ████▏579
 (1717.0 - 1767.0]  ████████████████████████▏3450
 (1767.0 - 1817.0]  ██████████████████████████████4296
 (1817.0 - 1867.0]  ██████████▌1494
 (1867.0 - 1917.0]  █131
 (1917.0 - 1967.0]  ▏3
 (1967.0 - 2017.0]   0
 (2017.0 - 2067.0]   0
 (2067.0 - 2117.0]  ▏2
 (2117.0 - 2167.0]  ▏7
 (2167.0 - 2217.0]  ▏5
 (2217.0 - 2267.0]  ▏3
 (2267.0 - 2317.0]  ▏2
 (2317.0 - 2950.0]  ▏10

                  Counts

min: 1.617 μs (0.00% GC); mean: 1.778 μs (0.00% GC); median: 1.775 μs (0.00% GC); max: 2.950 μs (0.00% GC).

julia> versioninfo()
Julia Version 1.7.0-DEV.1302
Commit f03832028a* (2021-06-10 20:51 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.5.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

I need to tune this library to work better on architectures without good SIMD performance. For example, I should probably only use Box-Muller for normal variables on CPUs with >= 64-byte vectors, and use the (scalar) Ziggurat otherwise. But even the default task-local RNG is faster than VectorizedRNG.local_rng() on the M1, so I need to make adjustments there, too.
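
For reference, the Box-Muller transform mentioned above maps two uniforms to two independent standard normals with straight-line math (log, sqrt, sincos), which is what makes it SIMD-friendly, unlike the branchy Ziggurat. A hedged illustration of the transform itself, not the package's actual implementation:

# u1, u2 ~ Uniform(0,1]; returns two independent N(0,1) draws.
function boxmuller(u1::Float64, u2::Float64)
    r = sqrt(-2.0 * log(u1))      # radius from the first uniform
    s, c = sincos(2.0 * pi * u2)  # angle from the second
    return r * s, r * c
end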

And while there's SIMD in Julia too, and I believe you did some of the work, is it mostly copy-paste from this code or substantially different?

Chethega and Jeff Bezanson did all the real work on SIMD there. I think they're "substantially different". In VectorizedRNG, the canonical state is SIMD, with separate streams per vector lane. This reduces the overhead when starting vector-mode generation, but adds to the overhead when initializing. VectorizedRNG's state is initialized when the package is loaded. I believe the task-local RNG's is initialized on a per-task basis instead. Threaded code is likely to spawn many times more tasks than you have threads, meaning many more initializations for the default RNG, so reducing their cost is important.

chriselrod commented 3 years ago

But you can see the advantage of lower overhead on starting SIMD generation by sampling shorter vectors. The M1 (ARM NEON):

julia> x = Vector{Float64}(undef, 128);

julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 964; memory estimate: 0 bytes; allocs estimate: 0
ns

 (94.1  - 95.0 ]  ██████████████████████████████ 9630
 (95.0  - 95.9 ]  ▏4
 (95.9  - 96.9 ]  ▏1
 (96.9  - 97.8 ]  ▊211
 (97.8  - 98.7 ]  ▎75
 (98.7  - 99.7 ]  ▏15
 (99.7  - 100.6]   0
 (100.6 - 101.5]  ▏6
 (101.5 - 102.5]  ▏27
 (102.5 - 103.4]  ▏4
 (103.4 - 104.3]   0
 (104.3 - 105.3]   0
 (105.3 - 106.2]  ▏2
 (106.2 - 107.1]  ▏16
 (107.1 - 109.4]  ▏9

                  Counts

min: 94.052 ns (0.00% GC); mean: 94.444 ns (0.00% GC); median: 94.269 ns (0.00% GC); max: 109.354 ns (0.00% GC).

julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 930; memory estimate: 0 bytes; allocs estimate: 0
ns

 (111.1 - 112.1]  ██████████████████████████████ 9513
 (112.1 - 113.1]  ▏5
 (113.1 - 114.0]  ▏1
 (114.0 - 115.0]  █278
 (115.0 - 116.0]  ▎56
 (116.0 - 116.9]  ▎53
 (116.9 - 117.9]   0
 (117.9 - 118.9]  ▏17
 (118.9 - 119.8]  ▏37
 (119.8 - 120.8]  ▏7
 (120.8 - 121.8]   0
 (121.8 - 122.7]  ▏2
 (122.7 - 123.7]  ▏3
 (123.7 - 124.7]  ▏18
 (124.7 - 127.6]  ▏10

                  Counts

min: 111.111 ns (0.00% GC); mean: 111.490 ns (0.00% GC); median: 111.246 ns (0.00% GC); max: 127.643 ns (0.00% GC).

julia> x = Vector{Float64}(undef, 127);

julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 965; memory estimate: 0 bytes; allocs estimate: 0
ns

 (94.2  - 95.2 ]  ██████████████████████████████ 9615
 (95.2  - 96.1 ]  ▏5
 (96.1  - 97.0 ]  ▏2
 (97.0  - 98.0 ]  ▊209
 (98.0  - 98.9 ]  ▎80
 (98.9  - 99.9 ]  ▏10
 (99.9  - 100.8]  ▏2
 (100.8 - 101.7]  ▏10
 (101.7 - 102.7]  ▏24
 (102.7 - 103.6]  ▏6
 (103.6 - 104.6]  ▏2
 (104.6 - 105.5]  ▏1
 (105.5 - 106.4]  ▏2
 (106.4 - 107.4]  ▏23
 (107.4 - 123.1]  ▏9

                  Counts

min: 94.213 ns (0.00% GC); mean: 94.581 ns (0.00% GC); median: 94.388 ns (0.00% GC); max: 123.057 ns (0.00% GC).

julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 923; memory estimate: 0 bytes; allocs estimate: 0
ns

 (113.5 - 114.5]  ██████████████████████████████ 9531
 (114.5 - 115.5]  ▏1
 (115.5 - 116.5]  ▏2
 (116.5 - 117.5]  █279
 (117.5 - 118.5]  ▎74
 (118.5 - 119.5]  ▏34
 (119.5 - 120.5]  ▏2
 (120.5 - 121.5]  ▏8
 (121.5 - 122.5]  ▏28
 (122.5 - 123.5]  ▏7
 (123.5 - 124.5]  ▏1
 (124.5 - 125.5]  ▏4
 (125.5 - 126.5]  ▏6
 (126.5 - 127.5]  ▏13
 (127.5 - 130.1]  ▏10

                  Counts

min: 113.533 ns (0.00% GC); mean: 113.985 ns (0.00% GC); median: 113.759 ns (0.00% GC); max: 130.101 ns (0.00% GC).

The 10980XE (AVX512):

julia> x = Vector{Float64}(undef, 128);

julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 996; memory estimate: 0 bytes; allocs estimate: 0
ns

 (23.42 - 23.61]  ██████████████████████████████ 9748
 (23.61 - 23.81]  ▏12
 (23.81 - 24.0 ]   0
 (24.0  - 24.19]   0
 (24.19 - 24.38]   0
 (24.38 - 24.57]  ▋167
 (24.57 - 24.77]  ▏40
 (24.77 - 24.96]  ▏6
 (24.96 - 25.15]  ▏7
 (25.15 - 25.34]  ▏6
 (25.34 - 25.53]  ▏1
 (25.53 - 25.73]  ▏1
 (25.73 - 25.92]  ▏1
 (25.92 - 26.11]  ▏1
 (26.11 - 58.54]  ▏10

                  Counts

min: 23.422 ns (0.00% GC); mean: 23.502 ns (0.00% GC); median: 23.449 ns (0.00% GC); max: 58.538 ns (0.00% GC).

julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 968; memory estimate: 0 bytes; allocs estimate: 0
ns

 (77.63 - 78.03 ]  ▏21
 (78.03 - 78.43 ]  ▏38
 (78.43 - 78.83 ]  ██████████████████████████████ 9131
 (78.83 - 79.23 ]  ▏32
 (79.23 - 79.63 ]  ▏6
 (79.63 - 80.04 ]  ██▍692
 (80.04 - 80.44 ]  ▎49
 (80.44 - 80.84 ]  ▏16
 (80.84 - 81.24 ]  ▏2
 (81.24 - 81.64 ]   0
 (81.64 - 82.04 ]  ▏1
 (82.04 - 82.44 ]   0
 (82.44 - 82.85 ]  ▏1
 (82.85 - 83.25 ]  ▏1
 (83.25 - 113.52]  ▏10

                  Counts

min: 77.626 ns (0.00% GC); mean: 78.804 ns (0.00% GC); median: 78.710 ns (0.00% GC); max: 113.522 ns (0.00% GC).

julia> x = Vector{Float64}(undef, 127);

julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 996; memory estimate: 0 bytes; allocs estimate: 0
ns

 (23.21 - 23.37]  ██████████████████████████████ 9733
 (23.37 - 23.52]  ▏19
 (23.52 - 23.68]   0
 (23.68 - 23.83]   0
 (23.83 - 23.99]   0
 (23.99 - 24.14]   0
 (24.14 - 24.29]  ▏3
 (24.29 - 24.45]  ▌162
 (24.45 - 24.6 ]  ▎48
 (24.6  - 24.76]  ▏16
 (24.76 - 24.91]  ▏1
 (24.91 - 25.07]  ▏6
 (25.07 - 25.22]  ▏1
 (25.22 - 25.37]  ▏1
 (25.37 - 57.93]  ▏10

                  Counts

min: 23.214 ns (0.00% GC); mean: 23.361 ns (0.00% GC); median: 23.315 ns (0.00% GC); max: 57.930 ns (0.00% GC).

julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 957; memory estimate: 0 bytes; allocs estimate: 0
ns

 (89.22 - 89.65 ]  █████▎1355
 (89.65 - 90.08 ]  ██████████████████████████████ 7761
 (90.08 - 90.51 ]  ▏12
 (90.51 - 90.95 ]  ██▏526
 (90.95 - 91.38 ]  █▏286
 (91.38 - 91.81 ]  ▎45
 (91.81 - 92.24 ]  ▏2
 (92.24 - 92.67 ]   0
 (92.67 - 93.1  ]  ▏1
 (93.1  - 93.53 ]  ▏1
 (93.53 - 93.96 ]   0
 (93.96 - 94.39 ]   0
 (94.39 - 94.82 ]   0
 (94.82 - 95.25 ]  ▏1
 (95.25 - 123.91]  ▏10

                  Counts

min: 89.222 ns (0.00% GC); mean: 89.881 ns (0.00% GC); median: 89.765 ns (0.00% GC); max: 123.913 ns (0.00% GC).