JuliaRandom / RandomNumbers.jl

Random Number Generators for the Julia Language.
https://juliarandom.github.io/RandomNumbers.jl/stable

FYI: Xoroshiro128Plus 2.9 times slower in real-world app (and Julia's default only 1.9 times slower) #62

Open PallHaraldsson opened 5 years ago

PallHaraldsson commented 5 years ago

I took the fastest version of fasta at the Debian Benchmark Game, I think it was that one (I can send you my full version): https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-julia-4.html

I thought I could make it faster with your RNG (yes, not allowed by the rules, just testing).

I changed and added the code below, but it got slower!

time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 fasta_xor.jl 250000 >/dev/null

I thought Xoroshiro (maybe not this variant?) was one of the fastest RNGs. I'll check others, but thought I should let you know, since it's even slower than Julia's default, not just slower than the non-standard 18-bit RNG in the program I'm trying to improve. And if I add the modulus (`%`) to the other RNGs (needed to get the same range, since the original code uses it), they get even slower. That modulus limits the numbers to 18 bits (at first I thought 16 bits, so I initially tried a UInt16 replacement).

const IM = UInt32(139968)
const IA = UInt32(3877)
const IC = UInt32(29573)
const last_rnd = Ref(UInt32(42))
#gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)

using RandomNumbers.Xorshifts
r2 = Xoroshiro128Plus(0x1234567890abcdef)  # with a certain seed. Note that the seed must be non-zero.
# gen_random() = UInt32(rand(r2, UInt16)) #  % IM) # also slower with UInt16 there, and I think I reported for that, not the even slower:
gen_random() = UInt32(rand(r2, UInt32)) #  % IM)
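For reference, here is a self-contained, runnable version of the benchmark program's own LCG (the commented-out line above); the `% IM` with IM = 139968 is what caps outputs below 18 bits, since 2^17 = 131072 < 139968 < 262144 = 2^18:

```julia
const IM = UInt32(139968)        # modulus: outputs lie in 0:IM-1, i.e. below 2^18
const IA = UInt32(3877)          # multiplier
const IC = UInt32(29573)         # increment
const last_rnd = Ref(UInt32(42)) # state, seeded with 42

# The benchmark game's LCG: one multiply, add, and mod per number.
# (IM - 1) * IA + IC fits comfortably in UInt32, so nothing overflows.
gen_random() = (last_rnd[] = (last_rnd[] * IA + IC) % IM)
```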

Julia 1.1.1 doesn't change much, nor does -O2 vs. -O3.

Possibly this is because of my very old laptop (would be nice if you checked):

julia> versioninfo()
Julia Version 1.4.0-DEV.17
Commit 8ebe5643ca (2019-08-20 08:32 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, penryn)
Environment:
  JULIA_NUM_THREADS = 1

If this isn't just a problem with my laptop, it could be inlining-related. And if the RNG in that program is simply faster, then maybe it could be an option for your library, with a clear WARNING to users (even though it's probably a flawed RNG that would fail BigCrush).

Note, `using RandomNumbers.Xorshifts` itself is noticeably slow to load, and you might think I'm not taking that into account, but running for a minute-plus (with a larger number on the command line when calling the program) doesn't help.
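One way to separate one-time costs (package loading, JIT compilation) from steady-state RNG throughput is to time a warm second run of a pure generation loop with the stdlib `@elapsed`. This is a methodological sketch, not code from the thread, and it uses the stdlib `MersenneTwister` so it runs without RandomNumbers installed:

```julia
using Random

# Sum n random UInt32s so the loop can't be optimized away entirely.
function churn(rng, n)
    s = UInt32(0)
    for _ in 1:n
        s += rand(rng, UInt32)   # wraps around on overflow, which is fine here
    end
    return s
end

rng = MersenneTwister(42)
churn(rng, 1)                    # first call pays the compilation cost
t = @elapsed churn(rng, 10_000_000)
println("warm time for 10^7 draws: ", t, " s")
```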

sunoru commented 5 years ago

Thanks for finding that! It is so weird... I'm sorry, I've been busy in real life recently and may not have time to look into this... Maybe you could check out the lowered LLVM/asm code? And maybe also compare it with the other RNGs in Xorshifts?

PallHaraldsson commented 5 years ago

Yes, I may look into this, though not right away.

PallHaraldsson commented 4 years ago

I figured out what was wrong.

First confirming it's really fast:

julia> r2 = Xoroshiro128Plus(0x1234567890abcdef)  # with a certain seed. Note that the seed must be non-zero.
Xoroshiro128Plus(0xe7eb72d97b4beac6, 0x9b86d56534ba1f9e)

julia> @btime rand($r2, UInt32)  # a pitfall is skipping the $ with @btime
  3.154 ns (0 allocations: 0 bytes)   # this is 2.25x faster than with Base.

I was doing this (and there it's not possible to add the `$`):

gen_random() = UInt32(rand(r2, UInt32)) #  % IM)

and then: 

julia> @btime gen_random()
  255.992 ns (1 allocation: 16 bytes)

This should have been the same, since a "cast" from a type to the same type is a no-op:

julia> gen_random() = rand(r2, UInt32)
gen_random (generic function with 1 method)

julia> @btime gen_random()
  78.096 ns (0 allocations: 15 bytes)  # note e.g. 15 vs 16 bytes

While the version without the cast is much faster, it's still nowhere near as fast as it should be.

julia> gen_random() = rand(r2, UInt64)
gen_random (generic function with 1 method)

# I think code_native should be immune to those issues (and it doesn't work without wrapping in a function), and
# still the code is not as tight as I would like, so I think your code could be improved a bit:

While @code_native shows almost the same code for UInt64 and UInt32 (the latter has an extra "leal (%rdx,%rcx), %eax"),
both versions seem to have very tight code (earlier I had incorrectly quoted the wrong code).
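The wrapping trick mentioned above can be sketched like this (using the stdlib `MersenneTwister` and `InteractiveUtils` so it runs without RandomNumbers installed; the same pattern applies to `Xoroshiro128Plus`):

```julia
using Random, InteractiveUtils

const rng = MersenneTwister(42)
draw() = rand(rng, UInt32)   # wrap in a zero-argument function so codegen can specialize

draw()                       # run once to compile
@code_native draw()          # inspect the generated machine code
```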

julia> @benchmark rand($r2, UInt64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.147 ns (0.00% GC)
  median time:      3.293 ns (0.00% GC)
  mean time:        3.388 ns (0.00% GC)
  maximum time:     21.765 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rand($r2, UInt32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.157 ns (0.00% GC)
  median time:      3.260 ns (0.00% GC)
  mean time:        3.417 ns (0.00% GC)
  maximum time:     24.740 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

sunoru commented 4 years ago

Thanks! Are any lines missing from the post? What does the fix look like?
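An editorial note for readers landing here, offered under stated assumptions rather than as the thread's confirmed resolution: the symptoms above (a boxed return value, ~16 bytes allocated per call, a large slowdown that disappears with `$`-interpolation) match Julia's classic non-const-global pitfall — `r2 = Xoroshiro128Plus(...)` is an untyped global, so `gen_random()` cannot be type-inferred. A minimal sketch of two possible fixes, using the stdlib `MersenneTwister` so it runs without RandomNumbers installed:

```julia
using Random

rng_slow = MersenneTwister(42)             # non-const global: type-unstable callers
gen_random_slow() = rand(rng_slow, UInt32) # dynamic dispatch, boxes the result

# Fix 1: declare the global const so its type is known at compile time.
const rng_fast = MersenneTwister(42)
gen_random_fast() = rand(rng_fast, UInt32)

# Fix 2: pass the RNG as an argument; the method specializes on its type.
gen_random(r::AbstractRNG) = rand(r, UInt32)

gen_random_fast(); gen_random(rng_fast)    # warm up (compile) before measuring
@assert (@allocated gen_random_fast()) == 0  # type-stable: no boxing
```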