PallHaraldsson opened this issue 5 years ago
Thanks for finding that! It is so weird... I'm sorry, I've been busy in real life recently and may not have time to look into this... Maybe you could help check the lowered LLVM/asm code? And maybe also try to compare it with the other RNGs in Xorshifts?
Yes, I may look into this, though not right away.
I figured out what was wrong.
First, confirming it's really fast:
julia> r2 = Xoroshiro128Plus(0x1234567890abcdef) # with a certain seed. Note that the seed must be non-zero.
Xoroshiro128Plus(0xe7eb72d97b4beac6, 0x9b86d56534ba1f9e)
julia> @btime rand($r2, UInt32) # a pitfall is skipping the $ with @btime
3.154 ns (0 allocations: 0 bytes) # this is 2.25x faster than with Base.
I was doing this (and here it's not possible to add the $):
gen_random() = UInt32(rand(r2, UInt32)) # % IM)
and then:
julia> @btime gen_random()
255.992 ns (1 allocation: 16 bytes)
This should have performed the same as the following (the "cast" from a type to the same type is a no-op):
julia> gen_random() = rand(r2, UInt32)
gen_random (generic function with 1 method)
julia> @btime gen_random()
78.096 ns (0 allocations: 15 bytes) # note e.g. 15 vs 16 bytes
While the version without the cast is much faster, it is still nowhere near as fast as it should be.
julia> gen_random() = rand(r2, UInt64)
gen_random (generic function with 1 method)
# I think code_native should be immune to those issues (and it doesn't work without wrapping in a function), and
# still the code is not as tight as I would like, so I think your code could be fixed a bit:
While @code_native shows almost the same code for UInt64 and UInt32 (the latter has "leal (%rdx,%rcx), %eax"),
both versions seem to have very tight code (earlier I had incorrectly quoted the wrong code).
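For reference, roughly how I compare the two (as noted, the call has to be wrapped in a small function for @code_native to have something concrete to compile; f32/f64 are just names for this sketch):
using RandomNumbers.Xorshifts
r2 = Xoroshiro128Plus(0x1234567890abcdef)
f32(r) = rand(r, UInt32)  # wrapper so @code_native sees a concrete call
f64(r) = rand(r, UInt64)
@code_native f32(r2)
@code_native f64(r2)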
julia> @benchmark rand($r2, UInt64)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.147 ns (0.00% GC)
median time: 3.293 ns (0.00% GC)
mean time: 3.388 ns (0.00% GC)
maximum time: 21.765 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark rand($r2, UInt32)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.157 ns (0.00% GC)
median time: 3.260 ns (0.00% GC)
mean time: 3.417 ns (0.00% GC)
maximum time: 24.740 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
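For what it's worth, the leftover ~78 ns in the wrapped gen_random() is most likely the usual non-const-global penalty (r2 is an untyped global, so the call inside the function can't be inferred) rather than anything in the RNG itself. A minimal sketch of the two usual workarounds, assuming nothing beyond RandomNumbers and BenchmarkTools:
using RandomNumbers.Xorshifts
using BenchmarkTools
const rng = Xoroshiro128Plus(0x1234567890abcdef)  # const global: type is known to the compiler
gen_random() = rand(rng, UInt32)
gen_random(r::Xoroshiro128Plus) = rand(r, UInt32)  # or pass the generator explicitly
@btime gen_random()       # should be close to the interpolated ~3 ns timing
@btime gen_random($rng)   # same, without relying on any global
Either way, the generator code itself looks fine; the slowdown was in how I was benchmarking it.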
Thanks! Are any lines missing from the post? What does the fix look like?
I took the fastest version of fasta at the Debian Benchmark Game; I think it was this one (I can send you my full version): https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-julia-4.html
I thought I could make it faster with your RNG (yes, not allowed by the rules, just testing).
I changed and added the code below, but it got slower!
time ~/julia-1.4.0-DEV-8ebe5643ca/bin/julia -O3 fasta_xor.jl 250000 >/dev/null
I thought Xoroshiro (maybe not this variant?) was one of the fastest RNGs. I'll check the others, but I thought I should let you know, since it's even slower than Julia's default, not just slower than the non-standard 18-bit RNG in the program I'm trying to find a better generator for. And if I add the modulus (%) to the other RNGs (needed to get the same range as the original code uses), they only get slower still; that modulus is what limits the numbers to 18 bits (at first I thought it was 16 bits, so I was first trying such a replacement).
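The kind of change I mean is roughly this (a sketch only, not my exact diff; IM is the modulus constant from the program's own linear-congruential generator, and the value and seed below are assumptions for illustration):
using RandomNumbers.Xorshifts
const IM = 139968  # assumed: the modulus used by the original fasta LCG
const rng = Xoroshiro128Plus(0x1234567890abcdef)
# Drop-in for the program's gen_random: % IM keeps the 0:(IM-1) range
# that the rest of the fasta code expects.
gen_random_im() = rand(rng, UInt32) % IM
That is the kind of replacement that came out slower than the original generator.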
Julia 1.1.1 doesn't change much, nor does -O2 vs. -O3.
Possibly this is just because of my very old laptop (it would be nice if you could check):
julia> versioninfo()
Julia Version 1.4.0-DEV.17
Commit 8ebe5643ca (2019-08-20 08:32 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, penryn)
Environment:
  JULIA_NUM_THREADS = 1
If this isn't just a problem with my laptop, could it be inlining related? And if the RNG in that program is simply faster, then maybe it (even though it's probably a flawed RNG, broken by RNGCrush?) could be an option for your library, with a clear WARNING to users.
Note: using RandomNumbers.Xorshifts is noticeably slow, and you might think I'm not taking that into account, but running for a minute+ (with a larger number on the command line when calling the program) doesn't help.
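To separate the RNG itself from the rest of the fasta code (and from the slow using), a tiny standalone throughput loop like this could be compared across machines (just a sketch; the draw count and seed are arbitrary):
using RandomNumbers.Xorshifts
# Sum the draws so the loop isn't optimized away; this measures only RNG throughput.
function rng_throughput(n)
    r = Xoroshiro128Plus(0x1234567890abcdef)
    s = UInt32(0)
    for _ in 1:n
        s += rand(r, UInt32)
    end
    return s
end
rng_throughput(10)          # warm-up / compilation
@time rng_throughput(10^8)  # wall-clock time for 10^8 draws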