pcjentsch opened this issue 3 years ago
I don't have plans, but I should at least add SIMD exponential sampling. Note that you may have to change your algorithm for SIMD sampling.
For example, the ziggurat algorithm is the most popular method for drawing normal samples, but here I use the Box-Muller method because it is easier to write a SIMD version.
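For reference, a scalar sketch of the Box-Muller transform (illustrative only; the function name is mine, not a VectorizedRNG internal). It maps two uniforms to two independent normals with no branching or rejection loop, which is what makes it straightforward to vectorize:

```julia
# Sketch of the Box-Muller transform (illustrative; not VectorizedRNG's code).
# Two uniforms in (0,1) map to two independent standard normals, with no
# rejection step, so every operation applies uniformly across SIMD lanes.
function box_muller(u1, u2)
    r = sqrt(-2 * log(u1))        # radius: -2*log(u1) is Exponential-distributed
    θ = 2 * π * u2                # uniformly distributed angle
    return r * cos(θ), r * sin(θ)
end
```

Contrast this with ziggurat, whose rare rejection/fallback branch diverges across lanes and makes a SIMD version much more fiddly.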
Worst comes to worst, you can generate uniformly distributed random numbers (either UInt64 over the 2^64 possibilities, or Float64 over (0,1)) with SIMD, and then use these to produce Poisson samples or other distributions. With this approach, you should already be able to sample from distributions defined in Distributions.jl:
julia> using Distributions, VectorizedRNG
julia> rand(local_rng(), Poisson(11.0))
13
julia> rand(local_rng(), Poisson(11.0))
16
julia> @benchmark rand(local_rng(), Poisson(11.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 291.111 ns (0.00% GC)
median time: 299.355 ns (0.00% GC)
mean time: 301.583 ns (0.00% GC)
maximum time: 669.811 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 296
julia> @benchmark rand(Poisson(11.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 291.471 ns (0.00% GC)
median time: 301.865 ns (0.00% GC)
mean time: 303.235 ns (0.00% GC)
maximum time: 441.989 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 274
But if most of the time is spent in the serial conversion of uniform -> Poisson samples, you may not benefit much from faster uniform generation.
julia> @benchmark rand()
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.127 ns (0.00% GC)
median time: 4.654 ns (0.00% GC)
mean time: 4.870 ns (0.00% GC)
maximum time: 8.713 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark rand(local_rng())
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.668 ns (0.00% GC)
median time: 1.769 ns (0.00% GC)
mean time: 1.795 ns (0.00% GC)
maximum time: 6.398 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
I haven't looked at how the Poisson sampler is implemented, though, so I'm not sure how many uniform samples it typically draws per Poisson sample.
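For what it's worth, the classic scalar approach (Knuth's multiplicative method; a sketch for illustration, not necessarily what Distributions.jl actually uses) consumes roughly λ + 1 uniforms per Poisson sample, so for λ = 11 the cost of uniform generation is multiplied by about a dozen:

```julia
using Random

# Knuth's multiplicative Poisson sampler (illustrative sketch).
# Counts how many uniforms can be multiplied together before the running
# product drops below exp(-λ); expects about λ + 1 uniform draws per sample.
function knuth_poisson(rng::AbstractRNG, λ)
    L = exp(-λ)
    k = 0
    p = 1.0
    while true
        p *= rand(rng)        # one uniform consumed per iteration
        p ≤ L && return k
        k += 1
    end
end
```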
I'd be happy to accept contributions and to answer any questions.
I see, thanks!
It does seem to be a bit faster for large vectors of samples, at least.
julia> @benchmark rand(local_rng(), Poisson(11.0), 5000)
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 12.430 μs (0.00% GC)
median time: 16.900 μs (0.00% GC)
mean time: 16.990 μs (0.00% GC)
maximum time: 26.290 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark rand(Poisson(11.0), 5000)
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 17.260 μs (0.00% GC)
median time: 21.450 μs (0.00% GC)
mean time: 21.522 μs (0.00% GC)
maximum time: 33.540 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Poisson, and the other distributions I am interested in (e.g. geometric), are implemented in terms of exponential sampling, so that would be the first step anyway, I think. Thank you for the response!
Ah, DiscreteNonParametric currently does not work, since it uses the Sampler interface.
julia> rand(local_rng(), DiscreteNonParametric(1:100,fill(1,100)./100),100)
ERROR: ArgumentError: Sampler for this object is not defined
Stacktrace:
[1] Random.Sampler(#unused#::Type{VectorizedRNG.Xoshift{2}}, sp::Random.SamplerType{UInt64}, #unused#::Val{1})
@ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:145
[2] Random.Sampler(rng::VectorizedRNG.Xoshift{2}, x::Random.SamplerType{UInt64}, r::Val{1})
@ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:139
[3] rand(rng::VectorizedRNG.Xoshift{2}, X::Random.SamplerType{UInt64})
@ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:253
[4] rand(rng::VectorizedRNG.Xoshift{2}, ::Type{UInt64})
@ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:256
[5] rand(rng::VectorizedRNG.Xoshift{2}, sp::Random.SamplerRangeNDL{UInt64, Int64})
@ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/generation.jl:332
[6] rand(rng::VectorizedRNG.Xoshift{2}, X::UnitRange{Int64})
@ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:253
[7] rand(rng::VectorizedRNG.Xoshift{2}, s::Distributions.AliasTable)
@ Distributions ~/.julia/packages/Distributions/Xrm9e/src/samplers/aliastable.jl:17
[8] rand(rng::VectorizedRNG.Xoshift{2}, s::Distributions.DiscreteNonParametricSampler{Int64, UnitRange{Int64}, Distributions.AliasTable})
@ Distributions ~/.julia/packages/Distributions/Xrm9e/src/samplers/discretenonparametric.jl:22
[9] rand!
@ ~/.julia/packages/Distributions/Xrm9e/src/univariates.jl:167 [inlined]
[10] rand(rng::VectorizedRNG.Xoshift{2}, s::DiscreteNonParametric{Int64, Float64, UnitRange{Int64}, Vector{Float64}}, dims::Tuple{Int64})
@ Distributions ~/.julia/packages/Distributions/Xrm9e/src/univariates.jl:158
[11] rand(::VectorizedRNG.Xoshift{2}, ::DiscreteNonParametric{Int64, Float64, UnitRange{Int64}, Vector{Float64}}, ::Int64)
@ Distributions ~/.julia/packages/Distributions/Xrm9e/src/genericrand.jl:26
[12] top-level scope
@ REPL[7]:1
Ah, DiscreteNonParametric currently does not work, since it uses the Sampler interface.
I just added support for the Sampler interface.
Note that this won't get full performance, just like randn(local_rng()) won't get full performance, but it is still a solid improvement on my laptop:
julia> dnp = DiscreteNonParametric(1:100,fill(1/100,100));
julia> @benchmark rand!($dnp, $x)
BenchmarkTools.Trial:
memory estimate: 3.50 KiB
allocs estimate: 4
--------------
minimum time: 1.358 μs (0.00% GC)
median time: 1.548 μs (0.00% GC)
mean time: 1.674 μs (2.54% GC)
maximum time: 72.847 μs (93.68% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark rand!(local_rng(), $dnp, $x)
BenchmarkTools.Trial:
memory estimate: 3.50 KiB
allocs estimate: 4
--------------
minimum time: 683.547 ns (0.00% GC)
median time: 764.054 ns (0.00% GC)
mean time: 814.725 ns (4.09% GC)
maximum time: 5.082 μs (78.20% GC)
--------------
samples: 10000
evals/sample: 148
julia> versioninfo()
Julia Version 1.7.0-DEV.1046
Commit d7c6a9b468* (2021-04-30 14:33 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
JULIA_NUM_THREADS = 8
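For anyone hitting the same ArgumentError with a custom RNG: per the Random stdlib's "Hooking into the Random API" docs, the generic Sampler machinery bottoms out at native integer generation. A hedged sketch of the hookup, using a made-up toy RNG rather than VectorizedRNG's actual code, looks roughly like this:

```julia
using Random

# Toy RNG for illustration only; the name and algorithm are made up and
# this is not how VectorizedRNG implements it. Defining rand for
# SamplerType{UInt64} gives Random's generic Sampler machinery, and hence
# Distributions.jl fallbacks like AliasTable, something to bottom out on.
mutable struct TinyXorshift <: AbstractRNG
    state::UInt64
end

function Random.rand(r::TinyXorshift, ::Random.SamplerType{UInt64})
    x = r.state
    # xorshift64 step
    x ⊻= x << 13
    x ⊻= x >> 7
    x ⊻= x << 17
    r.state = x
    return x
end
```

With this one method defined, generic samplers such as ranges work through the default fallbacks, e.g. `rand(TinyXorshift(0x9e3779b97f4a7c15), 1:100)`.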
Sorry for the delay. I have had several hours to dig into the JuliaSIMD group of packages and now have a naive and very slow exponential and Poisson sampler. These have led me to some questions; hopefully you have time to point me in the right direction!
I have the Poisson sampler below, but it allocates for some reason. I assumed this was due to a type instability, but I haven't been able to find one. Beyond that, I think the next step might be to mirror the technique of the random_normal function and use @generated to write the loops more efficiently depending on the size of the VecUnroll argument?
Is there any more documentation, or are there examples, of how VecUnroll, Unroll, and MM work? I figure that they are for loop unrolling, but I am unclear on the difference between them from looking through the VectorizationBase tests.
Thanks!
@inline function random_poisson!(state::AbstractState, ::Val{N}, ::Type{T}, λ) where {N,T}
    state, u = nextstate(state, Val{N}())
    return state, random_poisson!(u, T, λ)
end
@inline function random_poisson!(in::AbstractSIMD{W,UInt64}, ::Type{T}, λ) where {W,T}
    u = random_uniform(in, T)
    p_0 = typeof(u)(exp(-1 * λ))    # P(X = 0)
    p = p_0
    F = p                            # running CDF
    x = typeof(u)(zero(T))
    loopmask = (u > F)
    while any(data(VectorizationBase.vany(loopmask)))
        x = VectorizationBase.vifelse(loopmask, x + 1, x)
        p = λ * p / x
        F = F + p
        loopmask = (u > F)
    end
    return x
end
function randpoisson!(
    rng::AbstractVRNG, x::AbstractArray{T}, λ, α::Number = StaticInt{0}(), β = StaticInt{0}(), γ = StaticInt{1}()
) where {T<:Union{Float32,Float64}}
    f(x, val, T) = random_poisson!(x, val, T, λ)
    random_sample_u2!(f, rng, x, α, β, γ)
end
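For comparison, the same inverse-transform recursion in scalar form (an illustrative sketch of the math, not part of the package): starting from P(X = 0) = exp(-λ), each step updates the pmf via p_x = λ·p_{x-1}/x and accumulates the CDF until it exceeds the uniform draw.

```julia
# Scalar inverse-transform Poisson sampling (illustrative sketch).
# Walk the Poisson CDF term by term until it exceeds the uniform u;
# the SIMD version does the same walk with a per-lane mask.
function poisson_inverse(u::Float64, λ::Float64)
    p = exp(-λ)       # P(X = 0)
    F = p             # running CDF
    x = 0
    while u > F
        x += 1
        p *= λ / x    # P(X = x) from P(X = x - 1)
        F += p
    end
    return x
end
```

One consequence for the SIMD version: the loop must keep iterating until every lane's CDF has passed its uniform, so the iteration count is driven by the slowest lane, i.e. roughly the maximum of W such walks.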
I have the poisson sampler below, but it allocates for some reason. I assumed this was due to a type instability, but haven't been able to find one.
Add @inline to f:
function randpoisson!(
    rng::AbstractVRNG, x::AbstractArray{T}, λ, α::Number = StaticInt{0}(), β = StaticInt{0}(), γ = StaticInt{1}()
) where {T<:Union{Float32,Float64}}
    @inline f(x, val, T) = random_poisson!(x, val, T, λ)
    random_sample_u2!(f, rng, x, α, β, γ)
end
This fixed the allocations for me:
julia> y = Vector{Float64}(undef, 1024);
julia> @benchmark randpoisson!(local_rng(), $y, 6.5)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.688 μs … 26.078 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.160 μs ┊ GC (median): 0.00%
Time (mean ± σ): 7.492 μs ± 677.720 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▇█▁ ▂▁
▁▁▁▄████▄▂▁▂▁▂▂▂▂▂▃▅███▅▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
6.69 μs Histogram: frequency by time 9.95 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Is there any more documentation or examples on how VecUnroll, Unroll, and MM work? I figure that they are for loop unrolling, but I am unclear on the difference between them from looking through the VectorizationBase tests.
MM is just for vectorization. You can use it to load/store a Vec, or convert it to a Vec.
VecUnroll is for unrolling, e.g. a VecUnroll can hold 4 Vecs, and then operating on it will apply the same operation to each of them. Unroll can be used for loading/storing a VecUnroll, which also allows for additional optimizations in some cases (e.g. going from AoS memory to SoA in registers).
Here is a simple example of using Unroll to load.
I have noticed that VectorizedRNG supports only uniform and normal samples. Are there plans to add other distributions? In particular, Poisson random variates would be very useful.
If I am able to figure out the VectorizedRNG code, I would like to try to contribute some samplers as well. Would this repository be the place to put those?