JuliaSIMD / VectorizedRNG.jl

Vectorized uniform and normal random samplers.
MIT License

Other distributions? #12

Open · pcjentsch opened this issue 3 years ago

pcjentsch commented 3 years ago

I have noticed that VectorizedRNG supports only uniform and normal samples. Are there plans to add other distributions? In particular, Poisson random variates would be very useful.

If I am able to figure out the VectorizedRNG code, I would like to try to contribute some samplers as well. Would this repository be the place to put those?

chriselrod commented 3 years ago

I don't have plans, but I should at least add SIMD exponential sampling. Note that you may have to change your algorithm for SIMD sampling.

For example, the ziggurat algorithm is the most popular method for drawing normal samples, but here I use the Box-Muller method because it is easier to write a SIMD version.
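For a sense of why: Box-Muller is pure straight-line math, so each step maps directly onto vector lanes, while the ziggurat's per-sample rejection branch does not. A minimal scalar sketch (illustration only, not this package's implementation):

```julia
# Scalar Box-Muller sketch (illustration only, not VectorizedRNG's code):
# maps two uniforms on (0,1) to two independent standard normals using only
# branch-free math ops (log, sqrt, sincos), which vectorize cleanly.
function box_muller(u1, u2)
    r = sqrt(-2 * log(u1))
    s, c = sincos(2π * u2)
    return r * c, r * s
end
```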

If worst comes to worst, you can generate uniformly distributed random numbers with SIMD (either UInt64 over all 2^64 possibilities, or Float64 over (0,1)), and then use these to produce Poisson samples or samples from other distributions. With this approach, you should already be able to sample from distributions defined in Distributions.jl:

julia> using Distributions, VectorizedRNG

julia> rand(local_rng(), Poisson(11.0))
13

julia> rand(local_rng(), Poisson(11.0))
16

julia> @benchmark rand(local_rng(), Poisson(11.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     291.111 ns (0.00% GC)
  median time:      299.355 ns (0.00% GC)
  mean time:        301.583 ns (0.00% GC)
  maximum time:     669.811 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     296

julia> @benchmark rand(Poisson(11.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     291.471 ns (0.00% GC)
  median time:      301.865 ns (0.00% GC)
  mean time:        303.235 ns (0.00% GC)
  maximum time:     441.989 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     274

But if most of the time is spent in the serial conversion of uniform samples into Poisson samples, you may not benefit much from faster uniform generation.

julia> @benchmark rand()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.127 ns (0.00% GC)
  median time:      4.654 ns (0.00% GC)
  mean time:        4.870 ns (0.00% GC)
  maximum time:     8.713 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rand(local_rng())
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.668 ns (0.00% GC)
  median time:      1.769 ns (0.00% GC)
  mean time:        1.795 ns (0.00% GC)
  maximum time:     6.398 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

That said, I haven't looked at how the Poisson sampler is implemented, so I'm not sure how many uniform samples it typically draws per Poisson sample.
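For reference, the textbook method (Knuth's) multiplies uniforms until the running product drops below exp(-λ), consuming on average about λ + 1 uniforms per sample. A hedged sketch of that method, which is not necessarily what Distributions.jl actually uses:

```julia
using Random

# Knuth's product-of-uniforms Poisson sampler (sketch): multiply uniforms
# until the product falls below exp(-λ); the number of draws before that
# point is Poisson(λ). Expected uniforms consumed per sample: λ + 1.
function knuth_poisson(rng::AbstractRNG, λ)
    L = exp(-λ)
    k = 0
    p = 1.0
    while true
        p *= rand(rng)
        p <= L && return k
        k += 1
    end
end
```

So for λ = 11, each Poisson draw already consumes about a dozen uniforms in a serial loop, which limits how much faster uniform generation alone can help.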

I'd be happy to accept contributions and to answer any questions.

pcjentsch commented 3 years ago

I see, thanks!

It does seem to be a bit faster for large vectors of samples, at least.

julia> @benchmark rand(local_rng(), Poisson(11.0), 5000)
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     12.430 μs (0.00% GC)
  median time:      16.900 μs (0.00% GC)
  mean time:        16.990 μs (0.00% GC)
  maximum time:     26.290 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark rand(Poisson(11.0), 5000)
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     17.260 μs (0.00% GC)
  median time:      21.450 μs (0.00% GC)
  mean time:        21.522 μs (0.00% GC)
  maximum time:     33.540 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Poisson, and the other distributions I am interested in (e.g. geometric), are often implemented in terms of exponential sampling, so that would be the first step anyway, I think. Thank you for the response!
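For example, once exponential sampling is vectorized, a geometric variate is just the floor of a rescaled exponential. A scalar sketch of that reduction (my own illustration, not existing package code):

```julia
# Sketch: Geometric(p) (number of failures before the first success) from an
# Exponential(1) draw E: floor(E / (-log(1 - p))) is Geometric(p)-distributed.
# The same per-lane math would work on SIMD vectors of exponential draws.
geometric_from_exp(E, p) = floor(Int, E / (-log1p(-p)))
```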

pcjentsch commented 3 years ago

Ah, DiscreteNonParametric currently does not work, since it uses the Sampler interface.

julia> rand(local_rng(), DiscreteNonParametric(1:100,fill(1,100)./100),100)
ERROR: ArgumentError: Sampler for this object is not defined
Stacktrace:
  [1] Random.Sampler(#unused#::Type{VectorizedRNG.Xoshift{2}}, sp::Random.SamplerType{UInt64}, #unused#::Val{1})
    @ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:145
  [2] Random.Sampler(rng::VectorizedRNG.Xoshift{2}, x::Random.SamplerType{UInt64}, r::Val{1})
    @ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:139
  [3] rand(rng::VectorizedRNG.Xoshift{2}, X::Random.SamplerType{UInt64})
    @ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:253
  [4] rand(rng::VectorizedRNG.Xoshift{2}, ::Type{UInt64})
    @ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:256
  [5] rand(rng::VectorizedRNG.Xoshift{2}, sp::Random.SamplerRangeNDL{UInt64, Int64})
    @ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/generation.jl:332
  [6] rand(rng::VectorizedRNG.Xoshift{2}, X::UnitRange{Int64})
    @ Random /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:253
  [7] rand(rng::VectorizedRNG.Xoshift{2}, s::Distributions.AliasTable)
    @ Distributions ~/.julia/packages/Distributions/Xrm9e/src/samplers/aliastable.jl:17
  [8] rand(rng::VectorizedRNG.Xoshift{2}, s::Distributions.DiscreteNonParametricSampler{Int64, UnitRange{Int64}, Distributions.AliasTable})
    @ Distributions ~/.julia/packages/Distributions/Xrm9e/src/samplers/discretenonparametric.jl:22
  [9] rand!
    @ ~/.julia/packages/Distributions/Xrm9e/src/univariates.jl:167 [inlined]
 [10] rand(rng::VectorizedRNG.Xoshift{2}, s::DiscreteNonParametric{Int64, Float64, UnitRange{Int64}, Vector{Float64}}, dims::Tuple{Int64})
    @ Distributions ~/.julia/packages/Distributions/Xrm9e/src/univariates.jl:158
 [11] rand(::VectorizedRNG.Xoshift{2}, ::DiscreteNonParametric{Int64, Float64, UnitRange{Int64}, Vector{Float64}}, ::Int64)
    @ Distributions ~/.julia/packages/Distributions/Xrm9e/src/genericrand.jl:26
 [12] top-level scope
    @ REPL[7]:1

chriselrod commented 3 years ago

Ah, DiscreteNonParametric currently does not work, since it uses the Sampler interface.

I just added support for the Sampler interface.
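For context, the minimal hook the Random stdlib needs from a custom generator is a method producing native UInt64s for Random.SamplerType; range and alias-table sampling then come through generic fallbacks. A toy illustration of that hook (not VectorizedRNG's actual code, which vectorizes this):

```julia
import Random

# Toy RNG illustrating the minimal Sampler-interface hook: defining rand for
# Random.SamplerType{UInt64} lets the stdlib's generic machinery (ranges,
# collections, and hence Distributions.jl's samplers) run on top of it.
mutable struct ToyXorshift <: Random.AbstractRNG
    state::UInt64
end

function Random.rand(r::ToyXorshift, ::Random.SamplerType{UInt64})
    # xorshift64 step, for illustration only
    s = r.state
    s ⊻= s << 13
    s ⊻= s >> 7
    s ⊻= s << 17
    r.state = s
    return s
end
```

With just that method, calls like `rand(rng, 1:100)` work via the generic range samplers, which is exactly the path the stack trace above was failing on.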

Note that this won't get full performance, just like randn(local_rng()) won't get full performance, but still a solid improvement on my laptop:

julia> dnp = DiscreteNonParametric(1:100,fill(1/100,100));

julia> @benchmark rand!($dnp, $x)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.358 μs (0.00% GC)
  median time:      1.548 μs (0.00% GC)
  mean time:        1.674 μs (2.54% GC)
  maximum time:     72.847 μs (93.68% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark rand!(local_rng(), $dnp, $x)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  4
  --------------
  minimum time:     683.547 ns (0.00% GC)
  median time:      764.054 ns (0.00% GC)
  mean time:        814.725 ns (4.09% GC)
  maximum time:     5.082 μs (78.20% GC)
  --------------
  samples:          10000
  evals/sample:     148

julia> versioninfo()
Julia Version 1.7.0-DEV.1046
Commit d7c6a9b468* (2021-04-30 14:33 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

pcjentsch commented 3 years ago

Sorry for the delay. I have had several hours to dig into the JuliaSIMD group of packages, and I now have a naive and very slow exponential and Poisson sampler. These have led me to some questions; hopefully you have time to point me in the right direction!

I have the Poisson sampler below, but it allocates for some reason. I assumed this was due to a type instability, but I haven't been able to find one. Beyond that, I think the next step might be to mirror the technique of the random_normal function and use @generated to write the loops more efficiently depending on the size of the VecUnroll argument?

Is there any more documentation or examples on how VecUnroll, Unroll, and MM work? I figure that they are for loop unrolling, but I am unclear on the difference between them from looking through the VectorizationBase tests.

Thanks!

@inline function random_poisson!(state::AbstractState, ::Val{N}, ::Type{T}, λ) where {N,T}
    state, u = nextstate(state, Val{N}())
    return state, random_poisson!(u, T, λ)
end

@inline function random_poisson!(in::AbstractSIMD{W,UInt64}, ::Type{T}, λ) where {W,T}
    u = random_uniform(in, T)
    p_0 = typeof(u)(exp(-λ))
    p = p_0
    F = p
    x = typeof(u)(zero(T))
    loopmask = (u > F)
    while any(data(VectorizationBase.vany(loopmask)))
        x = VectorizationBase.vifelse(loopmask, x + 1, x)
        p = λ * p / x
        F = F + p
        loopmask = (u > F)
    end
    return x
end

function randpoisson!(
    rng::AbstractVRNG, x::AbstractArray{T}, λ, α::Number = StaticInt{0}(), β = StaticInt{0}(), γ = StaticInt{1}()
) where {T<:Union{Float32,Float64}}
    f(x,val,T) = random_poisson!(x,val,T,λ)
    random_sample_u2!(f, rng, x, α, β, γ)
end

chriselrod commented 3 years ago

I have the Poisson sampler below, but it allocates for some reason. I assumed this was due to a type instability, but I haven't been able to find one.

Add @inline to f:

function randpoisson!(
    rng::AbstractVRNG, x::AbstractArray{T}, λ, α::Number = StaticInt{0}(), β = StaticInt{0}(), γ = StaticInt{1}()
) where {T<:Union{Float32,Float64}}
    @inline f(x,val,T) = random_poisson!(x,val,T,λ)
    random_sample_u2!(f, rng, x, α, β, γ)
end

This fixed the allocations for me:

julia> y = Vector{Float64}(undef, 1024);

julia> @benchmark randpoisson!(local_rng(), $y, 6.5)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.688 μs …  26.078 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.160 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.492 μs ± 677.720 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▇█▁             ▂▁
  ▁▁▁▄████▄▂▁▂▁▂▂▂▂▂▃▅███▅▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  6.69 μs         Histogram: frequency by time        9.95 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Is there any more documentation or examples on how VecUnroll, Unroll, and MM work? I figure that they are for loop unrolling, but I am unclear on the difference between them from looking through the VectorizationBase tests.

MM is just for vectorization. You can use it to load/store a Vec, or convert it to a Vec. VecUnroll is for unrolling, e.g. a VecUnroll can hold 4 Vecs, and then operating on them will apply the same operation to each of them. Unroll can be used for loading/storing a VecUnroll, which also allows for additional optimizations in some cases (e.g. going from AoS memory to SoA in registers). Here is a simple example of using Unroll to load.