JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License
744 stars 67 forks source link

cis (and Float64 -> ComplexF64) support? #131

Open platawiec opened 4 years ago

platawiec commented 4 years ago

Some vectorized math libraries (i. e. Intel VML) have a specialized cis routine, although I notice SLEEF doesn't.

Is there any possible benefit to adding cis support? In the "Future Work" section of the docs, plans are mentioned to support heterogeneous types in the loop and non-primitive types better. Would that be a prerequisite for including a fast cis implementation?

I also noted that if I wrote out the loop, things would at least run, but using the convenience functions gave me an error:

using LoopVectorization

x = rand(1000)
vmap(cis, x)

throws

ERROR: KeyError: key Complex{Float64} not found
chriselrod commented 4 years ago

Hi, thanks for the issue. Currently, it doesn't support multiple return values at all. That is, sincos won't work either, where the return value is Tuple{Float64,Float64}.

Supporting cis and sincos just in vmap would require a less changes than with the @avx macro.

chriselrod commented 4 years ago

On LoopVectorization 0.8.9, you can now do:

using StructArrays, LoopVectorization

function vcis!(y::StructArray{<:Complex}, x::AbstractVector)
    re = y.re; im = y.im;
    @avx for i in eachindex(re,im,x)
        im[i], re[i] = sincos(x[i])
    end
    y
end
function vcis(x::AbstractArray)
    T = float(eltype(x))
    vcis!(StructArray{Complex{T}}((similar(x, T), similar(x, T))), x)
end

On my computer, I get:

julia> x = rand(1000);

julia> vcis(x) ≈ map(cis, x)
true

julia> @benchmark vcis($x)
BenchmarkTools.Trial:
  memory estimate:  15.88 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.617 μs (0.00% GC)
  median time:      1.812 μs (0.00% GC)
  mean time:        2.231 μs (12.72% GC)
  maximum time:     119.669 μs (94.97% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark map(cis, $x)
BenchmarkTools.Trial:
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     8.442 μs (0.00% GC)
  median time:      8.738 μs (0.00% GC)
  mean time:        9.076 μs (2.44% GC)
  maximum time:     307.998 μs (93.81% GC)
  --------------
  samples:          10000
  evals/sample:     3
platawiec commented 4 years ago

Thanks, this is great! I had a look through the changes and I wanted to check my understanding. Is it accurate to say that LoopVectorization now supports multiple return values? And is this a generic change, or just in the case of sincos? Is it the same story for complex values? If I understand a bit better I'd be happy to contribute documentation

chriselrod commented 4 years ago

Thanks, this is great! I had a look through the changes and I wanted to check my understanding. Is it accurate to say that LoopVectorization now supports multiple return values? And is this a generic change, or just in the case of sincos?

Yes and yes, for example:

julia> using LoopVectorization

julia> foo(x) = (x,2x,x*x)
foo (generic function with 1 method)

julia> function sumfoo(y)
           s = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               a, b, c = foo(y[i])
               s += (c + c) / (a * b) # 2x^2 / 2x^2 = 1
           end
           s
       end
sumfoo (generic function with 1 method)

julia> y = rand(99);

julia> sumfoo(y)
99.0

There's still no support for complex values. The recommended approach for handling them is still to use StructArrays.jl, and then load from/store into the constituent re and im arrays, like in the sincos/vcis example.

Contributions to the documentation, examples, etc would all be greatly appreciated, and I'd be more than happy to answer any questions. Others are likely to have those same questions, meaning it'd be great if the answers were documented.