JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License
742 stars 66 forks source link

How to handle user defined functions? #55

Open zsoerenm opened 4 years ago

zsoerenm commented 4 years ago

I get the following error

ERROR: MethodError: no method matching calc_A(::VectorizationBase.SVec{16,Int16})

where calc_A is a user defined function. Do I need to implement calc_A(::VectorizationBase.SVec{16,Int16}) myself? Why isn't there an error for the other user defined functions? My first guess was because they are inlined, but calc_A should be inlined, too.

Here is the code for that:

@inline function calc_A(x::Int16)
    p = 15; r = 1; A = Int16(23170); C = Int16(-425)
    n = 7
    a = 7
    x² = (x * x) >> n
    rounding = one(x) << (p - a - 1)
    (A + rounding + (x² * C) >> r) >> (p - a)
end
@inline function calc_B(x::Int16)
    p = 14; r = 3; B = Int16(-17790); D = Int16(351)
    n = 7
    a = 7
    x² = (x * x) >> n
    rounding = one(x) << (p - a - 1)
    (rounding + x * (B + (x² * D) >> r) >> n) >> (p - a)
end
@inline function get_first_bit_sign(x)
    n = 7
    mysign(x << (sizeof(x) * 8 - n - 1))
end
@inline function get_second_bit_sign(x)
    n = 7
    mysign(x << (sizeof(x) * 8 - n - 2))
end
@inline function get_quarter_angle(x)
    n = 7
    x & (one(x) << n - one(x)) - one(x) << (n - 1)
end
@inline mysign(x) = 2 * signbit(x) - 1

function gen_sincos!(sins, coss, phases)
    @avx for i = 1:2500
        first_bit_sign = get_first_bit_sign(phases[i])
        second_bit_sign = get_second_bit_sign(phases[i])
        quarter_angle = get_quarter_angle(phases[i])
        A = calc_A(quarter_angle)
        B = calc_B(quarter_angle)

        coss[i] = second_bit_sign * (first_bit_sign * A + B)
        sins[i] = second_bit_sign * (A - first_bit_sign * B)
    end
end
sins = Vector{Int16}(undef, 2500)
coss = Vector{Int16}(undef, 2500)
phases = Vector{Int16}(undef, 2500)
gen_sincos!(sins, coss, phases)

EDIT: Alright I found the solution myself: If I write function calc_A(x) instead of function calc_A(x::Int16), it will work. So should I omit the declaration of x? What if I have a different function for x::Int32?

chriselrod commented 4 years ago

LoopVectorization evaluates your code using SVecs. An SVec{W,T} contains W elements of type T. Most operations on SVec are defined using SIMD instructions. It's basically how LoopVectorization says "apply single instructions to operate on all of these numbers simultaneously". It puts the Vectorization in LoopVectorization.

But, as a limitation, that means any functions you call will have to accept SVec arguments. I don't know at the moment how I can work around that. Maybe some trick with Cassette is possible, maybe it can make SVec{<:Any,T} dispatch on ::T methods.

Another limitation is that LoopVectorization's optimizer only understands a few functions. If anything reasonably generic is missing, please let me know and I can add it. While functions you define will work, LoopVectorization cannot look inside them while it makes decisions about how it will evaluate your code. Currently, it assumes those functions are expensive, and that thus doing anything creative will probably not be profitable. Functions like exp, log, and sin are expensive enough for that to be the case (even though I did include them).

That said, you can manually tell it what to do, eg, unroll by 2 (anything 1-4 will probably be pretty good, above that probably wont be worth it; above 8 will probably be bad):

function gen_sincos!(sins, coss, phases)
    @avx unroll=2 for i = 1:2500
        first_bit_sign = get_first_bit_sign(phases[i])
        second_bit_sign = get_second_bit_sign(phases[i])
        quarter_angle = get_quarter_angle(phases[i])
        A = calc_A(quarter_angle)
        B = calc_B(quarter_angle)

        coss[i] = second_bit_sign * (first_bit_sign * A + B)
        sins[i] = second_bit_sign * (A - first_bit_sign * B)
    end
end

Because it is assuming each of these functions is expensive, it should be deciding on unroll=1. So it'll probably be worth experimenting with at least unroll=2 and unroll=4, to see if they improve performance. I saw a roughly 15% performance increase from unroll=4.

It would also definitely be worth looking into if there is some way to figure out the cost of user-defined functions.

If manually specifying the unroll (or tile=(3,4) when there's more than 1 loop) in code without any user defined functions gives you a performance boost, feel free to file an issue, and I'll see if I can improve the algorithm. Currently, it's pretty naive for simple loops without reductions (like this one), and I'm sure it could be made a lot better.

EDIT: Alright I found the solution myself: If I write function calc_A(x) instead of function calc_A(x::Int16), it will work. So should I omit the declaration of x? What if I have a different function for x::Int32?

You can write:

using LoopVectorization: SVec
function calc_A(x::Union{Int16,SVec{<:Any,Int16}})
   ...
end
function calc_A(x::Union{Int32,SVec{<:Any,Int32}})
   ...
end

or, if you're making a lot of definitions, it may be cleaner to define type aliases (const unions):

using LoopVectorization: SVec
const VInt16 = Union{Int16,SVec{<:Any,Int16}}
const VInt32 = Union{Int32,SVec{<:Any,Int32}}
const VInt64 = Union{Int64,SVec{<:Any,Int64}}
function calc_A(x::VInt32)
   ...
end
function calc_B(x::VInt32)
   ...
end
function calc_C(x::VInt32)
   ...
end

Perhaps I should define these type aliases myself? What would be the best name? V[scalar type]?

zsoerenm commented 4 years ago

Thank you for detailed insights!

Perhaps I should define these type aliases myself? What would be the best name? V[scalar type]?

Sounds like a good option.