Open zsoerenm opened 4 years ago
LoopVectorization evaluates your code using SVecs
. An SVec{W,T}
contains W
elements of type T
. Most operations on SVec
are defined using SIMD instructions. It's basically how LoopVectorization says "apply single instructions to operate on all of these numbers simultaneously". It puts the Vec
torization in LoopVectorization.
But, as a limitation, that means any functions you call will have to accept SVec
arguments.
I don't know at the moment how I can work around that. Maybe some trick with Cassette is possible, maybe it can make SVec{<:Any,T}
dispatch on ::T
methods.
Another limitation is that LoopVectorization's optimizer only understands a few functions. If anything reasonably generic is missing, please let me know and I can add it.
While functions you define will work, LoopVectorization cannot look inside them while it makes decisions about how it will evaluate your code. Currently, it assumes those functions are expensive, and that thus doing anything creative will probably not be profitable. Functions like exp
, log
, and sin
are expensive enough for that to be the case (even though I did include them).
That said, you can manually tell it what to do, eg, unroll by 2 (anything 1-4 will probably be pretty good, above that probably wont be worth it; above 8 will probably be bad):
function gen_sincos!(sins, coss, phases)
@avx unroll=2 for i = 1:2500
first_bit_sign = get_first_bit_sign(phases[i])
second_bit_sign = get_second_bit_sign(phases[i])
quarter_angle = get_quarter_angle(phases[i])
A = calc_A(quarter_angle)
B = calc_B(quarter_angle)
coss[i] = second_bit_sign * (first_bit_sign * A + B)
sins[i] = second_bit_sign * (A - first_bit_sign * B)
end
end
Because it is assuming each of these functions is expensive, it should be deciding on unroll=1
.
So it'll probably be worth experimenting with at least unroll=2
and unroll=4
, to see if they improve performance.
I saw a roughly 15% performance increase from unroll=4
.
It would also definitely be worth looking into if there is some way to figure out the cost of user-defined functions.
If manually specifying the unroll (or tile=(3,4)
when there's more than 1 loop) in code without any user defined functions gives you a performance boost, feel free to file an issue, and I'll see if I can improve the algorithm. Currently, it's pretty naive for simple loops without reductions (like this one), and I'm sure it could be made a lot better.
EDIT: Alright I found the solution myself: If I write function calc_A(x) instead of function calc_A(x::Int16), it will work. So should I omit the declaration of x? What if I have a different function for x::Int32?
You can write:
using LoopVectorization: SVec
function calc_A(x::Union{Int16,SVec{<:Any,Int16}})
...
end
function calc_A(x::Union{Int32,SVec{<:Any,Int32}})
...
end
or, if you're making a lot of definitions, it may be cleaner to define type aliases (const
unions):
using LoopVectorization: SVec
const VInt16 = Union{Int16,SVec{<:Any,Int16}}
const VInt32 = Union{Int32,SVec{<:Any,Int32}}
const VInt64 = Union{Int64,SVec{<:Any,Int64}}
function calc_A(x::VInt32)
...
end
function calc_B(x::VInt32)
...
end
function calc_C(x::VInt32)
...
end
Perhaps I should define these type aliases myself? What would be the best name? V[scalar type]
?
Thank you for detailed insights!
Perhaps I should define these type aliases myself? What would be the best name? V[scalar type]?
Sounds like a good option.
I get the following error
where
calc_A
is a user defined function. Do I need to implementcalc_A(::VectorizationBase.SVec{16,Int16})
myself? Why isn't there an error for the other user defined functions? My first guess was because they are inlined, butcalc_A
should be inlined, too.Here is the code for that:
EDIT: Alright I found the solution myself: If I write
function calc_A(x)
instead offunction calc_A(x::Int16)
, it will work. So should I omit the declaration ofx
? What if I have a different function forx::Int32
?