Open · hofmannmartin opened 4 years ago

Does LoopVectorization support complex vectors?
No, not yet. Sometime after #17 we could add complex number support. For now, I'd follow the dealing-with-structs approach from the README. Note that StructArrays will be faster than the eventual support for arrays of structs.
+1 for complex number support. It's also an opportunity to be faster than C compiled with gcc/clang which does a bad job of vectorizing complex arithmetic. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79102, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79336, https://bugs.llvm.org/show_bug.cgi?id=31800, https://bugs.llvm.org/show_bug.cgi?id=31677
I managed to get it working:
using LoopVectorization, StructArrays

function test_avx!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    @avx for i ∈ 1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end

function test_avx!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re, A.im, k)
end
and the performance is actually slightly faster than my earlier implementation, which uses VectorizationBase and SIMDPirates directly to handle complex arrays. This is not surprising, since in that case the complex multiplication requires shuffling the numbers in the registers.
Investigating the assembly, I found that @avx does not seem to unroll the loop.
L80:
vmovups (%rbx,%rsi,4), %ymm4
vmovups (%rax,%rsi,4), %ymm5
vmulps %ymm5, %ymm2, %ymm6
vfmadd231ps %ymm4, %ymm3, %ymm6
vaddps (%r15,%rsi,4), %ymm6, %ymm6
vmulps %ymm5, %ymm3, %ymm5
vfmsub231ps %ymm4, %ymm2, %ymm5
vaddps (%rdx,%rsi,4), %ymm5, %ymm4
vmovups %ymm6, (%r15,%rsi,4)
vmovups %ymm4, (%rdx,%rsi,4)
addq $8, %rsi
cmpq %rdi, %rsi
jl L80
Is this intentional?
+1 for complex number support. It's also an opportunity to be faster than C compiled with gcc/clang which does a bad job of vectorizing complex arithmetic.
Speaking of which, what I find especially interesting is that the native Julia (version 1.3) implementation
function test!(A, x, k, beta)
    @simd for n = 1:size(A,2)
        @inbounds x[n] += beta*conj(A[k,n])
    end
end
produces quite different assembly depending on the input types. For ComplexF64 it does not vectorize at all:
L96:
vmovsd -8(%rcx,%rdx), %xmm3 # xmm3 = mem[0],zero
vmovsd (%rcx,%rdx), %xmm4 # xmm4 = mem[0],zero
vxorpd %xmm2, %xmm4, %xmm4
vunpcklpd %xmm4, %xmm3, %xmm5 # xmm5 = xmm3[0],xmm4[0]
vmulpd %xmm0, %xmm5, %xmm5
vunpcklpd %xmm3, %xmm4, %xmm3 # xmm3 = xmm4[0],xmm3[0]
vmulpd %xmm3, %xmm1, %xmm3
vaddsubpd %xmm3, %xmm5, %xmm3
vaddpd (%rax,%rdx), %xmm3, %xmm3
vmovupd %xmm3, (%rax,%rdx)
addq $1, %rsi
addq $16, %rdx
cmpq %rdi, %rsi
jb L96
whereas it completely screws up with ComplexF32:
L240:
vpsllq $3, %ymm5, %ymm9
vmovq %rax, %xmm2
vpbroadcastq %xmm2, %ymm2
vpaddq %ymm9, %ymm2, %ymm9
vpsllq $3, %ymm4, %ymm11
vpaddq %ymm11, %ymm2, %ymm2
vpcmpeqd %xmm3, %xmm3, %xmm3
vgatherqps %xmm3, (,%ymm2), %xmm1
vpcmpeqd %xmm3, %xmm3, %xmm3
vgatherqps %xmm3, (,%ymm9), %xmm11
vpcmpeqd %xmm3, %xmm3, %xmm3
vmovdqu 96(%rsp), %ymm0
vpaddq %ymm4, %ymm0, %ymm14
vpaddq %ymm5, %ymm0, %ymm15
vpaddq %ymm13, %ymm9, %ymm9
vpsllq $3, %ymm15, %ymm15
vmovq %r10, %xmm10
vpbroadcastq %xmm10, %ymm10
vpaddq %ymm15, %ymm10, %ymm15
vpsllq $3, %ymm14, %ymm14
vpaddq %ymm14, %ymm10, %ymm10
vpaddq %ymm13, %ymm2, %ymm2
vpcmpeqd %xmm6, %xmm6, %xmm6
vgatherqps %xmm6, (,%ymm10), %xmm7
vpcmpeqd %xmm6, %xmm6, %xmm6
vgatherqps %xmm6, (,%ymm15), %xmm14
vpaddq %ymm13, %ymm10, %ymm6
vgatherqps %xmm3, (,%ymm2), %xmm0
vpcmpeqd %xmm2, %xmm2, %xmm2
vgatherqps %xmm2, (,%ymm6), %xmm3
vpcmpeqd %xmm2, %xmm2, %xmm2
vgatherqps %xmm2, (,%ymm9), %xmm6
vinsertf128 $1, %xmm1, %ymm11, %ymm2
vinsertf128 $1, %xmm7, %ymm14, %ymm7
vpaddq %ymm13, %ymm15, %ymm10
vpcmpeqd %xmm1, %xmm1, %xmm1
vinsertf128 $1, %xmm0, %ymm6, %ymm0
vgatherqps %xmm1, (,%ymm10), %xmm6
vinsertf128 $1, %xmm3, %ymm6, %ymm1
vmovups 64(%rsp), %ymm10
vmulps %ymm10, %ymm7, %ymm3
vmovups 32(%rsp), %ymm11
vmulps %ymm1, %ymm11, %ymm6
vaddps %ymm6, %ymm3, %ymm3
vmulps %ymm1, %ymm10, %ymm1
vmulps %ymm11, %ymm7, %ymm6
vaddps %ymm3, %ymm2, %ymm2
vsubps %ymm1, %ymm6, %ymm1
vaddps %ymm1, %ymm0, %ymm0
vmovq %xmm9, %rdx
vunpckhps %xmm0, %xmm2, %xmm1 # xmm1 = xmm2[2],xmm0[2],xmm2[3],xmm0[3]
vunpcklps %xmm0, %xmm2, %xmm3 # xmm3 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
vinsertf128 $1, %xmm1, %ymm3, %ymm1
vextractf128 $1, %ymm0, %xmm0
vextractf128 $1, %ymm2, %xmm2
vunpckhps %xmm0, %xmm2, %xmm3 # xmm3 = xmm2[2],xmm0[2],xmm2[3],xmm0[3]
vunpcklps %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
vmovups %ymm1, -4(%rdx)
vinsertf128 $1, %xmm3, %ymm0, %ymm0
vmovdqu (%rsp), %ymm1
vpaddq %ymm1, %ymm4, %ymm4
vpaddq %ymm1, %ymm5, %ymm5
addq $-8, %rsi
vmovups %ymm0, 28(%rdx)
jne L240
The complex support that will be added will be worse than using StructArrays, so you won't do better than the test_avx! implementation (barring other improvements in the library). Depending on architecture, it might end up producing code like the completely screwed up ComplexF32 example.
Did you try test!, using those StructArrays as arguments?
You noted that @avx being faster wasn't surprising because the StructArrays save you from any shuffling. LLVM will also take advantage of that to produce much nicer code. It might do just as well as @avx.
(If it does better than @avx, file an issue, and I'll try and find out why / how to improve.)
Also, to be pedantic, the ComplexF64 code with @simd did actually vectorize:
vunpcklpd %xmm4, %xmm3, %xmm5 # xmm5 = xmm3[0],xmm4[0]
vmulpd %xmm0, %xmm5, %xmm5
vunpcklpd %xmm3, %xmm4, %xmm3 # xmm3 = xmm4[0],xmm3[0]
vmulpd %xmm3, %xmm1, %xmm3
vaddsubpd %xmm3, %xmm5, %xmm3
vaddpd (%rax,%rdx), %xmm3, %xmm3
vmovupd %xmm3, (%rax,%rdx)
The suffix on all of these instructions is pd. This means "packed double", ie, packed double-precision vectors. If it were not vectorized at all, the suffix would be sd for "scalar double" (eg, it would use vmulsd instead of vmulpd).
However, the xmm registers are only 128 bits, meaning these are operations on packed vectors of 2 Float64. This is roughly twice as good as sd operations, but only half as good as it would have been had the registers been ymm.
LLVM probably decided all the shuffling it would take (like you see in the ComplexF32 code) would not have been worth it, and would've made the code slower.
Investigating the assembly I found that @avx does not seem to unroll the loop. Is this intentional?
It is intentional.
I normally introspect via print debugging with Revise. Obviously not a good idea for anyone not familiar with the code, and I'd prefer most people to be on tagged releases rather than dev-ing the library. So I'll add more functions for introspection, and document them when I write documentation.
For now, you can at least do this to see what it decides (although not why):
using LoopVectorization
testq = :(for i ∈ 1:length(xre)
    xre[i] += βre*Are[i,k] + βim*Aim[i,k]
    xim[i] += βre*Aim[i,k] - βim*Are[i,k]
end)
lstest = LoopVectorization.LoopSet(testq);
LoopVectorization.choose_order(lstest)
# ([:i], :i, 1, -1)
The four returned values mean:
1. The order it'll nest the loops, from outer-most to inner-most.
2. The loop from which it'll load contiguous SIMD vectors.
3. The unroll factor: 1, as you noted.
4. The tiling factor. If `-1`, it won't.

As for why it won't unroll, I have it tuned to be fairly conservative when there aren't any loop-carried dependencies.* When a loop iteration depends on the previous one, but things are more or less associative (eg, a dot product), you can unroll to have several independent dependency chains that can be evaluated in parallel. However, if you don't have any such dependency chains, out-of-order execution shouldn't need extra help in speculatively evaluating multiple loop iterations in parallel.
*This is a heuristic that could use tuning. Currently, if there are no loop carried dependencies, it will check if there is only a single loop (as in your case) and return 1 if so. Otherwise, it will use:
min(4, round(Int, (compute_rt + load_rt + 1) / compute_rt))
Where compute_rt and load_rt are the estimated summed reciprocal throughputs of "compute" and "load" instructions. Optionally greater than 1, because there is a cost to traversing strides.
There are certainly much smarter and more principled ways this can be done.
Thanks @lesshaste, I'll check out those bug reports.
Did you try test!, using those StructArrays as arguments? You noted that @avx being faster wasn't surprising because the StructArrays save you from any shuffling. LLVM will also take advantage of that to produce much nicer code. It might do just as well as @avx. (If it does better than @avx, file an issue, and I'll try and find out why / how to improve.)
Good point. I tried it out using only the structs:
function test_struct!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    for i ∈ 1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end
function test_struct!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_struct!(x.re, x.im, β.re, β.im, A.re, A.im, k)
end
and in fact LLVM does produce exactly the same low-level code, so @avx is actually not needed here.
Also, to be pedantic, the ComplexF64 code with @simd did actually vectorize:
True, it did vectorize, but quite suboptimally.
For now, you can at least do this to see what it decides (although not why):
using LoopVectorization
testq = :(for i ∈ 1:length(xre)
    xre[i] += βre*Are[i,k] + βim*Aim[i,k]
    xim[i] += βre*Aim[i,k] - βim*Are[i,k]
end)
lstest = LoopVectorization.LoopSet(testq);
LoopVectorization.choose_order(lstest)
# ([:i], :i, 1, -1)
The four returned values mean:
1. The order it'll nest the loops, from outer-most to inner-most.
2. The loop from which it'll load contiguous SIMD vectors.
3. The unroll factor: 1, as you noted.
4. The tiling factor. If `-1`, it won't.
As for why it won't unroll, I have it tuned to be fairly conservative when there aren't any loop-carried dependencies.* When a loop iteration depends on the previous one, but things are more or less associative (eg, a dot product), you can unroll to have several independent dependency chains that can be evaluated in parallel. However, if you don't have any such dependency chains, out-of-order execution shouldn't need extra help in speculatively evaluating multiple loop iterations in parallel.
*This is a heuristic that could use tuning. Currently, if there are no loop carried dependencies, it will check if there is only a single loop (as in your case) and return 1 if so. Otherwise, it will use:
min(4, round(Int, (compute_rt + load_rt + 1) / compute_rt))
Where compute_rt and load_rt are the estimated summed reciprocal throughputs of "compute" and "load" instructions. Optionally greater than 1, because there is a cost to traversing strides. There are certainly much smarter and more principled ways this can be done.
Thank you for this insight. I was just asking since I found that my earlier implementation, which uses VectorizationBase and SIMDPirates on Complex arrays, did profit from loop unrolling (over 20% performance increase).
L304:
vmovups (%rcx,%rsi), %ymm2
vmovups (%rdx,%rsi), %ymm3
vfmadd231ps %ymm0, %ymm2, %ymm3
vpermilps $177, %ymm2, %ymm2 # ymm2 = ymm2[1,0,3,2,5,4,7,6]
vfmadd213ps %ymm3, %ymm1, %ymm2
vmovups %ymm2, (%rdx,%rsi)
addq $32, %rsi
addq $-1, %rbp
jne L304
vs
L272:
vmovups -96(%rdi,%rsi), %ymm2
vmovups -64(%rdi,%rsi), %ymm3
vmovups -32(%rdi,%rsi), %ymm4
vmovups (%rdi,%rsi), %ymm5
vmovups -96(%rdx,%rsi), %ymm6
vfmadd231ps %ymm1, %ymm2, %ymm6
vpermilps $177, %ymm2, %ymm2 # ymm2 = ymm2[1,0,3,2,5,4,7,6]
vfmadd213ps %ymm6, %ymm0, %ymm2
vmovups %ymm2, -96(%rdx,%rsi)
vmovups -64(%rdx,%rsi), %ymm2
vfmadd231ps %ymm1, %ymm3, %ymm2
vpermilps $177, %ymm3, %ymm3 # ymm3 = ymm3[1,0,3,2,5,4,7,6]
vfmadd213ps %ymm2, %ymm0, %ymm3
vmovups %ymm3, -64(%rdx,%rsi)
vmovups -32(%rdx,%rsi), %ymm2
vfmadd231ps %ymm1, %ymm4, %ymm2
vpermilps $177, %ymm4, %ymm3 # ymm3 = ymm4[1,0,3,2,5,4,7,6]
vfmadd213ps %ymm2, %ymm0, %ymm3
vmovups %ymm3, -32(%rdx,%rsi)
vmovups (%rdx,%rsi), %ymm2
vfmadd231ps %ymm1, %ymm5, %ymm2
vpermilps $177, %ymm5, %ymm3 # ymm3 = ymm5[1,0,3,2,5,4,7,6]
vfmadd213ps %ymm2, %ymm0, %ymm3
vmovups %ymm3, (%rdx,%rsi)
subq $-128, %rsi
addq $-1, %rcx
jne L272
Would it be possible to provide a method to override the heuristic for manual fine-tuning?
Disappointing, I really expected this to be good enough:
function test!(x, a, beta)
    @inbounds @simd ivdep for n = 1:length(a)
        x[n] += beta * conj(a[n])
    end
end
N = 1024;
T = ComplexF32;
x = rand(T, N);
a = rand(T, N);
beta = rand(T);
@code_native debuginfo=:none test!(x, a, beta) # not vectorized

using StructArrays
xsoa = StructArray(x);
asoa = StructArray(a);
@code_native debuginfo=:none test!(xsoa, asoa, beta) # also not vectorized =(
You shouldn't have to do things manually.
did profit from loop unrolling (over 20% performance increase).
Interesting. Sounds like I'll have to change the heuristic. I'll add an option for overriding the heuristic. That would make it a lot easier to try different things, and thus discover when the heuristic needs work. Ideally it's made as robust as possible to save everyone's time from having to fiddle with settings. One of my motivations behind this library was to automate the fiddling I had been doing.
Tiling shouldn't be overridden for code other people run, because the optimal values are platform dependent.
Disappointing, I really expected this to be good enough:
Would have been nice.
Interesting. Sounds like I'll have to change the heuristic.
It was actually a post on the Julia Discourse and the OpenBLAS cdot kernel that had me try unrolling the loop.
This may be a stupid question, but can the out-of-order execution work across such a loop?
L304:
vmovups (%rcx,%rsi), %ymm2
vmovups (%rdx,%rsi), %ymm3
vfmadd231ps %ymm0, %ymm2, %ymm3
vpermilps $177, %ymm2, %ymm2 # ymm2 = ymm2[1,0,3,2,5,4,7,6]
vfmadd213ps %ymm3, %ymm1, %ymm2
vmovups %ymm2, (%rdx,%rsi)
addq $32, %rsi
addq $-1, %rbp
jne L304
Will it not be blocked, since in each iteration values are loaded into the ymm2 and ymm3 registers, so it can only proceed once vmovups %ymm2, (%rdx,%rsi) and vfmadd213ps %ymm3, %ymm1, %ymm2, respectively, have been executed? That would at least explain the gain I saw.
I'll add an option for overriding the heuristic. That would make it a lot easier to try different things, and thus discover when the heuristic needs work. Ideally it's made as robust as possible to save everyone's time from having to fiddle with settings. One of my motivations behind this library was to automate the fiddling I had been doing.
That would be perfect for testing. Final usage of the package is a different story; there you do not want to fiddle around with these settings.
I added the ability to manually specify the unroll factor in v0.3.8, via specifying unroll = U or tile = (U,T), where U and T are integer literals (ie, they can't be symbols).
can the out-of-order execution work across such a loop?
This technique is used to eliminate false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them. The elimination of these false data dependencies reveals more instruction-level parallelism in an instruction stream, which can be exploited by various and complementary techniques such as superscalar and out-of-order execution for better performance.
Seems as though it is not enough. There is also still a dependency chain in the loop counter.
The difference isn't close to as extreme as when we have real loop carried dependencies:
julia> using BenchmarkTools, LoopVectorization

julia> function selfdotu1(x)
           s = zero(eltype(x))
           @avx unroll=1 for i ∈ eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu1 (generic function with 1 method)

julia> function selfdotu2(x)
           s = zero(eltype(x))
           @avx unroll=2 for i ∈ eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu2 (generic function with 1 method)

julia> function selfdotu4(x)
           s = zero(eltype(x))
           @avx unroll=4 for i ∈ eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu4 (generic function with 1 method)

julia> function selfdotu8(x)
           s = zero(eltype(x))
           @avx unroll=8 for i ∈ eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu8 (generic function with 1 method)

julia> x = rand(1024);

julia> @btime selfdotu1($x)
  138.725 ns (0 allocations: 0 bytes)
341.62521511329

julia> @btime selfdotu2($x)
  81.118 ns (0 allocations: 0 bytes)
341.6252151132899

julia> @btime selfdotu4($x)
  57.550 ns (0 allocations: 0 bytes)
341.6252151132899

julia> @btime selfdotu8($x)
  45.087 ns (0 allocations: 0 bytes)
341.6252151132899
but even 10% is nothing to scoff at, so I replaced the heuristic with something that makes your example unroll by 4.
I added the ability to manually specify the unroll factor in v0.3.8, via specifying unroll = U or tile = (U,T), where U and T are integer literals (ie, they can't be symbols).
You are fast. Thank you.
but even 10% is nothing to scoff at, so I replaced the heuristic with something that makes your example unroll by 4.
Overall, if I compare the native Julia implementation working on the complex arrays with the one working on structs boosted by @avx (which now unrolls), I get a speedup of 12. This is awesome.
I added the ability to manually specify the unroll factor in v0.3.8, via specifying unroll = U or tile = (U,T), where U and T are integer literals (ie, they can't be symbols).

You are fast. Thank you.

but even 10% is nothing to scoff at, so I replaced the heuristic with something that makes your example unroll by 4.

Overall, if I compare the native Julia implementation working on the complex arrays with the one working on structs boosted by @avx (which now unrolls), I get a speedup of 12. This is awesome.
Great to see this comparison and the unrolling. To clarify: is the factor of 12 achieved with something like unroll=8 applied to test_avx!?
i.e.
function test_avx!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    @avx unroll=8 for i ∈ 1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end
function test_avx!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re, A.im, k)
end
Does it matter if arrays are three-dimensional?
unroll=8 should be unnecessary.
It should work for multidimensional arrays.
Hi All,
Note that there is another possibility for using complex numbers with LoopVectorization: instead of having separate arrays containing the real and imaginary parts, you can just reinterpret the complex array x with
y = reinterpret(real(eltype(x)), x)
which could be made more convenient by using
y = reshape(y, 2, size(x)...)
on it. For example, the real part is then y[1, ...]. I haven't checked the performance of this, but it runs fast enough for my use case, so I wanted to point it out here :)
Felix
You are right. This is also how I solved my problem in the end, using reinterpret and hand-crafting a kernel for my function
function test!(x::Vector, y::Vector, beta)
    for n = 1:length(x)
        x[n] += conj(beta)*y[n]
    end
end
Initially however, my question was aiming towards a solution, which does not require a hand crafted kernel.
which could be made more convenient by using
Note that as of Julia 1.6, you can use reinterpret(reshape, real(eltype(x)), x) instead, to combine it into one step.
Doing this should yield slightly better performance than doing it in two steps, as this way LoopVectorization/VectorizationBase will know exactly what stride(y, 2) equals at compile time, allowing it to generate a better sequence of shuffle instructions to do the AoS<->SoA conversion.
You can see a few examples here: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/test/shuffleloadstores.jl
my question was aiming towards a solution, which does not require a hand crafted kernel.
I am very slowly rewriting LoopVectorization. The rewrite will be able to handle this automatically, i.e. you'll be able to use Complex numbers directly.