Unfortunately, I am not well versed enough in LLVM IR to be able to pinpoint exactly what the problem is...
This generates very similar code and also fails to vectorize:
function tuple_simd_sum{T}(x::Vector{NTuple{4, T}})
    s = (0.0, 0.0, 0.0, 0.0)
    @inbounds @simd for i in eachindex(x)
        x_i = x[i]
        s = (s[1] + x_i[1], s[2] + x_i[2], s[3] + x_i[3], s[4] + x_i[4])
    end
    return s
end
tuple_vec = [(rand(), rand(), rand(), rand()) for i = 1:20]
@code_llvm tuple_simd_sum(tuple_vec)
If moving from NTuple to another storage type for the epsilon components would make it easier to get access to SIMD, I'd say it's worth considering.
I don't know what other storage type that would be. A vector of GradientNumber is already packed efficiently in contiguous memory, so I don't think the storage type is wrong; it is just translated to LLVM IR in a suboptimal way.
There are some patches pending to LLVM that might help https://github.com/JuliaLang/julia/pull/5355#issuecomment-176335389
It would be interesting to experiment with https://github.com/eschnett/SIMD.jl when it is mature enough.
Using NTuple storage for the epsilon components will allow us to manually leverage SIMD using llvmcall, such that Partials{N,T,NTuple{N,T}} (in an upcoming refactor, just Partials{N,T}) behaves like SIMD.jl's Vec{N,T}.
I think manually calling SIMD instructions might be the most consistent way to resolve this issue (instead of having to rely on the @simd heuristics in Base).
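For reference, a minimal sketch of what manually leveraging SIMD via llvmcall can look like (adapted from the Julia manual's VecElement/llvmcall pattern, not ForwardDiff code; Vec4 and vadd are made-up names):

# NTuple{4,VecElement{Float64}} maps to LLVM's <4 x double>, so the llvmcall
# body below compiles to a single vector fadd.
const Vec4 = NTuple{4,VecElement{Float64}}

function vadd(a::Vec4, b::Vec4)
    Base.llvmcall("""
        %3 = fadd <4 x double> %0, %1
        ret <4 x double> %3
        """, Vec4, Tuple{Vec4,Vec4}, a, b)
end

vadd(map(VecElement, (1.0, 2.0, 3.0, 4.0)),
     map(VecElement, (5.0, 6.0, 7.0, 8.0)))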
I made a small proof of concept using SIMD.jl's Vec for the dual part at: https://github.com/JuliaDiff/ForwardDiff.jl/commit/3fcd7966941e3f1046fabdfdbd414680771fb4c1. I basically threw out the whole PartialVec just to get it working as fast as possible.
Doing something like:
using ForwardDiff
using SIMD
function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end
vec = [ForwardDiff.GradientNumber{4, Float64, Vec{4, Float64}}(1.0, ForwardDiff.Partials(Vec{4, Float64}((rand(), rand(), rand(), rand())))) for i in 1:500]
@code_native simd_sum(vec)
shows a bunch of vectorized operations being used. It is, however, about twice as slow as the original code, heh. Maybe my implementation is not perfect, or SIMD.jl needs a bit more battle experience. Still pretty interesting.
I made another proof of concept of getting the operations on the partial numbers SIMD accelerated.
First I started on a package, SIMDVectors.jl: https://github.com/KristofferC/SIMDVectors.jl. It uses https://github.com/JuliaLang/julia/pull/15244 to generate vectorized operations for a type similar to a tuple.
I then made a GradientNumber and Partials that use this SIMDVector as the Partials.data field: https://gist.github.com/KristofferC/56594a9c57d6f27df769
The result is quite nicely generated SIMD code:
include("simd_fdiff.jl")
using SIMDVectors
a = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))
b = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))
@code_native a + b
.
vaddpd 16(%rdx), %xmm5, %xmm5
vaddpd 32(%rdx), %xmm4, %xmm4
vaddpd 48(%rdx), %xmm3, %xmm3
vaddpd 64(%rdx), %xmm2, %xmm2
vaddsd 80(%rdx), %xmm1, %xmm1
vaddsd (%rdx), %xmm0, %xmm0
.
julia> @code_native a * b
.
vmulpd 32(%rsi), %xmm1, %xmm10
vmulpd 48(%rsi), %xmm1, %xmm11
vmulpd 64(%rsi), %xmm1, %xmm8
Source line: 6
vmulsd 80(%rsi), %xmm0, %xmm5
Source line: 34
vmovsd (%rdx), %xmm6 # xmm6 = mem[0],zero
Source line: 8
vmovddup %xmm6, %xmm7 # xmm7 = xmm6[0,0]
vmulpd 16(%rdx), %xmm7, %xmm1
vmulpd 32(%rdx), %xmm7, %xmm2
vmulpd 48(%rdx), %xmm7, %xmm3
vmulpd 64(%rdx), %xmm7, %xmm7
Source line: 6
vmulsd 80(%rdx), %xmm6, %xmm4
Source line: 8
vaddpd %xmm1, %xmm9, %xmm1
vaddpd %xmm2, %xmm10, %xmm2
vaddpd %xmm3, %xmm11, %xmm3
vaddpd %xmm7, %xmm8, %xmm7
Source line: 6
vaddsd %xmm4, %xmm5, %xmm4
vmulsd %xmm6, %xmm0, %xmm0
.
Using SIMD.jl with the new #15244, I get this for a multiplication of two gradient numbers:
julia> @code_native a*b
.
vbroadcastsd %xmm1, %ymm2
vmulpd 32(%rsi), %ymm2, %ymm2
vbroadcastsd %xmm0, %ymm3
vmulpd 32(%rdx), %ymm3, %ymm3
vaddpd %ymm2, %ymm3, %ymm2
Source line: 65
vmulsd %xmm1, %xmm0, %xmm0
.
which uses lovely AVX 256-bit instructions. The example in the OP also runs faster.
This is so cool. I suppose now the race is on for me to wrap up #102 before JuliaLang/julia#15244 lands.
There are of course a few caveats: SIMD Vectors are limited to length 16, and SIMD.jl has no support for types other than C types. I think the best thing would be to let the user choose Vec or Tuple as the backend storage for Partials when creating the gradient function, defaulting to Tuple of course.
It shouldn't be that much code, since only simple operations are needed on Partials.
I guess that means no nested differentiation as well...
SIMD + nested differentiation is not a use case that I care too much about.
So how much faster will a multiplication of two 16-length gradient numbers be? 4x faster?
Theoretically yes, but I'm segfaulting now when I run @benchmark, so I can't actually properly test it.
Couldn't SIMDVector still be used for the "innermost" partials of a nested differentiation?
The best would be if whatever is used to hold the partials would just degenerate to a normal tuple when the type is not something that SIMD would work on, as in https://github.com/KristofferC/SIMDVectors.jl#user-defined-number-types. SIMD.jl is better than that package, though, so adding support to that would be nice.
@jrevels I was just reading about and experimenting with Julia 0.5's SIMD support. Are there any technical hurdles to implementing SIMD optimizations for the Partials type at this time?
As I think you suggested earlier, it seems like Partials{N,T} could be defined as:
immutable Partials{N,T}
    values::NTuple{N,VecElement{T}}
end
with the corresponding tupexpr functions specialized to emit LLVM SIMD opcodes for the cases where T is a bitstype.
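As a rough illustration (not the proposed tupexpr machinery itself), one such element-wise operation on the Partials sketched above could be generated like this, leaving the SLP vectorizer or an explicit llvmcall to fuse the unrolled adds into a vector instruction:

import Base: +

# Illustrative sketch only: an unrolled element-wise + for VecElement-backed
# partials. A real implementation would have tupexpr (or llvmcall) emit the
# SIMD opcode directly.
@generated function +{N,T}(a::Partials{N,T}, b::Partials{N,T})
    terms = [:(VecElement(a.values[$i].value + b.values[$i].value)) for i = 1:N]
    return :(Partials{N,T}(tuple($(terms...))))
end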
What happens when T is not a bitstype?
Note: "However, since the automatic vectorization cannot be relied upon, future use will mostly be via libraries that use llvmcall."
This is what SIMD.jl is doing.
@KristofferC Understood. Also looked at SIMD.jl.
I thought @jrevels expressed the opinion that he would prefer to insert the llvmcall code directly into ForwardDiff.jl, rather than rely on an external dependency. (At least that's how I interpreted his above comment on Feb 3.)
Is SIMD.jl mature enough to be used as a dependency at this time?
Since we do not require many fancy operations on the partials, it should be possible to just roll our own. When I experimented with this I got random crashes when trying to benchmark, but perhaps the situation has improved.
Attempting to answer my first question about this approach, i.e. what is the cost of defining the partials as NTuple{N, VecElement{T}}?
I created a branch of ForwardDiff.jl where the Partials are defined in this way (for Julia 0.5 only). No SIMD optimizations have been added (yet) -- only the storage of the partials is changed. For 1st derivatives, the performance seems to be unchanged (expected). However, I'm seeing a 2-4x slowdown when using nested Duals, and this is certainly problematic. This problem will need to be solved before it's even worth thinking about introducing the SIMD operations.
It would be really great to change the definition of the Partials{N,T} type based on the value of the type parameters (i.e. if T is a bitstype, wrap it in VecElement; otherwise, just use a normal NTuple{N,T}).
Is there any good way to do this?
Otherwise, we may need to have multiple types of Dual, or (alternatively) allow a Dual's value and partials types to be different.
Thoughts?
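A bare-bones sketch of that last alternative, purely to illustrate the shape of the idea (parameter names and layout are guesses, not a settled design):

# Hypothetical layout: decouple the value type V from the partials container P,
# so only the innermost Dual needs VecElement-backed partials.
immutable Dual{N,V<:Real,P} <: Real
    value::V        # e.g. Float64, or another Dual when nesting
    partials::P     # e.g. Partials{N,V}, or a VecElement-backed SIMDPartials{N,V}
end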
What is the problem with using VecElement if T is not a bits type? It should degenerate to exactly the same LLVM representation as a tuple of T for non-SIMD-supported T.
The problem with nesting Dual within VecElement seems to be performance. (The code still appears to run correctly.) It's possible that I've done something wrong in my code, but I'm only seeing the performance issue when computing 2nd (or higher) derivatives.
How does the generated code look for something really simple, like adding two nested dual numbers with VecElement / Tuple?
@KristofferC Please see the attached files, which show the generated code and timings for sizes of {1,3,6,9} x {1,3,6,9}. In the 1x1, 3x1, 6x1, and 9x1 cases, there is virtually no difference in the generated code. However, as the number of partials is increased in the nested dual, the generated code becomes considerably lengthier (with the 9x9 case being quite lengthy, indeed).
This seems to imply that VecElement has a cost, and that, sadly, that cost appears in exactly the place we were hoping to create the performance boost: when the innermost Dual has high dimension. My theory is that the trouble comes from nesting the inner Duals inside a VecElement container. If we can somehow peel away this wrapping for the outer Duals, but leave it intact for the innermost Duals (i.e. where we need it), then perhaps we can still achieve the desired optimization.
dualtest.jl.txt output_diff_sidebyside.txt output_vecel_disabed.txt output_vecel_enabled.txt
UPDATE: It appears that the generated code for (m by n; n > 1) duals is suffering from unnecessary allocations and memory copy calls.
@jrevels @KristofferC I'm now taking a different tack:
- A new SIMDPartials type. It has all the same operations as Partials, except that the storage is a tuple of VecElement{T}.
- Currently Dual{N,T} must have the same value type and partials type... I want something closer to Dual{N,V,P}, where V is the value type and P is the partials type (either Partials or SIMDPartials).
- The innermost Dual uses SIMDPartials.
I'm banging away on this this afternoon and will see where I get. Let me know if you have any thoughts.
Hopefully that works. However, I don't think there is any inherent reason why nested Vecs shouldn't work just as well, so if you already have some code around that shows the problem, perhaps a Base issue would be good?
Update: My above approach won't work. Thanks @KristofferC for encouraging me to further distill the problem and file an issue.
Feel free to read JuliaLang/julia#18857. It's possible that introducing the llvmcall optimizations will be sufficient to improve the performance, but it's starting to feel like a bit of a reach...
One of the problems with using Vec without lowering to the correct SIMD instructions is that LLVM will have to effectively unpack and repack at every step. More or less, you tell LLVM that it has to store this data in a particular way and place (the SIMD registers), and then you move it out and pack it back into these registers when doing work on it (and that is quite inefficient).
I have been working on supporting vector intrinsics in base Julia (https://github.com/JuliaLang/julia/pull/18470), but I haven't made much progress on that recently. (The problem is what to do in the case that LLVM is not available.) The goal of that work was to prevent people from having to roll their own llvmcall implementations (LLVM IR is not stable).
In the meantime I would recommend using SIMD.jl; it is a very nice package and it has solved several of the hard problems already.
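For anyone following along, a minimal SIMD.jl usage sketch (independent of ForwardDiff) looks like this:

using SIMD

# Vec{N,T} is SIMD.jl's packed vector type; element-wise arithmetic on it
# lowers to LLVM vector instructions.
a = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))
b = Vec{4,Float64}((5.0, 6.0, 7.0, 8.0))
c = a + b    # a single <4 x double> fadd
sum(c)       # horizontal reduction over the lanes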
@dbeach24 Just caught up with this issue after returning from vacation, thanks for looking into it. I'm a SIMD noob, and haven't really been following recent developments in Julia/the ecosystem (thanks for your insight, @vchuravy).
Upon revisiting this, I've just found that LLVM's SLP auto-vectorizer now seems to work in some cases for Julia v0.5. Lo and behold:
➜ data julia -O3 -q
julia> using ForwardDiff: Dual
julia> a = Dual(1., (2., 3.))
Dual(1.0,2.0,3.0)
julia> b = Dual(4., (5., 6.))
Dual(4.0,5.0,6.0)
julia> @code_llvm a + b
define void @"julia_+_70898"(%Dual* noalias sret, %Dual*, %Dual*) #0 {
top:
%3 = getelementptr inbounds %Dual, %Dual* %1, i64 0, i32 1, i32 0, i64 1
%4 = load double, double* %3, align 8
%5 = getelementptr inbounds %Dual, %Dual* %2, i64 0, i32 1, i32 0, i64 1
%6 = load double, double* %5, align 8
%7 = fadd double %4, %6
%8 = bitcast %Dual* %1 to <2 x double>*
%9 = load <2 x double>, <2 x double>* %8, align 8
%10 = bitcast %Dual* %2 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = fadd <2 x double> %9, %11
%13 = bitcast %Dual* %0 to <2 x double>*
store <2 x double> %12, <2 x double>* %13, align 8
%14 = getelementptr inbounds %Dual, %Dual* %0, i64 0, i32 1, i32 0, i64 1
store double %7, double* %14, align 8
ret void
}
julia> @code_native a + b
.section __TEXT,__text,regular,pure_instructions
Filename: dual.jl
pushq %rbp
movq %rsp, %rbp
Source line: 114
vmovsd 16(%rsi), %xmm0 ## xmm0 = mem[0],zero
vaddsd 16(%rdx), %xmm0, %xmm0
Source line: 182
vmovupd (%rsi), %xmm1
vaddpd (%rdx), %xmm1, %xmm1
vmovupd %xmm1, (%rdi)
vmovsd %xmm0, 16(%rdi)
movq %rdi, %rax
popq %rbp
retq
nopw %cs:(%rax,%rax)
If I don't use -O3, then the LLVM IR doesn't show any vector instructions, but the generated assembly is basically the same (this benefit is probably target-specific). EDIT: I got a vaddsd mixed up with a vaddpd; you do, in fact, need -O3 to get it to vectorize.
Note that the example in this PR's initial description also now auto-vectorizes for me (in the LLVM IR + native assembly with -O3, and only in the native assembly without -O3).
@jrevels Very interesting, indeed. If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?
Also noted at the end of https://github.com/JuliaArrays/StaticArrays.jl/blob/master/README.md#speed.
However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.
However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.
It's working for me (it also works for the simd_sum example):
julia> a = Dual(1., (2., 3., 4., 5.))
Dual(1.0,2.0,3.0,4.0,5.0)
julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0,5.0)
julia> @code_llvm a + b
define void @"julia_+_70874"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
%3 = getelementptr inbounds %Dual.66, %Dual.66* %1, i64 0, i32 1, i32 0, i64 3
%4 = load double, double* %3, align 8
%5 = getelementptr inbounds %Dual.66, %Dual.66* %2, i64 0, i32 1, i32 0, i64 3
%6 = load double, double* %5, align 8
%7 = fadd double %4, %6
%8 = bitcast %Dual.66* %1 to <4 x double>*
%9 = load <4 x double>, <4 x double>* %8, align 8
%10 = bitcast %Dual.66* %2 to <4 x double>*
%11 = load <4 x double>, <4 x double>* %10, align 8
%12 = fadd <4 x double> %9, %11
%13 = bitcast %Dual.66* %0 to <4 x double>*
store <4 x double> %12, <4 x double>* %13, align 8
%14 = getelementptr inbounds %Dual.66, %Dual.66* %0, i64 0, i32 1, i32 0, i64 3
store double %7, double* %14, align 8
ret void
}
It can even vectorize operations on the values + partials simultaneously, if the lengths work out correctly:
julia> a = Dual(1., 2., 3., 4.)
Dual(1.0,2.0,3.0,4.0)
julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0)
julia> @code_llvm a + b
define void @"julia_+_71231"(%Dual.74* noalias sret, %Dual.74*, %Dual.74*) #0 {
top:
%3 = bitcast %Dual.74* %1 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = bitcast %Dual.74* %2 to <4 x double>*
%6 = load <4 x double>, <4 x double>* %5, align 8
%7 = fadd <4 x double> %4, %6
%8 = bitcast %Dual.74* %0 to <4 x double>*
store <4 x double> %7, <4 x double>* %8, align 8
ret void
}
If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?
Possibly. I guess it could depend on the context-sensitivity of the auto-vectorizer (it's LLVM's SLP vectorizer, BTW, not Julia's) - I don't know if it works in every case. So far, though, it seems to work where I'd expect it to work. My original goal was to vectorize arithmetic operations on Partials, which theoretically are perfect candidates for SLP vectorization, but it might be possible to leverage SIMD in more advanced ways depending on how you're using Dual numbers.
Anyway, do we consider this issue closed since it now works with -O3?
vectorize the innermost partials of a nested Dual type
Just had to make sure this works, and it does (soooo cool that this works):
julia> a = Dual(Dual(1., 2.), Dual(3., 4.))
Dual(Dual(1.0,2.0),Dual(3.0,4.0))
julia> b = copy(a)
Dual(Dual(1.0,2.0),Dual(3.0,4.0))
julia> @code_llvm a + b
define void @"julia_+_70918"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
%3 = bitcast %Dual.66* %1 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = bitcast %Dual.66* %2 to <4 x double>*
%6 = load <4 x double>, <4 x double>* %5, align 8
%7 = fadd <4 x double> %4, %6
%8 = bitcast %Dual.66* %0 to <4 x double>*
store <4 x double> %7, <4 x double>* %8, align 8
ret void
}
This line, %3 = bitcast %Dual.66* %1 to <4 x double>*, is really cool to see. The only benefit of using SIMD.jl now is that we could have these optimisations without -O3 (I have to check which passes we are running with -O3).
I don't think we can currently define a convert method between <4 x double> and a struct with llvmcall. And reinterpret only works for bitstypes.
I tried various operations on nested duals, and it appears that the SIMD instructions appear in more complex cases as well -- as long as I specify -O3.
I did not see <4 x double>, but merely <2 x double> instructions, for @jrevels's example above. Does the compiled LLVM depend in any way on my native architecture (Core i7)?
UPDATE: Apparently <4 x double> works just fine if I compile my own Julia instead of using the prebuilt 0.5.0 release...
There are many factors at play here:
- avx and avx2 support is necessary, but even then the LLVM cost model might decide otherwise.
- cat /proc/cpuinfo on Linux will give you this information (take a look at flags).
It is a shame that we can't do:
immutable Double
    x::Float64
    y::Float64
end

reinterpret(NTuple{2,VecElement{Float64}}, Double(1.0, 2.0))
but this works:
reinterpret(NTuple{2,VecElement{Float64}}, [Double(1.0, 2.0)])
Now we just need to lobby to enable the vector passes with default -O level :)
Would be interesting to run the benchmarks with and without -O3 to see if it gives any cool speedups.
-O3 on left:
Hessian and gradient seem faster?
Yeah, I saw similar perf differences on my machine. Wonder what's causing the slowdown for some of the benchmarks? It could be from some other pass enabled via -O3 besides the SLP vectorizer (though the same case can be made for the speed-ups). Is the alias analysis pass the only other -O3-specific pass?
Perhaps the way forward is just to document this and close the issue? I'm pretty mind-blown by how good the code generated with -O3 is... When did this start happening (or have we just not tried it before)? A speedup of Rosenbrock by almost a factor of 3 is pretty great, no?
Perhaps the way forward is just to document this and close the issue?
I agree with this.
When did this start happening (or have we just not tried it before)?
I played around with trying to get the SLP vectorizer to actually trigger/emit vectorized instructions for ForwardDiff's old number types back in the days of Julia v0.3/early v0.4, but never succeeded. When @dbeach24 started reviving discussion here, I remembered that I should try again with the new Dual type and Julia v0.5.
Speedup of Rosenbrock with almost a factor of 3 is pretty great, no?
Yeah, I was really excited to see that 😃
Is there any way we can add a test for regressions here?
We could have a bunch of tests like:
@test contains(sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b)))), "fadd <4 x double>")
Would this be too fragile?
It would fail if you forgot to run with -O3. It would also fail if you're using the prebuilt Julia 0.5.0 for Mac, since (as I recently discovered) that doesn't include support for <4 x double> operations.
Well yeah, we'd have to put -O3 in the Travis script to catch SIMD regressions. The LLVM IR generated should be the same either way, no (i.e. it's just that the native code won't include SIMD)? Do we emit different IR based on the architecture? I thought it was LLVM's job to abstract over the architecture.
For local testing, we might be able to get the optimization level from within Julia, then only run the SIMD tests if it's 3.
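Something along these lines might work (a sketch; Base.JLOptions().opt_level reports the -O level the running process was started with):

using Base.Test

# Skip the SIMD codegen check unless Julia was started with -O3 (or higher).
if Base.JLOptions().opt_level >= 3
    ir = sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b))))
    @test contains(ir, "fadd <4 x double>")
end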
The IR is also different depending on whether the sysimg is locally compiled or not.
A standard reduction function, which vectorizes with Vector{Float64}, fails to vectorize with ForwardDiff numbers. It is possible that there is nothing that can be done on the ForwardDiff side to solve this, but I thought it could be interesting to track nonetheless.
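(The code snippets from this original post aren't preserved in the transcript; below is a hedged reconstruction of the kind of reduction being described, mirroring the simd_sum function quoted earlier in the thread.)

using ForwardDiff

# Hedged reconstruction, not the exact original snippet: a plain @simd reduction
# that vectorizes for Vector{Float64} but, at the time this issue was opened,
# failed to vectorize for a vector of ForwardDiff numbers.
function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

@code_llvm simd_sum(rand(100))                                          # vectorizes
@code_llvm simd_sum([ForwardDiff.Dual(rand(), rand()) for i in 1:100])  # did not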