Unfortunately, I am not well versed enough in LLVM IR to be able to pinpoint exactly what the problem is...
This generates very similar code and also fails to vectorize:
function tuple_simd_sum{T}(x::Vector{NTuple{4, T}})
    s = (0.0, 0.0, 0.0, 0.0)
    @inbounds @simd for i in eachindex(x)
        x_i = x[i]
        s = (s[1] + x_i[1], s[2] + x_i[2], s[3] + x_i[3], s[4] + x_i[4])
    end
    return s
end
tuple_vec = [(rand(), rand(), rand(), rand()) for i = 1:20]
@code_llvm tuple_simd_sum(tuple_vec)
If moving from NTuple to another storage type for the epsilon components would make it easier to get access to SIMD, I'd say it's worth considering.
I don't know what other storage type that would be. A vector of GradientNumber is already packed efficiently in contiguous memory, so I don't think the storage type is wrong; it is just translated to LLVM IR in a suboptimal way.
There are some patches pending to LLVM that might help https://github.com/JuliaLang/julia/pull/5355#issuecomment-176335389
It would be interesting to experiment with https://github.com/eschnett/SIMD.jl when it is mature enough.
Using NTuple storage for the epsilon components will allow us to manually leverage SIMD using llvmcall, such that Partials{N,T,NTuple{N,T}} (in an upcoming refactor, just Partials{N,T}) behaves like SIMD.jl's Vec{N,T}.
I think manually calling SIMD instructions might be the most consistent way to resolve this issue (instead of having to rely on the @simd heuristics in Base).
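For reference, a minimal sketch of what manually leveraging SIMD via llvmcall can look like (adapted from the Julia manual's VecElement/llvmcall pattern, not ForwardDiff code; Vec4 and vadd are made-up names):

# NTuple{4,VecElement{Float64}} maps to LLVM's <4 x double>, so the llvmcall
# body below compiles to a single vector fadd.
const Vec4 = NTuple{4,VecElement{Float64}}

function vadd(a::Vec4, b::Vec4)
    Base.llvmcall("""
        %3 = fadd <4 x double> %0, %1
        ret <4 x double> %3
        """, Vec4, Tuple{Vec4,Vec4}, a, b)
end

vadd(map(VecElement, (1.0, 2.0, 3.0, 4.0)),
     map(VecElement, (5.0, 6.0, 7.0, 8.0)))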
I made a small proof of concept using SIMD.jl's Vec for the dual part at: https://github.com/JuliaDiff/ForwardDiff.jl/commit/3fcd7966941e3f1046fabdfdbd414680771fb4c1. I basically threw out the whole PartialVec just to get it working as fast as possible.
Doing something like:
using ForwardDiff
using SIMD
function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end
vec = [ForwardDiff.GradientNumber{4, Float64, Vec{4, Float64}}(1.0, ForwardDiff.Partials(Vec{4, Float64}((rand(), rand(), rand(), rand())))) for i in 1:500]
@code_native simd_sum(vec)
shows a bunch of vectorized operations being used. It is, however, about twice as slow as the original code, heh. Maybe my implementation is not perfect, or SIMD.jl needs a bit more battle experience. Still pretty interesting.
I made another proof of concept of getting the operations on the partial numbers SIMD accelerated.
First I started on a package, SIMDVectors.jl: https://github.com/KristofferC/SIMDVectors.jl. It uses https://github.com/JuliaLang/julia/pull/15244 to generate vectorized operations for a type similar to a tuple.
I then made a GradientNumber and Partials that use this SIMDVector as the Partials.data field: https://gist.github.com/KristofferC/56594a9c57d6f27df769
The result is quite nicely generated SIMD code:
include("simd_fdiff.jl")
using SIMDVectors
a = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))
b = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))
@code_native a + b
.
vaddpd 16(%rdx), %xmm5, %xmm5
vaddpd 32(%rdx), %xmm4, %xmm4
vaddpd 48(%rdx), %xmm3, %xmm3
vaddpd 64(%rdx), %xmm2, %xmm2
vaddsd 80(%rdx), %xmm1, %xmm1
vaddsd (%rdx), %xmm0, %xmm0
.
julia> @code_native a * b
.
vmulpd 32(%rsi), %xmm1, %xmm10
vmulpd 48(%rsi), %xmm1, %xmm11
vmulpd 64(%rsi), %xmm1, %xmm8
Source line: 6
vmulsd 80(%rsi), %xmm0, %xmm5
Source line: 34
vmovsd (%rdx), %xmm6 # xmm6 = mem[0],zero
Source line: 8
vmovddup %xmm6, %xmm7 # xmm7 = xmm6[0,0]
vmulpd 16(%rdx), %xmm7, %xmm1
vmulpd 32(%rdx), %xmm7, %xmm2
vmulpd 48(%rdx), %xmm7, %xmm3
vmulpd 64(%rdx), %xmm7, %xmm7
Source line: 6
vmulsd 80(%rdx), %xmm6, %xmm4
Source line: 8
vaddpd %xmm1, %xmm9, %xmm1
vaddpd %xmm2, %xmm10, %xmm2
vaddpd %xmm3, %xmm11, %xmm3
vaddpd %xmm7, %xmm8, %xmm7
Source line: 6
vaddsd %xmm4, %xmm5, %xmm4
vmulsd %xmm6, %xmm0, %xmm0
.
Using SIMD.jl with the new #15244, I get this for a multiplication of two gradient numbers:
julia> @code_native a*b
.
vbroadcastsd %xmm1, %ymm2
vmulpd 32(%rsi), %ymm2, %ymm2
vbroadcastsd %xmm0, %ymm3
vmulpd 32(%rdx), %ymm3, %ymm3
vaddpd %ymm2, %ymm3, %ymm2
Source line: 65
vmulsd %xmm1, %xmm0, %xmm0
.
which uses lovely AVX 256-bit instructions. The example in the OP also runs faster.
This is so cool. I suppose now the race is on for me to wrap up #102 before JuliaLang/julia#15244 lands.
There are of course a few caveats: SIMD Vectors are limited to length 16, and SIMD.jl has no support for types other than C types. I think the best thing would be to let the user choose Vec or Tuple as the backend storage for Partials when creating the gradient function, defaulting to Tuple of course.
It shouldn't be that much code, since only simple operations are needed on Partials.
I guess that means no nested differentiation as well...
SIMD + nested differentiation is not a use case that I care too much about.
So how much faster will a multiplication of two 16-length gradient numbers be? 4x faster?
Theoretically yes, but I'm segfaulting now when I run @benchmark, so I can't actually properly test it.
Couldn't SIMDVector still be used for the "innermost" partials of a nested differentiation?
The best would be if whatever is used to hold the partials would just degenerate to a normal tuple when the type is not something that SIMD would work on, as in https://github.com/KristofferC/SIMDVectors.jl#user-defined-number-types. SIMD.jl is better than that package, though, so adding support to that would be nice.
@jrevels I was just reading about and experimenting with Julia 0.5's SIMD support. Are there any technical hurdles to implementing SIMD optimizations for the Partials type at this time?
As I think you suggested earlier, it seems like Partials{N,T} could be defined as:
immutable Partials{N,T}
    values::NTuple{N,VecElement{T}}
end
with the corresponding tupexpr functions specialized to emit LLVM SIMD opcodes for the cases where T is a bitstype.
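As a rough illustration (not the proposed tupexpr machinery itself), one such element-wise operation on the Partials sketched above could be generated like this, leaving the SLP vectorizer or an explicit llvmcall to fuse the unrolled adds into a vector instruction:

import Base: +

# Illustrative sketch only: an unrolled element-wise + for VecElement-backed
# partials. A real implementation would have tupexpr (or llvmcall) emit the
# SIMD opcode directly.
@generated function +{N,T}(a::Partials{N,T}, b::Partials{N,T})
    terms = [:(VecElement(a.values[$i].value + b.values[$i].value)) for i = 1:N]
    return :(Partials{N,T}(tuple($(terms...))))
end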
What happens when T is not a bitstype?
Note: "However, since the automatic vectorization cannot be relied upon, future use will mostly be via libraries that use llvmcall."
This is what SIMD.jl is doing.
@KristofferC Understood. Also looked at SIMD.jl.
I thought @jrevels expressed the opinion that he would prefer to insert the llvmcall code directly into ForwardDiff.jl, rather than rely on an external dependency. (At least that's how I interpreted his above comment on Feb 3.)
Is SIMD.jl mature enough to be used as a dependency at this time?
Since we do not require many fancy operations on the partials, it should be possible to just roll our own. When I experimented with this I got random crashes when trying to benchmark, but perhaps the situation has improved.
Attempting to answer my first question about this approach, i.e. what is the cost of defining the partials as NTuple{N, VecElement{T}}?
I created a branch of ForwardDiff.jl where the Partials are defined in this way (for Julia 0.5 only). No SIMD optimizations have been added (yet) -- only the storage of the partials is changed. For 1st derivatives, the performance seems to be unchanged (expected). However, I'm seeing a 2-4x slowdown when using nested Duals, and this is certainly problematic. This problem will need to be solved before it's even worth thinking about introducing the SIMD operations.
It would be really great to change the definition of the Partials{N,T} type based on the value of the type parameters (i.e. if T is a bitstype, wrap it in VecElement; otherwise, just use a normal NTuple{N,T}).
Is there any good way to do this?
Otherwise, we may need to have multiple types of Dual, or (alternatively) allow a Dual's value and partials types to be different.
Thoughts?
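A bare-bones sketch of that last alternative, purely to illustrate the shape of the idea (parameter names and layout are guesses, not a settled design):

# Hypothetical layout: decouple the value type V from the partials container P,
# so only the innermost Dual needs VecElement-backed partials.
immutable Dual{N,V<:Real,P} <: Real
    value::V        # e.g. Float64, or another Dual when nesting
    partials::P     # e.g. Partials{N,V}, or a VecElement-backed SIMDPartials{N,V}
end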
What is the problem with using VecElement if T is not a bits type? It should degenerate to exactly the same LLVM representation as a tuple of T for non-SIMD-supported T.
The problem with nesting Dual within VecElement seems to be performance. (The code still appears to run correctly.) It's possible that I've done something wrong in my code, but I'm only seeing the performance issue when computing 2nd (or higher) derivatives.
How does the generated code look for something really simple, like adding two nested dual numbers with VecElement / Tuple?
@KristofferC Please see the attached files, which show the generated code and timings for sizes of {1,3,6,9} x {1,3,6,9}. In the 1x1, 3x1, 6x1, and 9x1 cases, there is virtually no difference in the generated code. However, as the number of partials is increased in the nested dual, the generated code becomes considerably lengthier (with the 9x9 case being quite lengthy, indeed).
This seems to imply that VecElement has a cost, and that, sadly, that cost appears in exactly the place we were hoping to create the performance boost: when the innermost Dual has high dimension. My theory is that the trouble comes from nesting the inner Duals inside a VecElement container. If we can somehow peel away this wrapping for the outer Duals, but leave it intact for the innermost Duals (i.e. where we need it), then perhaps we can still achieve the desired optimization.
dualtest.jl.txt output_diff_sidebyside.txt output_vecel_disabed.txt output_vecel_enabled.txt
UPDATE: It appears that the generated code for (m by n; n > 1) duals is suffering from unnecessary allocations and memory copy calls.
@jrevels @KristofferC I'm now taking a different tack:
- A new SIMDPartials type. It has all the same operations as Partials, except that the storage is a tuple of VecElement{T}.
- Currently Dual{N,T} must have the same value type and partials type... I want something closer to Dual{N,V,P}, where V is the value type and P is the partials type (either Partials or SIMDPartials).
- The innermost Dual uses SIMDPartials.
I'm banging away on this this afternoon and will see where I get. Let me know if you have any thoughts.
Hopefully that works. However, I don't think there is any inherent reason why nested Vecs shouldn't work just as well, so if you already have some code around that shows the problem, perhaps a Base issue would be good?
Update: My above approach won't work. Thanks @KristofferC for encouraging me to further distill the problem and file an issue.
Feel free to read JuliaLang/julia#18857. It's possible that introducing the llvmcall optimizations will be sufficient to improve the performance, but it's starting to feel like a bit of a reach...
One of the problems with using Vec without lowering to the correct SIMD instructions is that LLVM will have to effectively unpack and repack at every step. More or less, you tell LLVM that it has to store this data in a particular way and place (the SIMD registers), and then you move it out and pack it back into these registers when doing work on it (and that is quite inefficient).
I have been working on supporting vector intrinsics in base Julia (https://github.com/JuliaLang/julia/pull/18470), but I haven't made much progress on that recently. (The problem is what to do in the case that LLVM is not available.) The goal of that work was to prevent people from having to roll their own llvmcall implementations (LLVM IR is not stable).
In the meantime I would recommend using SIMD.jl; it is a very nice package and it has solved several of the hard problems already.
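For anyone following along, a minimal SIMD.jl usage sketch (independent of ForwardDiff) looks like this:

using SIMD

# Vec{N,T} is SIMD.jl's packed vector type; element-wise arithmetic on it
# lowers to LLVM vector instructions.
a = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))
b = Vec{4,Float64}((5.0, 6.0, 7.0, 8.0))
c = a + b    # a single <4 x double> fadd
sum(c)       # horizontal reduction over the lanes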
@dbeach24 Just caught up with this issue after returning from vacation, thanks for looking into it. I'm a SIMD noob, and haven't really been following recent developments in Julia/the ecosystem (thanks for your insight, @vchuravy).
Upon revisiting this, I've just found that LLVM's SLP auto-vectorizer now seems to work in some cases for Julia v0.5. Lo and behold:
➜ data julia -O3 -q
julia> using ForwardDiff: Dual
julia> a = Dual(1., (2., 3.))
Dual(1.0,2.0,3.0)
julia> b = Dual(4., (5., 6.))
Dual(4.0,5.0,6.0)
julia> @code_llvm a + b
define void @"julia_+_70898"(%Dual* noalias sret, %Dual*, %Dual*) #0 {
top:
%3 = getelementptr inbounds %Dual, %Dual* %1, i64 0, i32 1, i32 0, i64 1
%4 = load double, double* %3, align 8
%5 = getelementptr inbounds %Dual, %Dual* %2, i64 0, i32 1, i32 0, i64 1
%6 = load double, double* %5, align 8
%7 = fadd double %4, %6
%8 = bitcast %Dual* %1 to <2 x double>*
%9 = load <2 x double>, <2 x double>* %8, align 8
%10 = bitcast %Dual* %2 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = fadd <2 x double> %9, %11
%13 = bitcast %Dual* %0 to <2 x double>*
store <2 x double> %12, <2 x double>* %13, align 8
%14 = getelementptr inbounds %Dual, %Dual* %0, i64 0, i32 1, i32 0, i64 1
store double %7, double* %14, align 8
ret void
}
julia> @code_native a + b
.section __TEXT,__text,regular,pure_instructions
Filename: dual.jl
pushq %rbp
movq %rsp, %rbp
Source line: 114
vmovsd 16(%rsi), %xmm0 ## xmm0 = mem[0],zero
vaddsd 16(%rdx), %xmm0, %xmm0
Source line: 182
vmovupd (%rsi), %xmm1
vaddpd (%rdx), %xmm1, %xmm1
vmovupd %xmm1, (%rdi)
vmovsd %xmm0, 16(%rdi)
movq %rdi, %rax
popq %rbp
retq
nopw %cs:(%rax,%rax)
If I don't use -O3, then the LLVM IR doesn't show any vector instructions, but the generated assembly is basically the same (this benefit is probably target-specific). EDIT: I got a vaddsd mixed up with a vaddpd; you do, in fact, need -O3 to get it to vectorize.
Note that the example in this PR's initial description also now auto-vectorizes for me (in the LLVM IR + native assembly with -O3, and only in the native assembly without -O3).
@jrevels Very interesting, indeed. If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?
Also noted at the end of https://github.com/JuliaArrays/StaticArrays.jl/blob/master/README.md#speed.
However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.
However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.
It's working for me (it also works for the simd_sum example):
julia> a = Dual(1., (2., 3., 4., 5.))
Dual(1.0,2.0,3.0,4.0,5.0)
julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0,5.0)
julia> @code_llvm a + b
define void @"julia_+_70874"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
%3 = getelementptr inbounds %Dual.66, %Dual.66* %1, i64 0, i32 1, i32 0, i64 3
%4 = load double, double* %3, align 8
%5 = getelementptr inbounds %Dual.66, %Dual.66* %2, i64 0, i32 1, i32 0, i64 3
%6 = load double, double* %5, align 8
%7 = fadd double %4, %6
%8 = bitcast %Dual.66* %1 to <4 x double>*
%9 = load <4 x double>, <4 x double>* %8, align 8
%10 = bitcast %Dual.66* %2 to <4 x double>*
%11 = load <4 x double>, <4 x double>* %10, align 8
%12 = fadd <4 x double> %9, %11
%13 = bitcast %Dual.66* %0 to <4 x double>*
store <4 x double> %12, <4 x double>* %13, align 8
%14 = getelementptr inbounds %Dual.66, %Dual.66* %0, i64 0, i32 1, i32 0, i64 3
store double %7, double* %14, align 8
ret void
}
It can even vectorize operations on the values + partials simultaneously, if the lengths work out correctly:
julia> a = Dual(1., 2., 3., 4.)
Dual(1.0,2.0,3.0,4.0)
julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0)
julia> @code_llvm a + b
define void @"julia_+_71231"(%Dual.74* noalias sret, %Dual.74*, %Dual.74*) #0 {
top:
%3 = bitcast %Dual.74* %1 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = bitcast %Dual.74* %2 to <4 x double>*
%6 = load <4 x double>, <4 x double>* %5, align 8
%7 = fadd <4 x double> %4, %6
%8 = bitcast %Dual.74* %0 to <4 x double>*
store <4 x double> %7, <4 x double>* %8, align 8
ret void
}
If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?
Possibly. I guess it could depend on the context-sensitivity of the auto-vectorizer (it's LLVM's SLP vectorizer, BTW, not Julia's) - I don't know if it works in every case. So far, though, it seems to work where I'd expect it to work. My original goal was to vectorize arithmetic operations on Partials, which theoretically are perfect candidates for SLP vectorization, but it might be possible to leverage SIMD in more advanced ways depending on how you're using Dual numbers.
Anyway, do we consider this issue closed since it now works with -O3?
vectorize the innermost partials of a nested Dual type
Just had to make sure this works, and it does (soooo cool that this works):
julia> a = Dual(Dual(1., 2.), Dual(3., 4.))
Dual(Dual(1.0,2.0),Dual(3.0,4.0))
julia> b = copy(a)
Dual(Dual(1.0,2.0),Dual(3.0,4.0))
julia> @code_llvm a + b
define void @"julia_+_70918"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
%3 = bitcast %Dual.66* %1 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = bitcast %Dual.66* %2 to <4 x double>*
%6 = load <4 x double>, <4 x double>* %5, align 8
%7 = fadd <4 x double> %4, %6
%8 = bitcast %Dual.66* %0 to <4 x double>*
store <4 x double> %7, <4 x double>* %8, align 8
ret void
}
This line, %3 = bitcast %Dual.66* %1 to <4 x double>*, is really cool to see. The only benefit of using SIMD.jl now is that we could have these optimisations without -O3 (I have to check which passes we are running with -O3).
I don't think we can currently define a convert method between <4 x double> and a struct with llvmcall. And reinterpret only works for bitstypes.
I tried various operations on nested duals, and it appears that the SIMD instructions appear in more complex cases as well -- as long as I specify -O3.
I did not see <4 x double>, but merely <2 x double> instructions, for @jrevels's example above. Does the compiled LLVM depend in any way on my native architecture (Core i7)?
UPDATE: Apparently <4 x double> works just fine if I compile my own Julia instead of using the prebuilt 0.5.0 release...
There are many factors at play here:
- avx and avx2 support is necessary, but even then the LLVM cost model might decide otherwise.
- cat /proc/cpuinfo on Linux will give you this information (take a look at flags).
It is a shame that we can't do:
immutable Double
    x::Float64
    y::Float64
end

reinterpret(NTuple{2,VecElement{Float64}}, Double(1.0, 2.0))
but this works:
reinterpret(NTuple{2,VecElement{Float64}}, [Double(1.0, 2.0)])
Now we just need to lobby to enable the vector passes with default -O level :)
Would be interesting to run the benchmarks with and without -O3 to see if it gives any cool speedups.
-O3 on left:
Hessian and gradient seem faster?
Yeah, I saw similar perf differences on my machine. Wonder what's causing the slowdown for some of the benchmarks? It could be from some other pass enabled via -O3 besides the SLP vectorizer (though the same case can be made for the speed-ups). Is the alias analysis pass the only other -O3-specific pass?
Perhaps the way forward is just to document this and close the issue? I'm pretty mind-blown by how good the code generated with -O3 is... When did this start happening (or have we just not tried it before)? A speedup of Rosenbrock by almost a factor of 3 is pretty great, no?
Perhaps the way forward is just to document this and close the issue?
I agree with this.
When did this start happening (or have we just not tried it before)?
I played around with trying to get the SLP vectorizer to actually trigger/emit vectorized instructions for ForwardDiff's old number types back in the days of Julia v0.3/early v0.4, but never succeeded. When @dbeach24 started reviving discussion here, I remembered that I should try again with the new Dual type and Julia v0.5.
Speedup of Rosenbrock with almost a factor of 3 is pretty great, no?
Yeah, I was really excited to see that 😃
Is there any way we can add a test for regressions here?
We could have a bunch of tests like:
@test contains(sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b)))), "fadd <4 x double>")
Would this be too fragile?
It would fail if you forgot to run with -O3. It would also fail if you're using the prebuilt Julia 0.5.0 for Mac, since (as I recently discovered) that doesn't include support for <4 x double> operations.
Well yeah, we'd have to put -O3 in the Travis script to catch SIMD regressions. The LLVM IR generated should be the same either way, no (i.e. it's just that the native code won't include SIMD)? Do we emit different IR based on the architecture? I thought it was LLVM's job to abstract over the architecture.
For local testing, we might be able to get the optimization level from within Julia, then only run the SIMD tests if it's 3.
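Something along these lines might work (a sketch; Base.JLOptions().opt_level reports the -O level the running process was started with):

using Base.Test

# Skip the SIMD codegen check unless Julia was started with -O3 (or higher).
if Base.JLOptions().opt_level >= 3
    ir = sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b))))
    @test contains(ir, "fadd <4 x double>")
end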
The IR is also different depending on whether the sysimg is locally compiled or not.
A standard reduction function, which vectorizes with Vector{Float64}, fails to vectorize with ForwardDiff numbers. It is possible that there is nothing that can be done on the ForwardDiff side to solve this, but I thought it could be interesting to track nonetheless.
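(The code snippets from this original post aren't preserved in the transcript; below is a hedged reconstruction of the kind of reduction being described, mirroring the simd_sum function quoted earlier in the thread.)

using ForwardDiff

# Hedged reconstruction, not the exact original snippet: a plain @simd reduction
# that vectorizes for Vector{Float64} but, at the time this issue was opened,
# failed to vectorize for a vector of ForwardDiff numbers.
function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

@code_llvm simd_sum(rand(100))                                          # vectorizes
@code_llvm simd_sum([ForwardDiff.Dual(rand(), rand()) for i in 1:100])  # did not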