Use with ForwardDiff - Githubissues

I was trying to use this with ForwardDiff, but not an array of structs like #19, just using dual numbers within a loop and then extracting ordinary numbers. Something like this:

dA = ForwardDiff.Dual(0.0, (1.0, 0.0))
dB = ForwardDiff.Dual(0.0, (0.0, 1.0))
for i in axes(A,1)
    tmp = (A[i] + dA) * log(B[i] + dB)
    C[i] = ForwardDiff.partials(tmp, 1)
...

I got this to run with @avx on the loop, by filling in whatever methods were missing, such as:

function Base.:+(x::Dual{Z,T,D}, sv::SVec{N}) where {Z,T<:Number,D,N}
    duals = ntuple(n -> +(x, sv[n]), N)
    Dual(svec(val.(duals)), ntuple(d -> svec(partials.(duals, d)), D))
end

@inline val(d::Dual) = d.value
@inline svec(tup::NTuple{N,T}) where {N,T} = SVec{N,T}(tup...)

However the result is slower than without, and has many more allocations.

I wonder whether this is expected to work, and whether adding methods like this is right thing to do? I can tidy up an example if there isn't an obvious fatal flaw here.

The README also has an example with complex numbers.

Mind posting all the code for a reproducible example?

Allocations suggests there may be a type instability somewhere. Or perhaps a method (possibly in my code) defined with excess parameters in the where, e.g. foo(a) where {T}. I wish that threw an error, instead of running slowly and allocating memory.

EDIT: The key to remember is that you should keep like values in their own registers/vectors. That means you should not make vectors of dual numbers.

Currently, the best thing to do would be to have separate SVecs for every one of the partials and for the value itself. That is because LoopVectorization currently assumes they each occupy 1 register. If you define a Struct-of-SVecs, which would be the long-term ideal solution, each of these will occupy multiple registers.

But it'd probably actually be easiest to take the Struct-of-SVecs approach, and specify @avx unroll=1 to make sure it doesn't do anything excessive. You can try unroll=2 or slightly larger numbers too.

By Struct-of-SVec, I mean that your + definition should not be creating a SVec of Dual{<:Any,<:Number}, but aDual{<:Any,<:SVec}`.

I've also been trying to get back to working on defining a reverse-pass for LoopVectorization's loops, but I've been distracted for the past month.

OK, I think I see what you mean about keeping things as struct-of-SVecs, and a few definitions like this appear to be enough:

function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T,D,N,S}
    y = x.value + sv 
    ps = ntuple(d -> partials(x,d) + zero(sv), Val(D))
    Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end

@code_warntype is happy with that, but speed is improved by being a bit more cautious and avoiding typeof(y), current attempt is take 4 here.

The example in which I was timing things is now here, not so minimal!

I start to think that the problem is things like ForwardDiff.partials(tmp, 1). Commenting these out (ideally leaving C[j] = dC[j] + one(T)) speeds things up greatly. I tried writing instead tmp.partials.values[1] but then got errors from @avx. Likewise unroll=1 seems to cause errors on the gradient. Perhaps also strangely, I need using ForwardDiff: partials for things to work (only with @avx) which seems like some strange scope issue?

Looking at those benchmarks, there are a lot of allocations when using @avx, but not otherwise. Could you by any chance try and find out where they're coming from?

Perhaps also strangely, I need using ForwardDiff: partials for things to work (only with @avx) which seems like some strange scope issue?

Is that because of:

function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T,D,N,S}
    y = x.value + sv 
    ps = ntuple(d -> partials(x,d) + zero(sv), Val(D))
    Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end

? Or, would similar code (not involving SVec) work? partials isn't normally exported, because ForwardDiff doesn't export anything.

Re ForwardDiff: partials, that is in scope where I define these methods for + etc, although the latest version uses x.partials.values[d] instead of the function. But inside the @avx loop I had to keep the function, and the macro defines it $ForwardDiff.partials(x, $d), which is enough for the non-avx loop to see it.

Weirdly, after fiddling with seemingly unrelated things since then, this scope issue has gone away for code made by the macro. But persists in the ∇apply!109 function I copied into that file. And they have very different numbers of allocations, although both are slow:

julia> @btime Tracker.gradient(sum∘unfill∘f_fwd_avx, $A, $B, $C); # newer macro's code
  2.448 ms (89 allocations: 2.47 MiB)

julia> @btime Tracker.gradient(sum∘unfill∘create109, $A, $B, $C); # won't run without using ForwardDiff: partials
  2.754 ms (180065 allocations: 18.03 MiB)

In ∇apply!109 there are some attempts to avoid this function by writing ℛℰ𝒮.partials[3] etc, all of which errors from @avx. Is getfield expected to work?

Yes, that's definitely a problem with @avx:

julia> @macroexpand @avx for i = 📏i
           for j = 📏j
               ℛℰ𝒮 = (A[i] + 𝜀A) * log((B[i, j] + 𝜀B) / (C[j] + 𝜀C))
               𝛥B[i, j] = 𝛥B[i, j] + (ForwardDiff).partials(ℛℰ𝒮, 1) * 𝛥ℛℰ𝒮[i]
               𝛥C[j] = 𝛥C[j] + (ForwardDiff).partials(ℛℰ𝒮, 2) * 𝛥ℛℰ𝒮[i]
               𝛥A[i] = 𝛥A[i] + (ForwardDiff).partials(ℛℰ𝒮, 3) * 𝛥ℛℰ𝒮[i]
           end
       end
quote
    var"##loopi#298" = LoopVectorization.maybestaticrange(📏i)
    var"##i_loop_lower_bound#299" = LoopVectorization.maybestaticfirst(var"##loopi#298")
    var"##i_loop_upper_bound#300" = LoopVectorization.maybestaticlast(var"##loopi#298")
    var"##loopj#301" = LoopVectorization.maybestaticrange(📏j)
    var"##j_loop_lower_bound#302" = LoopVectorization.maybestaticfirst(var"##loopj#301")
    var"##j_loop_upper_bound#303" = LoopVectorization.maybestaticlast(var"##loopj#301")
    var"##vptr##_A" = LoopVectorization.stridedpointer(A)
    var"##vptr##_B" = LoopVectorization.stridedpointer(B)
    var"##vptr##_C" = LoopVectorization.stridedpointer(C)
    var"##vptr##_𝛥B" = LoopVectorization.stridedpointer(𝛥B)
    var"##vptr##_𝛥ℛℰ𝒮" = LoopVectorization.stridedpointer(𝛥ℛℰ𝒮)
    var"##vptr##_𝛥C" = LoopVectorization.stridedpointer(𝛥C)
    var"##vptr##_𝛥A" = LoopVectorization.stridedpointer(𝛥A)
    begin
        $(Expr(:gc_preserve, :(LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x02), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000102, LoopVectorization.compute, 0x00, 0x03), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x04), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000405, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x07), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x08), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000708, LoopVectorization.compute, 0x00, 0x09), :LoopVectorization, :/, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000609, LoopVectorization.compute, 0x00, 0x0a), :LoopVectorization, :log, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x000000000000000a, LoopVectorization.compute, 0x00, 0x0b), :LoopVectorization, :*, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x000000000000030b, LoopVectorization.compute, 0x00, 0x0c), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x0d), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c0d, LoopVectorization.compute, 0x00, 0x0e), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x04, 0x0f), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x05, 0x10), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x00000000000e0f10, LoopVectorization.compute, 0x00, 0x10), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000011, LoopVectorization.memstore, 0x05, 0x11), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x12), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c13, LoopVectorization.compute, 0x00, 0x13), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x07, 0x14), :numericconstant, Symbol("##reductzero#321"), LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x15), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000001, 0x0000000000000000, 0x0000000000140f16, LoopVectorization.compute, 0x00, 0x15), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000001715, LoopVectorization.compute, 0x00, 0x14), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000018, LoopVectorization.memstore, 0x07, 0x16), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x17), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c1a, LoopVectorization.compute, 0x00, 0x18), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x09, 0x19), :numericconstant, Symbol("##reductzero#327"), LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000002, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x1a), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000002, 0x0000000000000000, 0x00000000001b0f1d, LoopVectorization.compute, 0x00, 0x1a), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000001e1c, LoopVectorization.compute, 0x00, 0x19), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x000000000000001f, LoopVectorization.memstore, 0x09, 0x1b)}, Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:C,Symbol("##vptr##_C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥ℛℰ𝒮,Symbol("##vptr##_𝛥ℛℰ𝒮")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥B,Symbol("##vptr##_𝛥B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥B,Symbol("##vptr##_𝛥B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥C,Symbol("##vptr##_𝛥C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥C,Symbol("##vptr##_𝛥C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥A,Symbol("##vptr##_𝛥A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥A,Symbol("##vptr##_𝛥A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{}, Tuple{2, 5, 8}, Tuple{(19, 2), (26, 3)}, Tuple{}, Tuple{(22, LoopVectorization.IntOrFloat), (29, LoopVectorization.IntOrFloat)}, Tuple{(13, LoopVectorization.HardInt)}}, Tuple{:i, :j}, (var"##i_loop_lower_bound#299":var"##i_loop_upper_bound#300", var"##j_loop_lower_bound#302":var"##j_loop_upper_bound#303"), var"##vptr##_A", var"##vptr##_B", var"##vptr##_C", var"##vptr##_𝛥ℛℰ𝒮", var"##vptr##_𝛥B", var"##vptr##_𝛥B", var"##vptr##_𝛥C", var"##vptr##_𝛥C", var"##vptr##_𝛥A", var"##vptr##_𝛥A", 𝜀A, 𝜀B, 𝜀C, partials, partials, partials)), :A, :B, :C, :𝛥B, :𝛥ℛℰ𝒮, :𝛥C, :𝛥A))
    end
end

Notice that it's passing partials as arguments to the function _avx_!. It stripped the ForwardDiff. I'll fix this.

Also, I can't run your examples:

julia> f_fwd(A,B,C) = @tullio grad=Dual avx=false s[i] := A[i] * log(B[i,j] / C[j]);
ERROR: LoadError: can't understand LHS, expected A[i,j,k], got grad
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] _tullio(::Expr, ::Expr, ::Expr; multi::Bool, mod::Module) at /home/chriselrod/.julia/packages/Tullio/aaSZU/src/Tullio.jl:83
 [3] @tullio(::LineNumberNode, ::Module, ::Vararg{Any,N} where N) at /home/chriselrod/.julia/packages/Tullio/aaSZU/src/Tullio.jl:53
 [4] eval(::Module, ::Any) at ./boot.jl:331
 [5] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:132
 [6] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
 [7] top-level scope at none:1
in expression starting at REPL[37]:1

OK, great, had not looked hard at the expanded version. The alternative access like this gives immediate errors:

julia> @macroexpand @avx for i = 📏i
                  for j = 📏j
                      ℛℰ𝒮 = (A[i] + 𝜀A) * log((B[i, j] + 𝜀B) / (C[j] + 𝜀C))
                      𝛥B[i, j] = 𝛥B[i, j] + ℛℰ𝒮.partials[1] * 𝛥ℛℰ𝒮[i]
                   end
               end
ERROR: LoadError: TypeError: in typeassert, expected Symbol, got Expr

I think the error you get might be from master branch, which is last year's attempt, the right branch is coronavirusrewrite... sorry should get around to merging it.

Ah, checking out coronaviruswrite solves that problem.

Is getfield expected to work?

Not yet. PR's are welcome, otherwise I may get around to it, but probably not until after there's real support for non-hardware types.

On that note, what are A, B, and C? The same applies to the previous snippet I posted. If they're not arrays of hardware types, loading from them isn't supported yet either. These are things you could implement, preferably in the way the README demos adding support for Complex{Float64}. But you could also do it in the form of defining a load as a bunch of loads and shufflevectors. But I don't see where you're defining any such methods at the moment.

Could you also come up with a more minimal example of what you're trying to do? It's hard for me to see what's going on from the @tullio macro. I saw that it calls Main.💧82, which calls Main.💥82, and I stopped trying to dissect things from there.

A,B,C are arrays of Float64, and functions apply!/💥 are like mul! in that they work in-place, while create/💧 is like * that first makes the necessary output array. (I got tired of looking at gensyms!) Then ∇💧 is supposed to compute backwards gradients, and similarly calls in-place function ∇💥, which has some loops. Inside these loops there's a scalar computation which can be done using dual numbers, and this is the problematic part.

There are just three dual numbers as input (the same ones at every step), never an array of them. And ordinary numbers partials(res,1) are written back into arrays similar(A) etc. I was hoping that by juggling these Dual{SVec} things this inner "scalar" computation could be made efficient, these objects only have to exist for a few lines, and then get taken apart again.

I agree though that my example is not so minimal, sorry! I will have a go at boiling it down.

OK here's an example small enough just to paste in here:

using LoopVectorization, ForwardDiff, BenchmarkTools

function dualinv!(B,A)
    @assert size(A) == size(B)
    T = eltype(B)
    dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
    for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = ForwardDiff.partials(res,1)
    end
    B
end

dualinv!(rand(5), 1:5) == inv.(1:5)

function dualinv_avx!(B,A)
    @assert size(A) == size(B)
    T = eltype(B)
    dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
    @avx for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = ForwardDiff.partials(res,1)
    end
    B
end

dualinv_avx!(rand(5), collect(1:5.0)) # UndefVarError: partials not defined

using ForwardDiff: partials

dualinv_avx!(rand(5), collect(1:5.0)) # MethodError: no method matching +(::VectorizationBase.SVec{4,Float64}, ::ForwardDiff.Dual{Nothing,Float64,2})

using Tullio # branch coronavirusrewrite, just for Dual Svec definitions

dualinv_avx!(rand(5), collect(1:5.0)) == inv.(1:5)

A = rand(10^6); B = similar(A);

@btime dualinv!($B, $A); # 7.335 ms (0 allocations: 0 bytes)

@btime dualinv_avx!($B, $A); # 12.363 ms (750003 allocations: 49.59 MiB)

If you find basic SVec definitions missing that sound like they should be there, please file an issue (or pull request) at SIMDPirates or SLEEFPirates. SIMDPirates is full of llvmcalls, and SLEEFPirates has more elaborate definitions for special functions like exp and sin. Don't worry too much about which if it isn't immediately obvious to you.

This is for two reasons: 1) So everyone using LoopVectorization, or SIMDPirates/SLEEFPirates will have the definition available. 2) So I can make sure the implementation is efficient.

In the case of inv(::SVec{<:Any,<:Integer}), I simply forgot to define inv(v::SVec{<:Any,<:Integer}) = vinv(v). Note the difference in the code_native and resulting performance between SIMDPirates.vinv and Tullio.inv:

julia> using Tullio, SIMDPirates, BenchmarkTools

julia> xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
SVec{8,Int64}<1, 2, 3, 4, 5, 6, 7, 8>

julia> @code_native debuginfo=:none SIMDPirates.vinv(xi)
        .text
        movq    %rdi, %rax
        vcvtqq2pd       (%rsi), %zmm0
        movabsq $.rodata.cst8, %rcx
        vbroadcastsd    (%rcx), %zmm1
        vdivpd  %zmm0, %zmm1, %zmm0
        vmovapd %zmm0, (%rdi)
        vzeroupper
        retq
        nopl    (%rax)

julia> @code_native debuginfo=:none inv(xi)
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        andq    $-64, %rsp
        subq    $192, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %xmm0, 32(%rsp)
        movq    $0, 48(%rsp)
        movq    %fs:0, %rbx
        movq    $4, 32(%rsp)
        movq    -15712(%rbx), %rax
        movq    %rax, 40(%rsp)
        leaq    32(%rsp), %rax
        movq    %rax, -15712(%rbx)
        vmovups (%rdi), %zmm0
        vmovaps %zmm0, 64(%rsp)
        movabsq $ntuple, %rax
        leaq    64(%rsp), %rdi
        movl    $8, %esi
        vzeroupper
        callq   *%rax
        movq    %rax, 48(%rsp)
        movq    %rax, 56(%rsp)
        movabsq $jl_apply_generic, %rax
        movabsq $139793777923704, %rdi  # imm = 0x7F2446799678
        leaq    56(%rsp), %rsi
        movl    $1, %edx
        callq   *%rax
        movq    40(%rsp), %rcx
        movq    %rcx, -15712(%rbx)
        leaq    -8(%rbp), %rsp
        popq    %rbx
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)
        nopl    (%rax,%rax)

julia> @btime SIMDPirates.vinv($(Ref(xi))[])
  3.740 ns (0 allocations: 0 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>

julia> @btime inv($(Ref(xi))[])
  27.096 ns (2 allocations: 160 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>

OK, it would be great to fix things upstream. But do you just mean things in Base, or would SLEEFPirates be willing to depend on ForwardDiff to have these dual types?

Some of the same issues seem to arise for complex numbers, too. Here's a version of my example above, which similarly tries to use them at an intermediate step:

function cinv!(B,A)
    dx = im/100_000
    for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = 100_000 * imag(res)
    end
    B
end

cinv!(rand(5), collect(1:5.0)) ≈ inv.(1:5)

function cinv_avx!(B,A)
    dx = im/100_000
    @avx for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = 100_000 * imag(res)
    end
    B
end

cinv_avx!(rand(5), collect(1:5.0)) ≈ inv.(1:5) # MethodError: no method matching +(::VectorizationBase.SVec{4,Float64}, ::Complex{Float64})

And, here's what happens with inv on my computer (an i7-8700), @code_native differs at length 8 but not length 4:

julia> xi = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
SVec{4,Int64}<1, 2, 3, 4>

julia> @btime SIMDPirates.vinv($(Ref(xi))[])
  2.200 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>

julia> @btime Tullio.inv($(Ref(xi))[])
  2.200 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>

julia> xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
SVec{8,Int64}<1, 2, 3, 4, 5, 6, 7, 8>

julia> @btime Tullio.inv($(Ref(xi))[])
  28.119 ns (2 allocations: 160 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>

julia> @btime SIMDPirates.vinv($(Ref(xi))[])
Segmentation fault: 11

On the local versions (I plan to push by the end of the day):

julia> using LoopVectorization, ForwardDiff, BenchmarkTools

julia> using ForwardDiff: Dual, Partials

julia> using LoopVectorization: SVec

julia> ForwardDiff.can_dual(::Type{<:SVec}) = true

julia> @inline function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T<:Number,D,N,S}
           y = x.value + sv
           ps = ntuple(d -> x.partials.values[d] + zero(sv), Val(D)) # avoiding partials(x.d) didn't help
           TS = SVec{N,promote_type(T,S)}
           Dual{Z,TS,D}(y, Partials{D,TS}(ps))
       end

julia> @inline function Base.:+(sv::SVec{N,S}, x::Dual{Z,T,D}) where {Z,T<:Number,D,N,S}
           y = x.value + sv
           ps = ntuple(d -> x.partials.values[d] + zero(sv), Val(D))
           TS = SVec{N,promote_type(T,S)}
           Dual{Z,TS,D}(y, Partials{D,TS}(ps))
       end

julia> @inline function Base.:*(x::Dual{Z,SVec{N,T},D}, sv::SVec{N,S}) where {Z,T,D,N,S}
           y = x.value * sv
           ps = ntuple(d -> x.partials.values[d] * sv, Val(D))
           TS = SVec{N,promote_type(T,S)}
           Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
       end

julia> @inline function Base.:*(sv::SVec{N,S}, x::Dual{Z,SVec{N,T},D}) where {Z,T,D,N,S}
           y = sv * x.value
           ps = ntuple(d -> sv * x.partials.values[d], Val(D))
           TS = SVec{N,promote_type(T,S)}
           Dual{Z,TS,D}(y, Partials{D,TS}(ps))
       end

julia> @inline function Base.:*(p::Partials{D,SVec{N,T}}, sv::SVec{N,S}) where {T,D,N,S}
           TS = SVec{N,promote_type(T,S)}
           Partials{D,TS}(ntuple(d -> p.values[d] * sv, Val(D)))
       end

julia> @inline function Base.:*(sv::SVec{N,S}, p::Partials{D,SVec{N,T}}) where {T,D,N,S}
           TS = SVec{N,promote_type(T,S)}
           Partials{D,TS}(ntuple(d -> sv * p.values[d], Val(D)))
       end

julia> function dualinv!(B,A)
           @assert size(A) == size(B)
           T = eltype(B)
           dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
           for i in eachindex(A)
               res = log(A[i] + dx)
               B[i] = ForwardDiff.partials(res,1)
           end
           B
       end
dualinv! (generic function with 1 method)

julia> dualinv!(rand(5), 1:5) == inv.(1:5)
true

julia> function dualinv_avx!(B,A)
           @assert size(A) == size(B)
           T = eltype(B)
           dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
           @avx for i in eachindex(A)
               res = log(A[i] + dx)
               B[i] = ForwardDiff.partials(res,1)
           end
           B
       end
dualinv_avx! (generic function with 1 method)

julia> dualinv_avx!(rand(5), collect(1:5.0))
5-element Array{Float64,1}:
 1.0
 0.5
 0.3333333333333333
 0.25
 0.2

julia> dualinv_avx!(rand(5), 1:5.0)
5-element Array{Float64,1}:
 1.0
 0.5
 0.3333333333333333
 0.25
 0.2

julia> A = rand(10^6); B = similar(A);

julia> @btime dualinv!($B, $A); # 7.335 ms (0 allocations: 0 bytes)
  5.684 ms (0 allocations: 0 bytes)

julia> @btime dualinv_avx!($B, $A);
  662.525 μs (0 allocations: 0 bytes)

Note that two of your Base.:* methods had a Z in the where clause that didn't show up in the argument types. I wish that'd throw an error. Instead, it silently causes your code to allocate and run slower.

And, here's what happens with inv on my computer (an i7-8700), @code_native differs at length 8 but not length 4:

Interesting. I see the same thing you do:

julia> using Tullio, SIMDPirates, BenchmarkTools

julia> xi4 = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
SVec{4,Int64}<1, 2, 3, 4>

julia> inv(xi4)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>

julia> @which inv(xi4)
inv(sv::SVec{N,var"#s24"} where var"#s24"<:Integer) where N in Tullio at /home/chriselrod/.julia/packages/Tullio/kCDFo/src/forward.jl:97

julia> @btime inv($(Ref(xi4))[])
  1.876 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>

julia> @btime SIMDPirates.vinv($(Ref(xi4))[])
  1.879 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>

julia> versioninfo()
Julia Version 1.5.0-DEV.645
Commit 0e23ff2581* (2020-04-15 17:02 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

Maybe there's a limit causing the compiler to give up somewhere between 4 and 8.

julia> @btime SIMDPirates.vinv($(Ref(xi))[])
Segmentation fault: 11

That should be fixed on Julia master.

On older versions of Julia, it only occurs when an SVec (or any struct containing an NTuple{N,Core.VecElement{T}} for Ns about which Julia and LLVM disagreed on their alignment) is returned from a function.

So something like

using SIMDPirates
xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
@btime extract_data(SIMDPirates.vinv($(Ref(xi))[]))

Might work.

Ah indeed, that crash was Julia 1.3, but it's fixed on 1.5.

And thanks, deleting that unused where Z fixes this example completely. It's identical to a loop with B[i] = 1/A[i]. (You did say but I didn't realise this was such a big deal!)

It's insidious! It's one of the first things I look for when I run into surprising amounts of allocations.

And thanks, deleting that unused where Z fixes this example completely. It's identical to a loop with B[i] = 1/A[i].

Nice. I was expecting that, because Base.log(::SVec) is inlined, so LLVM should be able to delete it once it sees the result is unused and that it log has no side effects (which it can tell, thanks to the inlining).

Some other methods like Base.exp(::SVec) aren't inlined on all architectures, so this makes me wonder if I should make that consistent. It can sometimes be important, like here. (exp isn't inlined on Linux with recent GLIBC, because it calls the GLIBC version with ccall.)

OK, it would be great to fix things upstream. But do you just mean things in Base, or would SLEEFPirates be willing to depend on ForwardDiff to have these dual types?

Hmm. ForwardDiff is a heavy dependency.

julia> @time using LoopVectorization
  0.396693 seconds (912.65 k allocations: 63.532 MiB)

julia> @time using ForwardDiff
  2.400366 seconds (10.07 M allocations: 489.370 MiB, 8.63% gc time)

Maybe we can create a ForwardDiffSVec or ForwardDiffLoopVectorization library? Doesn't have to be those names. It'd take me a while to create a new open source library (I'd have to file for approval, etc). But if you don't mind also depending on PaddedMatrices, I could use PaddedMatricesForwardDiff (which will need updating).

Otherwise, you could create a library, and I can add to VectorizationBase, SIMDPirates, and SLEEFPirates' READMEs that it is the official/blessed library for ForwardDiff support on SVecs.

Maybe to LoopVectorization's documentation as well, given that there's an example (like the complex number one) demonstrating how to use it.

Looks like https://github.com/JuliaLang/julia/issues/29393 is the issue about dangling type parameters.

Will fiddle a bit more first, but PaddedMatricesForwardDiff doesn't sound like a bad option to store these methods, since it already depends on all the relevant things.

Okay. You're welcome to make PRs there yourself. It may be a couple months before I work on it myself.

JuliaSIMD / LoopVectorization.jl

Use with ForwardDiff #93