mcabbott opened this issue 4 years ago
The README also has an example with complex numbers.
Mind posting all the code for a reproducible example?
The allocations suggest there may be a type instability somewhere.
Or perhaps a method (possibly in my code) defined with excess parameters in the where, e.g. foo(a) where {T}. I wish that threw an error, instead of running slowly and allocating memory.
EDIT: The key to remember is that you should keep like values in their own registers/vectors. That means you should not make vectors of dual numbers.
Currently, the best thing to do would be to have separate SVecs for every one of the partials and for the value itself. That is because LoopVectorization currently assumes they each occupy 1 register. If you define a Struct-of-SVecs, which would be the long-term ideal solution, each of these will occupy multiple registers.
But it'd probably actually be easiest to take the Struct-of-SVecs approach, and specify @avx unroll=1 to make sure it doesn't do anything excessive. You can try unroll=2 or slightly larger numbers too.
By Struct-of-SVecs, I mean that your + definition should not be creating an SVec of Dual{<:Any,<:Number}, but a Dual{<:Any,<:SVec}.
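For illustration, here is a minimal sketch of that layout, assuming SIMDPirates' SVec and the ForwardDiff.can_dual opt-in that comes up later in this thread (the names v, ∂x, d are just for the example):

using SIMDPirates, ForwardDiff
using ForwardDiff: Dual, Partials

ForwardDiff.can_dual(::Type{<:SVec}) = true   # opt in: allow an SVec as a Dual's value type

v  = SVec(ntuple(i -> Core.VecElement(Float64(i)), Val(4)))  # four values, one register
∂x = SVec(ntuple(_ -> Core.VecElement(1.0), Val(4)))         # their ∂/∂x, another register

# one Dual whose value and partial are each a whole SVec -- not an SVec of Duals
d = Dual{Nothing,typeof(v),1}(v, Partials{1,typeof(v)}((∂x,)))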
I've also been trying to get back to working on defining a reverse-pass for LoopVectorization's loops, but I've been distracted for the past month.
OK, I think I see what you mean about keeping things as struct-of-SVecs, and a few definitions like this appear to be enough:
function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T,D,N,S}
    y = x.value + sv
    ps = ntuple(d -> partials(x,d) + zero(sv), Val(D))
    Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end
@code_warntype is happy with that, but speed is improved by being a bit more cautious and avoiding typeof(y); the current attempt is take 4 here. The example in which I was timing things is now here, not so minimal!
I start to think that the problem is things like ForwardDiff.partials(tmp, 1). Commenting these out (ideally leaving C[j] = dC[j] + one(T)) speeds things up greatly. I tried writing tmp.partials.values[1] instead, but then got errors from @avx. Likewise unroll=1 seems to cause errors on the gradient. Perhaps also strangely, I need using ForwardDiff: partials for things to work (only with @avx), which seems like some strange scope issue?
Looking at those benchmarks, there are a lot of allocations when using @avx, but not otherwise. Could you by any chance try to find out where they're coming from?
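One standard way to localise them, sketched here with the gradient call used in the benchmarks of this thread (requires starting Julia with the built-in allocation tracker):

using Profile
# start Julia with `julia --track-allocation=user`, then:
Tracker.gradient(sum∘unfill∘f_fwd_avx, A, B, C)   # run once so compilation allocations happen
Profile.clear_malloc_data()                       # reset the counters
Tracker.gradient(sum∘unfill∘f_fwd_avx, A, B, C)   # run the call of interest again
# on exit, every source file gains a *.mem companion with per-line allocation counts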
Perhaps also strangely, I need using ForwardDiff: partials for things to work (only with @avx), which seems like some strange scope issue?
Is that because of this definition?
function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T,D,N,S}
    y = x.value + sv
    ps = ntuple(d -> partials(x,d) + zero(sv), Val(D))
    Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end
Or, would similar code (not involving SVec) work? partials isn't normally exported, because ForwardDiff doesn't export anything.
Re ForwardDiff: partials, that is in scope where I define these methods for + etc., although the latest version uses x.partials.values[d] instead of the function. But inside the @avx loop I had to keep the function, and the macro writes it as $ForwardDiff.partials(x, $d), which is enough for the non-avx loop to see it.
Weirdly, after fiddling with seemingly unrelated things since then, this scope issue has gone away for code made by the macro. But it persists in the ∇apply!109 function I copied into that file. And they have very different numbers of allocations, although both are slow:
julia> @btime Tracker.gradient(sum∘unfill∘f_fwd_avx, $A, $B, $C); # newer macro's code
2.448 ms (89 allocations: 2.47 MiB)
julia> @btime Tracker.gradient(sum∘unfill∘create109, $A, $B, $C); # won't run without using ForwardDiff: partials
2.754 ms (180065 allocations: 18.03 MiB)
In ∇apply!109 there are some attempts to avoid this function by writing ℛℰ𝒮.partials[3] etc., all of which give errors from @avx. Is getfield expected to work?
Yes, that's definitely a problem with @avx:
julia> @macroexpand @avx for i = 📏i
           for j = 📏j
               ℛℰ𝒮 = (A[i] + 𝜀A) * log((B[i, j] + 𝜀B) / (C[j] + 𝜀C))
               𝛥B[i, j] = 𝛥B[i, j] + (ForwardDiff).partials(ℛℰ𝒮, 1) * 𝛥ℛℰ𝒮[i]
               𝛥C[j] = 𝛥C[j] + (ForwardDiff).partials(ℛℰ𝒮, 2) * 𝛥ℛℰ𝒮[i]
               𝛥A[i] = 𝛥A[i] + (ForwardDiff).partials(ℛℰ𝒮, 3) * 𝛥ℛℰ𝒮[i]
           end
       end
quote
var"##loopi#298" = LoopVectorization.maybestaticrange(📏i)
var"##i_loop_lower_bound#299" = LoopVectorization.maybestaticfirst(var"##loopi#298")
var"##i_loop_upper_bound#300" = LoopVectorization.maybestaticlast(var"##loopi#298")
var"##loopj#301" = LoopVectorization.maybestaticrange(📏j)
var"##j_loop_lower_bound#302" = LoopVectorization.maybestaticfirst(var"##loopj#301")
var"##j_loop_upper_bound#303" = LoopVectorization.maybestaticlast(var"##loopj#301")
var"##vptr##_A" = LoopVectorization.stridedpointer(A)
var"##vptr##_B" = LoopVectorization.stridedpointer(B)
var"##vptr##_C" = LoopVectorization.stridedpointer(C)
var"##vptr##_𝛥B" = LoopVectorization.stridedpointer(𝛥B)
var"##vptr##_𝛥ℛℰ𝒮" = LoopVectorization.stridedpointer(𝛥ℛℰ𝒮)
var"##vptr##_𝛥C" = LoopVectorization.stridedpointer(𝛥C)
var"##vptr##_𝛥A" = LoopVectorization.stridedpointer(𝛥A)
begin
$(Expr(:gc_preserve, :(LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x02), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000102, LoopVectorization.compute, 0x00, 0x03), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x04), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000405, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x07), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x08), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000708, LoopVectorization.compute, 0x00, 0x09), :LoopVectorization, :/, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000609, LoopVectorization.compute, 0x00, 0x0a), :LoopVectorization, :log, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x000000000000000a, LoopVectorization.compute, 0x00, 0x0b), :LoopVectorization, :*, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x000000000000030b, LoopVectorization.compute, 0x00, 0x0c), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x0d), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c0d, LoopVectorization.compute, 0x00, 0x0e), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x04, 0x0f), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x05, 0x10), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x00000000000e0f10, LoopVectorization.compute, 0x00, 0x10), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000011, LoopVectorization.memstore, 0x05, 0x11), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 
0x0000000000000000, LoopVectorization.constant, 0x00, 0x12), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c13, LoopVectorization.compute, 0x00, 0x13), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x07, 0x14), :numericconstant, Symbol("##reductzero#321"), LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x15), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000001, 0x0000000000000000, 0x0000000000140f16, LoopVectorization.compute, 0x00, 0x15), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000001715, LoopVectorization.compute, 0x00, 0x14), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000018, LoopVectorization.memstore, 0x07, 0x16), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x17), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c1a, LoopVectorization.compute, 0x00, 0x18), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x09, 0x19), :numericconstant, Symbol("##reductzero#327"), LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000002, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x1a), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000002, 0x0000000000000000, 0x00000000001b0f1d, LoopVectorization.compute, 0x00, 0x1a), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000001e1c, LoopVectorization.compute, 0x00, 0x19), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x000000000000001f, LoopVectorization.memstore, 0x09, 0x1b)}, Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:C,Symbol("##vptr##_C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥ℛℰ𝒮,Symbol("##vptr##_𝛥ℛℰ𝒮")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥B,Symbol("##vptr##_𝛥B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥B,Symbol("##vptr##_𝛥B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥C,Symbol("##vptr##_𝛥C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥C,Symbol("##vptr##_𝛥C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), 
LoopVectorization.ArrayRefStruct{:𝛥A,Symbol("##vptr##_𝛥A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥A,Symbol("##vptr##_𝛥A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{}, Tuple{2, 5, 8}, Tuple{(19, 2), (26, 3)}, Tuple{}, Tuple{(22, LoopVectorization.IntOrFloat), (29, LoopVectorization.IntOrFloat)}, Tuple{(13, LoopVectorization.HardInt)}}, Tuple{:i, :j}, (var"##i_loop_lower_bound#299":var"##i_loop_upper_bound#300", var"##j_loop_lower_bound#302":var"##j_loop_upper_bound#303"), var"##vptr##_A", var"##vptr##_B", var"##vptr##_C", var"##vptr##_𝛥ℛℰ𝒮", var"##vptr##_𝛥B", var"##vptr##_𝛥B", var"##vptr##_𝛥C", var"##vptr##_𝛥C", var"##vptr##_𝛥A", var"##vptr##_𝛥A", 𝜀A, 𝜀B, 𝜀C, partials, partials, partials)), :A, :B, :C, :𝛥B, :𝛥ℛℰ𝒮, :𝛥C, :𝛥A))
end
end
Notice that it's passing partials as arguments to the function _avx_!. It stripped the ForwardDiff qualifier. I'll fix this.
Also, I can't run your examples:
julia> f_fwd(A,B,C) = @tullio grad=Dual avx=false s[i] := A[i] * log(B[i,j] / C[j]);
ERROR: LoadError: can't understand LHS, expected A[i,j,k], got grad
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] _tullio(::Expr, ::Expr, ::Expr; multi::Bool, mod::Module) at /home/chriselrod/.julia/packages/Tullio/aaSZU/src/Tullio.jl:83
[3] @tullio(::LineNumberNode, ::Module, ::Vararg{Any,N} where N) at /home/chriselrod/.julia/packages/Tullio/aaSZU/src/Tullio.jl:53
[4] eval(::Module, ::Any) at ./boot.jl:331
[5] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:132
[6] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
[7] top-level scope at none:1
in expression starting at REPL[37]:1
OK, great, I had not looked hard at the expanded version. The alternative access like this gives immediate errors:
julia> @macroexpand @avx for i = 📏i
           for j = 📏j
               ℛℰ𝒮 = (A[i] + 𝜀A) * log((B[i, j] + 𝜀B) / (C[j] + 𝜀C))
               𝛥B[i, j] = 𝛥B[i, j] + ℛℰ𝒮.partials[1] * 𝛥ℛℰ𝒮[i]
           end
       end
ERROR: LoadError: TypeError: in typeassert, expected Symbol, got Expr
I think the error you get might be from the master branch, which is last year's attempt; the right branch is coronavirusrewrite... sorry, I should get around to merging it.
Ah, checking out coronavirusrewrite solves that problem.
Is getfield expected to work?
Not yet. PR's are welcome, otherwise I may get around to it, but probably not until after there's real support for non-hardware types.
On that note, what are A, B, and C? The same applies to the previous snippet I posted. If they're not arrays of hardware types, loading from them isn't supported yet either.
These are things you could implement, preferably in the way the README demos adding support for Complex{Float64}. But you could also do it in the form of defining a load as a bunch of loads and shufflevectors.
But I don't see where you're defining any such methods at the moment.
Could you also come up with a more minimal example of what you're trying to do? It's hard for me to see what's going on from the @tullio macro. I saw that it calls Main.💧82, which calls Main.💥82, and I stopped trying to dissect things from there.
A, B, C are arrays of Float64, and the functions apply!/💥 are like mul! in that they work in-place, while create/💧 is like * in that it first makes the necessary output array. (I got tired of looking at gensyms!) Then ∇💧 is supposed to compute backwards gradients, and similarly calls the in-place function ∇💥, which has some loops. Inside these loops there's a scalar computation which can be done using dual numbers, and this is the problematic part.
There are just three dual numbers as input (the same ones at every step), never an array of them. And ordinary numbers partials(res,1) are written back into arrays similar(A) etc. I was hoping that by juggling these Dual{SVec} things this inner "scalar" computation could be made efficient; these objects only have to exist for a few lines, and then get taken apart again.
I agree though that my example is not so minimal, sorry! I will have a go at boiling it down.
OK here's an example small enough just to paste in here:
using LoopVectorization, ForwardDiff, BenchmarkTools
function dualinv!(B,A)
    @assert size(A) == size(B)
    T = eltype(B)
    dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
    for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = ForwardDiff.partials(res,1)
    end
    B
end
dualinv!(rand(5), 1:5) == inv.(1:5)
function dualinv_avx!(B,A)
    @assert size(A) == size(B)
    T = eltype(B)
    dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
    @avx for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = ForwardDiff.partials(res,1)
    end
    B
end
dualinv_avx!(rand(5), collect(1:5.0)) # UndefVarError: partials not defined
using ForwardDiff: partials
dualinv_avx!(rand(5), collect(1:5.0)) # MethodError: no method matching +(::VectorizationBase.SVec{4,Float64}, ::ForwardDiff.Dual{Nothing,Float64,2})
using Tullio # branch coronavirusrewrite, just for Dual Svec definitions
dualinv_avx!(rand(5), collect(1:5.0)) == inv.(1:5)
A = rand(10^6); B = similar(A);
@btime dualinv!($B, $A); # 7.335 ms (0 allocations: 0 bytes)
@btime dualinv_avx!($B, $A); # 12.363 ms (750003 allocations: 49.59 MiB)
If you find basic SVec definitions missing that sound like they should be there, please file an issue (or pull request) at SIMDPirates or SLEEFPirates. SIMDPirates is full of llvmcalls, and SLEEFPirates has more elaborate definitions for special functions like exp and sin. Don't worry too much about which if it isn't immediately obvious to you.
This is for two reasons: 1) so everyone using LoopVectorization, or SIMDPirates/SLEEFPirates, will have the definition available; 2) so I can make sure the implementation is efficient.
In the case of inv(::SVec{<:Any,<:Integer}), I simply forgot to define inv(v::SVec{<:Any,<:Integer}) = vinv(v).
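For reference, a sketch of that one-line fix plus a quick check (upstream it belongs inside SIMDPirates itself rather than being defined from user code):

using SIMDPirates
using SIMDPirates: vinv

Base.inv(v::SVec{<:Any,<:Integer}) = vinv(v)   # the forgotten integer-SVec method

xi = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
inv(xi)   # SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>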
Note the difference in the code_native output and the resulting performance between SIMDPirates.vinv and Tullio.inv:
julia> using Tullio, SIMDPirates, BenchmarkTools
julia> xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
SVec{8,Int64}<1, 2, 3, 4, 5, 6, 7, 8>
julia> @code_native debuginfo=:none SIMDPirates.vinv(xi)
.text
movq %rdi, %rax
vcvtqq2pd (%rsi), %zmm0
movabsq $.rodata.cst8, %rcx
vbroadcastsd (%rcx), %zmm1
vdivpd %zmm0, %zmm1, %zmm0
vmovapd %zmm0, (%rdi)
vzeroupper
retq
nopl (%rax)
julia> @code_native debuginfo=:none inv(xi)
.text
pushq %rbp
movq %rsp, %rbp
pushq %rbx
andq $-64, %rsp
subq $192, %rsp
vxorps %xmm0, %xmm0, %xmm0
vmovaps %xmm0, 32(%rsp)
movq $0, 48(%rsp)
movq %fs:0, %rbx
movq $4, 32(%rsp)
movq -15712(%rbx), %rax
movq %rax, 40(%rsp)
leaq 32(%rsp), %rax
movq %rax, -15712(%rbx)
vmovups (%rdi), %zmm0
vmovaps %zmm0, 64(%rsp)
movabsq $ntuple, %rax
leaq 64(%rsp), %rdi
movl $8, %esi
vzeroupper
callq *%rax
movq %rax, 48(%rsp)
movq %rax, 56(%rsp)
movabsq $jl_apply_generic, %rax
movabsq $139793777923704, %rdi # imm = 0x7F2446799678
leaq 56(%rsp), %rsi
movl $1, %edx
callq *%rax
movq 40(%rsp), %rcx
movq %rcx, -15712(%rbx)
leaq -8(%rbp), %rsp
popq %rbx
popq %rbp
retq
nopw %cs:(%rax,%rax)
nopl (%rax,%rax)
julia> @btime SIMDPirates.vinv($(Ref(xi))[])
3.740 ns (0 allocations: 0 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>
julia> @btime inv($(Ref(xi))[])
27.096 ns (2 allocations: 160 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>
OK, it would be great to fix things upstream. But do you just mean things in Base, or would SLEEFPirates be willing to depend on ForwardDiff to have these dual types?
Some of the same issues seem to arise for complex numbers, too. Here's a version of my example above, which similarly tries to use them at an intermediate step:
function cinv!(B,A)
    dx = im/100_000
    for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = 100_000 * imag(res)
    end
    B
end
cinv!(rand(5), collect(1:5.0)) ≈ inv.(1:5)
function cinv_avx!(B,A)
    dx = im/100_000
    @avx for i in eachindex(A)
        res = log(A[i] + dx)
        B[i] = 100_000 * imag(res)
    end
    B
end
cinv_avx!(rand(5), collect(1:5.0)) ≈ inv.(1:5) # MethodError: no method matching +(::VectorizationBase.SVec{4,Float64}, ::Complex{Float64})
And, here's what happens with inv on my computer (an i7-8700); @code_native differs at length 8 but not at length 4:
julia> xi = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
SVec{4,Int64}<1, 2, 3, 4>
julia> @btime SIMDPirates.vinv($(Ref(xi))[])
2.200 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> @btime Tullio.inv($(Ref(xi))[])
2.200 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
SVec{8,Int64}<1, 2, 3, 4, 5, 6, 7, 8>
julia> @btime Tullio.inv($(Ref(xi))[])
28.119 ns (2 allocations: 160 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>
julia> @btime SIMDPirates.vinv($(Ref(xi))[])
Segmentation fault: 11
On the local versions (I plan to push by the end of the day):
julia> using LoopVectorization, ForwardDiff, BenchmarkTools
julia> using ForwardDiff: Dual, Partials
julia> using LoopVectorization: SVec
julia> ForwardDiff.can_dual(::Type{<:SVec}) = true
julia> @inline function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T<:Number,D,N,S}
y = x.value + sv
ps = ntuple(d -> x.partials.values[d] + zero(sv), Val(D)) # avoiding partials(x.d) didn't help
TS = SVec{N,promote_type(T,S)}
Dual{Z,TS,D}(y, Partials{D,TS}(ps))
end
julia> @inline function Base.:+(sv::SVec{N,S}, x::Dual{Z,T,D}) where {Z,T<:Number,D,N,S}
y = x.value + sv
ps = ntuple(d -> x.partials.values[d] + zero(sv), Val(D))
TS = SVec{N,promote_type(T,S)}
Dual{Z,TS,D}(y, Partials{D,TS}(ps))
end
julia> @inline function Base.:*(x::Dual{Z,SVec{N,T},D}, sv::SVec{N,S}) where {Z,T,D,N,S}
y = x.value * sv
ps = ntuple(d -> x.partials.values[d] * sv, Val(D))
TS = SVec{N,promote_type(T,S)}
Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end
julia> @inline function Base.:*(sv::SVec{N,S}, x::Dual{Z,SVec{N,T},D}) where {Z,T,D,N,S}
y = sv * x.value
ps = ntuple(d -> sv * x.partials.values[d], Val(D))
TS = SVec{N,promote_type(T,S)}
Dual{Z,TS,D}(y, Partials{D,TS}(ps))
end
julia> @inline function Base.:*(p::Partials{D,SVec{N,T}}, sv::SVec{N,S}) where {T,D,N,S}
TS = SVec{N,promote_type(T,S)}
Partials{D,TS}(ntuple(d -> p.values[d] * sv, Val(D)))
end
julia> @inline function Base.:*(sv::SVec{N,S}, p::Partials{D,SVec{N,T}}) where {T,D,N,S}
TS = SVec{N,promote_type(T,S)}
Partials{D,TS}(ntuple(d -> sv * p.values[d], Val(D)))
end
julia> function dualinv!(B,A)
@assert size(A) == size(B)
T = eltype(B)
dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
for i in eachindex(A)
res = log(A[i] + dx)
B[i] = ForwardDiff.partials(res,1)
end
B
end
dualinv! (generic function with 1 method)
julia> dualinv!(rand(5), 1:5) == inv.(1:5)
true
julia> function dualinv_avx!(B,A)
@assert size(A) == size(B)
T = eltype(B)
dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
@avx for i in eachindex(A)
res = log(A[i] + dx)
B[i] = ForwardDiff.partials(res,1)
end
B
end
dualinv_avx! (generic function with 1 method)
julia> dualinv_avx!(rand(5), collect(1:5.0))
5-element Array{Float64,1}:
1.0
0.5
0.3333333333333333
0.25
0.2
julia> dualinv_avx!(rand(5), 1:5.0)
5-element Array{Float64,1}:
1.0
0.5
0.3333333333333333
0.25
0.2
julia> A = rand(10^6); B = similar(A);
julia> @btime dualinv!($B, $A); # 7.335 ms (0 allocations: 0 bytes)
5.684 ms (0 allocations: 0 bytes)
julia> @btime dualinv_avx!($B, $A);
662.525 μs (0 allocations: 0 bytes)
Note that two of your Base.:* methods had a Z in the where clause that didn't show up in the argument types. I wish that'd throw an error. Instead, it silently causes your code to allocate and run slower.
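To make the pitfall concrete, a minimal sketch of the same mistake in isolation (the names are just for illustration; the underlying problem is https://github.com/JuliaLang/julia/issues/29393, and the exact cost depends on the Julia version):

f_unused(x::Vector{Float64}) where {Z} = sum(x)   # Z never appears in the argument types
f_clean(x::Vector{Float64}) = sum(x)

using BenchmarkTools
x = rand(10^4);
@btime f_unused($x)   # typically slower, and may allocate, on affected Julia versions
@btime f_clean($x)    # 0 allocations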
And, here's what happens with inv on my computer (an i7-8700), @code_native differs at length 8 but not length 4:
Interesting. I see the same thing you do:
julia> using Tullio, SIMDPirates, BenchmarkTools
julia> xi4 = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
SVec{4,Int64}<1, 2, 3, 4>
julia> inv(xi4)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> @which inv(xi4)
inv(sv::SVec{N,var"#s24"} where var"#s24"<:Integer) where N in Tullio at /home/chriselrod/.julia/packages/Tullio/kCDFo/src/forward.jl:97
julia> @btime inv($(Ref(xi4))[])
1.876 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> @btime SIMDPirates.vinv($(Ref(xi4))[])
1.879 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> versioninfo()
Julia Version 1.5.0-DEV.645
Commit 0e23ff2581* (2020-04-15 17:02 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Maybe there's a limit causing the compiler to give up somewhere between 4 and 8.
julia> @btime SIMDPirates.vinv($(Ref(xi))[]) Segmentation fault: 11
That should be fixed on Julia master. On older versions of Julia, it only occurs when an SVec (or any struct containing an NTuple{N,Core.VecElement{T}} for Ns about which Julia and LLVM disagreed on their alignment) is returned from a function.
So something like
using SIMDPirates
xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
@btime extract_data(SIMDPirates.vinv($(Ref(xi))[]))
might work.
Ah indeed, that crash was on Julia 1.3, but it's fixed on 1.5.
And thanks, deleting that unused where Z fixes this example completely. It's identical to a loop with B[i] = 1/A[i]. (You did say so, but I didn't realise this was such a big deal!)
It's insidious! It's one of the first things I look for when I run into surprising amounts of allocations.
And thanks, deleting that unused where Z fixes this example completely. It's identical to a loop with B[i] = 1/A[i].
Nice. I was expecting that, because Base.log(::SVec) is inlined, so LLVM should be able to delete it once it sees the result is unused and that log has no side effects (which it can tell, thanks to the inlining).
Some other methods like Base.exp(::SVec) aren't inlined on all architectures, so this makes me wonder if I should make that consistent. It can sometimes be important, like here. (exp isn't inlined on Linux with recent GLIBC, because it calls the GLIBC version with ccall.)
OK, it would be great to fix things upstream. But do you just mean things in Base, or would SLEEFPirates be willing to depend on ForwardDiff to have these dual types?
Hmm. ForwardDiff is a heavy dependency.
julia> @time using LoopVectorization
0.396693 seconds (912.65 k allocations: 63.532 MiB)
julia> @time using ForwardDiff
2.400366 seconds (10.07 M allocations: 489.370 MiB, 8.63% gc time)
Maybe we can create a ForwardDiffSVec or ForwardDiffLoopVectorization library? Doesn't have to be those names.
It'd take me a while to create a new open source library (I'd have to file for approval, etc). But if you don't mind also depending on PaddedMatrices, I could use PaddedMatricesForwardDiff (which will need updating).
Otherwise, you could create a library, and I can add to VectorizationBase, SIMDPirates, and SLEEFPirates' READMEs that it is the official/blessed library for ForwardDiff support on SVecs. Maybe to LoopVectorization's documentation as well, given that there's an example (like the complex number one) demonstrating how to use it.
Looks like https://github.com/JuliaLang/julia/issues/29393 is the issue about dangling type parameters.
Will fiddle a bit more first, but PaddedMatricesForwardDiff doesn't sound like a bad option to store these methods, since it already depends on all the relevant things.
Okay. You're welcome to make PRs there yourself. It may be a couple months before I work on it myself.
I was trying to use this with ForwardDiff, but not an array of structs like #19, just using dual numbers within a loop and then extracting ordinary numbers. Something like this:
I got this to run with @avx on the loop, by filling in whatever methods were missing, such as: however, the result is slower than without, and has many more allocations.
I wonder whether this is expected to work, and whether adding methods like this is the right thing to do? I can tidy up an example if there isn't an obvious fatal flaw here.