klowrey opened this issue 2 years ago

LV has been awesome for effortlessly speeding up code! Thanks! I've recently run into some errors I don't know enough about to fix, reproduced with errors below on version v0.12.96. They're modifications of the image filtering example code. Currently the errors seem a bit too specific for me to address, and I was hoping for help, or to report a bug if there is one.

In short, my project involves kernels applied over sparse, specific points rather than every index of a matrix, so only convolving the kernel at specific `CartesianIndex`es seems elegant. The `outerr` issue seems to take offence at making the `CartesianIndex` itself, while the `inerr` issue is beyond me but may be related; maybe `@turbo` doesn't like the `CartesianIndex` being dynamically created? The nominal image filtering example works without problems. In a previous version of LV, I had a block of code similar to the `inerr` function below working.

---

> maybe `@turbo` doesn't like the CartesianIndex being dynamically created?
Yes. `@turbo` actually doesn't like `CartesianIndex` at all. It has special handling for `CartesianIndices`, where it turns `CartesianIndices{N}` into `N` separate loops, thereby avoiding `CartesianIndex` altogether.
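To illustrate the distinction, a minimal sketch (the function and the matrix `A` here are just for illustration, not code from this issue): iterating a `CartesianIndices` range works because `@turbo` can split it into one loop per dimension, whereas building a `CartesianIndex` by hand inside the loop body is the unsupported pattern.

```julia
using LoopVectorization

# Supported: @turbo lowers the CartesianIndices{2} iterator into two nested loops,
# so no CartesianIndex object ever exists in the generated code.
function sum_turbo(A::AbstractMatrix)
    s = zero(eltype(A))
    @turbo for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end

# Not supported: constructing the index yourself inside the body,
# e.g. `A[CartesianIndex(i, j)]`, is the pattern @turbo rejects.
```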
If someone is interested in fixing this, I can provide instructions/guidance. Otherwise, this will have to wait for the LoopVectorization rewrite, which I'm trying to prioritize at the moment. Thus, I wouldn't expect it to be solved for at least the next half year; I'm expecting it to be a while.
Without using `CartesianIndices`, we can opt for loops instead. I experimented with summing at different levels in the stack of `for` loops and came across what might be a bug in the `ininsum` function below. The major difference between the functions is where the sums are accumulated; `ininsum` accumulates at two different levels, which results in a total (outer-loop) sum of 0.0 when used with `@turbo`. The differences in total sums are probably due to fastmath, but they can be surprisingly large sometimes.
```julia
using LoopVectorization, OffsetArrays, BenchmarkTools
using Images  # for Images.Kernel.gaussian, which returns an OffsetArray

x = rand(100, 100)
kern = Images.Kernel.gaussian((1, 1), (3, 3))

# Baseline: plain @inbounds, no @turbo.
function noturbosum(x, kern)
    ks = zero(eltype(x))
    @inbounds for i in 2:size(x, 1)-1
        for j in axes(kern, 1), m in axes(kern, 2)
            ks += x[i+j, i+m] * kern[j, m]
        end
    end
    ks
end

# Accumulate everything directly into the outer `ks`.
function outersum(x, kern)
    ks = zero(eltype(x))
    @turbo for i in 2:size(x, 1)-1
        for j in axes(kern, 1), m in axes(kern, 2)
            ks += x[i+j, i+m] * kern[j, m]
        end
    end
    ks
end

# One intermediate accumulator per outer iteration.
function insum(x, kern)
    ks = zero(eltype(x))
    @turbo for i in 2:size(x, 1)-1
        s1 = zero(eltype(x))
        for j in axes(kern, 1), m in axes(kern, 2)
            s1 += x[i+j, i+m] * kern[j, m]
        end
        ks += s1
    end
    ks
end

# One intermediate accumulator per middle iteration.
function in2sum(x, kern)
    ks = zero(eltype(x))
    @turbo for i in 2:size(x, 1)-1
        for j in axes(kern, 1)
            s0 = zero(eltype(x))
            for m in axes(kern, 2)
                s0 += x[i+j, i+m] * kern[j, m]
            end
            ks += s0
        end
    end
    ks
end

# Intermediate accumulators at two nested levels: returns 0.0 under @turbo.
function ininsum(x, kern)
    ks = zero(eltype(x))
    @turbo for i in 2:size(x, 1)-1
        s1 = zero(eltype(x))
        for j in axes(kern, 1)
            s0 = zero(eltype(x))
            for m in axes(kern, 2)
                s0 += x[i+j, i+m] * kern[j, m]
            end
            s1 += s0
        end
        ks += s1
    end
    ks
end

println("noturbo ", noturbosum(x, kern))
println("outer   ", outersum(x, kern))
println("in      ", insum(x, kern))
println("in2     ", in2sum(x, kern))
println("inin    ", ininsum(x, kern))
display(@btime outersum($x, $kern))
display(@btime insum($x, $kern))
display(@btime in2sum($x, $kern))
display(@btime ininsum($x, $kern))
```
```
noturbo 49.06428455119616
outer   46.612923559459894
in      46.447873809237635
in2     46.447873809237635
inin    0.0
  3.379 μs (0 allocations: 0 bytes)
46.612923559459894
  302.369 ns (0 allocations: 0 bytes)
46.447873809237635
  168.696 ns (0 allocations: 0 bytes)
46.447873809237635
  23.303 ns (0 allocations: 0 bytes)
0.0
```
I probably can't dedicate time to fixing `CartesianIndices`, but would love instructions/guidance on doing SIMD with dual numbers (I noticed the https://github.com/JuliaSIMD/SIMDDualNumbers.jl package and was trying to grok it). For the functions I'm writing, I've found that manually using dual numbers from ForwardDiff.jl, rather than going through their gradient/derivative function interface, is much more performant, but I'm not sure how to use them with SIMD/LV, etc., and would love guidance!
> The differences in total sums are probably due to fastmath, but they can be surprisingly large sometimes.
Differences that big are definitely bugs. The bug in the first three was straightforward to fix, so I did. `ininsum` is trickier, so I'll also leave that to the rewrite.
Adding `@fastmath` to `noturbosum`, I get:
```julia
julia> display(@btime noturbosum_nofm($x, $kern)) # no @fastmath, no @turbo
  965.667 ns (0 allocations: 0 bytes)
225.36596301915785

julia> display(@btime noturbosum($x, $kern)) # @fastmath, no @turbo
  991.838 ns (0 allocations: 0 bytes)
225.36596301915785

julia> display(@btime outersum($x, $kern))
  224.695 ns (0 allocations: 0 bytes)
225.365963019158

julia> display(@btime insum($x, $kern))
  378.662 ns (0 allocations: 0 bytes)
225.36596301915802

julia> display(@btime in2sum($x, $kern))
  273.545 ns (0 allocations: 0 bytes)
225.36596301915804

julia> display(@btime ininsum($x, $kern))
  125.409 ns (6 allocations: 304 bytes)
0.0
```
> For the functions I'm writing, I've found that manually using dual numbers from ForwardDiff.jl, rather than going through their gradient/derivative function interface, is much more performant, but I'm not sure how to use them with SIMD/LV, etc., and would love guidance!
You should be able to solve multiple `ForwardDiff.gradient`s in parallel. You can see some examples here:
https://discourse.julialang.org/t/making-a-fast-evaluation-of-a-function-and-its-gradient/68884

Something I'd like to get around to eventually is supporting the use of `VectorizationBase.Vec`s for dual-number partials, to make sure they SIMD.
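As a minimal sketch of the "manual dual numbers" idea (the function `f` and the helper `value_and_derivative` below are illustrative, not an existing API): seeding a `Dual`'s partial with one gives the value and derivative in a single evaluation.

```julia
using ForwardDiff: Dual, value, partials

f(x) = x^2 + sin(x)  # an example scalar function

# Evaluate f(x) and f'(x) in one pass: the partial of the input is seeded
# with 1, so the partial of the output is the derivative.
function value_and_derivative(f, x::T) where {T<:Real}
    d = f(Dual(x, one(T)))
    return value(d), partials(d, 1)
end

value_and_derivative(f, 0.5)  # ≈ (0.72943, 1.87758)
```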
Sorry to revive this, but is it possible to give a hint about what is causing the first error message?

```
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace: [etc]
```
I have a function with a single `for` loop which I want to call in a nested way:

```julia
@inline function integrate_avx(args)
    @turbo for i in 1:length(x)
        # stuff
        s += w[i] * (f(xm) + f(xp)) # a reduction
    end
    return s
end
```
then I want to write my 2D function based on the 1D one, namely:

```julia
function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        res = integrate_avx(g1, xmin[2], xmax[2], x, w, h)
        return res
    end
    g2(x1::T) where {T} = f1(x1)
    res = integrate_avx(g2, xmin[1], xmax[1], x, w, h)
    return res
end
```
This gives me:

```
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
  [1] fieldcount(t::Any)
    @ Base ./reflection.jl:895
  [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Core.Box})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:12
  [3] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{ScalarWave.TanhSinhQuadrature.var"#f1#104"{var"#77#78", SVector{2, Float64}, SVector{2, Float64}, Vector{Float64}, Vector{Float64}, Float64}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [4] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{ScalarWave.TanhSinhQuadrature.var"#g2#106"{ScalarWave.TanhSinhQuadrature.var"#f1#104"{var"#77#78", SVector{…}, SVector{…}, Vector{…}, Vector{…}, Float64}}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [5] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Tuple{LayoutPointers.GroupedStridedPointers{…}, Float64, Float64, ScalarWave.TanhSinhQuadrature.var"#g2#106"{…}, ScalarWave.TanhSinhQuadrature.var"#g2#106"{…}, DataType}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [6] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Tuple{Tuple{…}, Tuple{…}}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [7] #s223#66
    @ ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:34 [inlined]
  [8] var"#s223#66"(T::Any, ::Any, r::Any)
    @ LoopVectorization ./none:0
  [9] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
    @ Core ./boot.jl:602
 [10] integrate_avx
    @ ~/Codes/PhD/ScalarWave/src/TanhSinhQuadrature.jl:143 [inlined]
 [11] integrate_avx(f::var"#77#78", xmin::SVector{2, Float64}, xmax::SVector{2, Float64}, x::Vector{Float64}, w::Vector{Float64}, h::Float64)
    @ ScalarWave.TanhSinhQuadrature ~/Codes/PhD/ScalarWave/src/TanhSinhQuadrature.jl:173
 [12] top-level scope
    @ REPL[231]:1
 [13] top-level scope
    @ ~/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1428
Some type information was truncated. Use `show(err)` to see complete types.
```
> ```
> ERROR: ArgumentError: type does not have a definite number of fields
> Stacktrace:
>   [1] fieldcount(t::Any)
>     @ Base ./reflection.jl:895
>   [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Core.Box})
> ```

You have a `Core.Box` in your closure. Get rid of it.
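To illustrate with a toy example (mine, not code from this thread): a captured variable that is reassigned after capture gets boxed, while capturing a fresh binding through `let` does not.

```julia
# `r` is reassigned after the closure captures it, so Julia cannot prove a
# single binding and stores it in a Core.Box, making the closure type-unstable.
function boxed()
    r = 0
    g() = r      # closure captures `r`
    r = 1        # reassignment after capture forces the Core.Box
    return g()
end

# Capturing via `let` creates a new, never-reassigned binding: no box.
function unboxed()
    r = 1
    g = let r = r
        () -> r
    end
    return g()
end
```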
Thanks for your answer! Indeed, writing the closure a bit differently eliminates the above error. But the "outer" call is again not suitable for SIMD?
```julia
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
                            x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return integrate_avx(x1 -> f2(x1), xmin[1], xmax[1], x, w, h) # <- this doesn't pass the check_args
end
```

```
┌ Warning: #= /home/svretina/Codes/PhD/FastTanhSinhQuadrature.jl/src/FastTanhSinhQuadrature.jl:158 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ FastTanhSinhQuadrature ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:1148
```
As an aside,

```julia
function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        res = integrate_avx(g1, xmin[2], xmax[2], x, w, h)
        return res
    end
    g2(x1::T) where {T} = f1(x1)
    res = integrate_avx(g2, xmin[1], xmax[1], x, w, h)
    return res
end
```

looks like it is going to be putting `res` in a `Core.Box`. I would simplify it to

```julia
function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        return integrate_avx(g1, xmin[2], xmax[2], x, w, h)
    end
    g2(x1::T) where {T} = f1(x1)
    return integrate_avx(g2, xmin[1], xmax[1], x, w, h)
end
```

to avoid that.
What are the types? Do you have a reproducer I can copy/paste?

You could try `dev`ing the package and modifying the check functions to print the types when they return `false`, so you know what the particular problematic input was:

https://github.com/JuliaSIMD/LoopVectorization.jl/blob/3fbe248d3d7c2d4fc7c55dd85890b8cfec83be1d/src/condense_loopset.jl#L933-L977
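A quick sketch of that workflow, for anyone unfamiliar with `dev`ing (the paths are the Pkg defaults; the exact print statement you add is up to you):

```julia
using Pkg
Pkg.develop("LoopVectorization")  # checks out an editable copy under ~/.julia/dev/LoopVectorization

# Then edit src/condense_loopset.jl in that checkout, e.g. adding a @show of
# the offending argument types on the `return false` paths, and restart Julia
# (or use Revise.jl) to pick up the change.
```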
The code is from here, but here is an MWE:
```julia
using StaticArrays
using LoopVectorization

@inline function ordinate(t::T) where {T<:Real}
    return @fastmath tanh(T(π) / 2 * sinh(t))
end

@inline function weight(t::T) where {T<:Real}
    @fastmath tmp = cosh(T(π) / 2 * sinh(t))
    return @fastmath ((T(π) / 2) * cosh(t)) / (tmp * tmp)
end

@inline function inv_ordinate(t::T) where {T<:Real}
    return @fastmath asinh(log((one(T) + t) / (one(T) - t)) / T(π))
end

function tanhsinh(::Type{T}, n::Int) where {T<:AbstractFloat}
    tmax = inv_ordinate(prevfloat(one(T)))
    h = tmax / n
    t = h:h:tmax
    x = ordinate.(t)
    w = weight.(t)
    N = length(x)
    if n < 100
        return SVector{N,T}(x), SVector{N,T}(w), h
    else
        return x, w, h
    end
end

# no simd
@inline function integrate(f::S, xmin::T, xmax::T, x::AbstractVector{T},
                            w::AbstractVector{T}, h::T) where {T<:Real,S}
    @fastmath @inbounds begin
        Δx = (xmax - xmin) / 2
        x₀ = (xmax + xmin) / 2
        s = weight(zero(T)) * f(x₀)
        for i in 1:length(x)
            xp = x₀ + Δx * x[i]
            xm = x₀ - Δx * x[i]
            if xm > xmin
                s += w[i] * f(xm)
            end
            if xp < xmax
                s += w[i] * f(xp)
            end
        end
    end
    return Δx * h * s
end

# function using LoopVectorization
@inline function integrate_avx(f::S, xmin::T, xmax::T, x::AbstractVector{T},
                                w::AbstractVector{T}, h::T) where {T<:Real,S}
    @fastmath Δx = (xmax - xmin) / 2
    @fastmath x₀ = (xmax + xmin) / 2
    @fastmath s = weight(zero(T)) * f(x₀)
    @turbo for i in 1:length(x)
        Δxxi = Δx * x[i]
        xp = x₀ + Δxxi
        xm = x₀ - Δxxi
        s += w[i] * (f(xm) + f(xp))
    end
    return @fastmath Δx * h * s
end

## 2D
# here I only use the avx version for the inner call;
# if you call integrate_avx in the outer call (in the return statement),
# the warning occurs
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
                            x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return @inbounds integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return @inbounds integrate(x1 -> f2(x1), xmin[1], xmax[1], x, w, h)
end

x, w, h = tanhsinh(Float64, 40)
f(x, y) = x * y
lower = SVector{2,Float64}(0.0, 0.0)
upper = SVector{2,Float64}(1.0, 1.0)
integrate(f, lower, upper, x, w, h) # 0.2499999999999992
```
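(For reference, the exact value is ∫₀¹∫₀¹ xy dy dx = 1/4, so the result above is accurate to roughly 8e-16.)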
It works for me. I don't get the warning, and I see `@turbo` is getting used.
Did you change the second call to this?

```julia
## 2D
# here I only use the avx version for the inner call;
# if you call integrate_avx in the outer call (in the return statement),
# the warning occurs
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
                            x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return @inbounds integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return @inbounds integrate_avx(x1 -> f2(x1), xmin[1], xmax[1], x, w, h)
end
```
No; doing that, I do see the warning.
You can't nest `@turbo`s. Only one layer can SIMD. You could write out the loops manually and place a single `@turbo` on the outermost loop, and it'll decide what to do with all of them.
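A hedged sketch of what that could look like for the quadrature above, fusing both dimensions under one `@turbo` (this is my illustration, not tested against the MWE: it drops the centre-point terms and the boundary checks from the 1D routine for brevity, and assumes `f` is simple enough for `@turbo` to handle):

```julia
function integrate2d_avx(f, xmin::SVector{2,T}, xmax::SVector{2,T},
                         x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    Δx1 = (xmax[1] - xmin[1]) / 2; x₁ = (xmax[1] + xmin[1]) / 2
    Δx2 = (xmax[2] - xmin[2]) / 2; x₂ = (xmax[2] + xmin[2]) / 2
    s = zero(T)
    # One @turbo over both loops: LoopVectorization decides how to vectorize.
    @turbo for i in eachindex(x), j in eachindex(x)
        xim = x₁ - Δx1 * x[i]; xip = x₁ + Δx1 * x[i]
        xjm = x₂ - Δx2 * x[j]; xjp = x₂ + Δx2 * x[j]
        s += w[i] * w[j] * (f(xim, xjm) + f(xim, xjp) + f(xip, xjm) + f(xip, xjp))
    end
    return Δx1 * Δx2 * h * h * s
end
```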
Thanks a lot for the help!