JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

CartesianIndex and @turbo loop #359

Open klowrey opened 2 years ago

klowrey commented 2 years ago

LV has been awesome for effortlessly speeding up code! Thanks! I've recently run into some errors I don't know enough about to fix, reproduced below on version v0.12.96. They're modifications of the image-filtering example code. The errors seem a bit too specific for me to address myself, so I was hoping for help, or to report a bug if there is one.

In short, my project applies kernels at sparse, specific points rather than at every index of a matrix, so convolving the kernel only at specific CartesianIndexes seems elegant. The outerr failure seems to take issue with constructing the CartesianIndex itself, while the inerr failure is beyond me but may be related; maybe @turbo doesn't like the CartesianIndex being dynamically created? The nominal image-filtering example works without problems. In a previous version of LV, I had a block of code similar to the inerr function below working.

using LoopVectorization, OffsetArrays
using Images  # provides Images.Kernel.gaussian

x = rand(10,10)
kern = Images.Kernel.gaussian((1, 1), (3, 3))

function outerr(x, kern)
    r = size(x,1)
    s = zero(eltype(x))
    @turbo for i=2:r-1
        idx = CartesianIndex(i,i)
        for K in CartesianIndices(kern)
            s += x[idx+K] + kern[K]
        end
    end
    s
end

function inerr(x, kern)
    r = size(x,1)
    s = zero(eltype(x))
    for i=2:r-1
        idx = CartesianIndex(i,i)
        @turbo for K in CartesianIndices(kern)
            s += x[idx+K] + kern[K]
        end
    end
    s
end

julia> outerr(x, kern)
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
 [1] fieldcount(t::Any)
   @ Base ./reflection.jl:764
 [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, #unused#::Type{TypeVar})
   @ LoopVectorization ~/.julia/packages/LoopVectorization/umAIq/src/condense_loopset.jl:12
 [3] _append_fields!(t::Expr, body::Expr, sym::Symbol, #unused#::Type{UnionAll}) (repeats 3 times)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/umAIq/src/condense_loopset.jl:19
 [4] #s166#66
   @ ~/.julia/packages/LoopVectorization/umAIq/src/condense_loopset.jl:34 [inlined]
 [5] var"#s166#66"(T::Any, ::Any, r::Any)
   @ LoopVectorization ./none:0
 [6] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
   @ Core ./boot.jl:580
 [7] outerr(x::Matrix{Float64}, kern::OffsetMatrix{Float64, Matrix{Float64}})
   @ Main /tmp/lv.jl:11
 [8] top-level scope
   @ REPL[4]:1

julia> inerr(x, kern)
ERROR: MethodError: no method matching _gesp(::LayoutPointers.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}, ::Static.StaticInt{1}, ::CartesianIndex{2}, ::Static.StaticInt{1})
Closest candidates are:
  _gesp(::LayoutPointers.FastRange, ::Static.StaticInt{1}, ::Any, ::Static.StaticInt{1}) at ~/.julia/packages/LoopVectorization/umAIq/src/vectorizationbase_compat/subsetview.jl:38
  _gesp(::LayoutPointers.AbstractStridedPointer{T, N, C, B, R, X, O} where {C, B, R, X<:Tuple{Vararg{Integer, N}}, O<:Tuple{Vararg{Integer, N}}}, ::Static.StaticInt{I}, ::Integer, ::Static.StaticInt{D}) where {I, N, T, D} at ~/.julia/packages/LoopVectorization/umAIq/src/vectorizationbase_compat/subsetview.jl:39
Stacktrace:
 [1] inerr(x::Matrix{Float64}, kern::OffsetMatrix{Float64, Matrix{Float64}})
   @ Main /tmp/lv.jl:25
 [2] top-level scope
   @ REPL[5]:
chriselrod commented 2 years ago

maybe @turbo doesn't like the CartesianIndex being dynamically created?

Yes. @turbo actually doesn't like CartesianIndex at all. It has special handling for CartesianIndices, where it turns CartesianIndices{N} into N separate loops, thereby avoiding CartesianIndex altogether.

If someone is interested in fixing this, I can provide instructions/guidance.

Otherwise, this will have to wait for the LoopVectorization rewrite, which I'm trying to prioritize at the moment. Thus, I wouldn't expect it to be solved within the next half year or more; it's going to be a while.
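For reference, here is a minimal sketch of the pattern @turbo does handle (adapted from the image-filtering example in the README, reusing the x and kern defined above; the OffsetArray allocation for out is one way to keep A[I + J] in bounds):

function filter2davx!(out, A, kern)
    # every index comes straight from CartesianIndices, so @turbo can split the
    # Cartesian loops into separate integer loops and never builds a CartesianIndex
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end

out = OffsetArray(similar(x, size(x) .- 2), 1, 1)  # stay away from the edges of x
filter2davx!(out, x, kern)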

klowrey commented 2 years ago

Without using CartesianIndices, we can opt for plain loops instead. I experimented with accumulating the sum at different levels of the loop nest and came across what might be a bug in the ininsum function below. The major difference between the functions is where the sums are accumulated; ininsum accumulates at two different levels, which results in a total (outer-loop) sum of 0.0 when used with @turbo. The differences in the total sums are probably due to fastmath, but they can be surprisingly large sometimes.

using LoopVectorization, OffsetArrays, BenchmarkTools
using Images  # provides Images.Kernel.gaussian

x = rand(100,100)
kern = Images.Kernel.gaussian((1, 1), (3, 3))

function noturbosum(x, kern)
    ks = zero(eltype(x))
    @inbounds for i=2:size(x,1)-1
        for j in axes(kern,1), m in axes(kern,2)
            ks += x[i+j, i+m] * kern[j,m]
        end
    end
    ks
end

function outersum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        for j in axes(kern,1), m in axes(kern,2)
            ks += x[i+j, i+m] * kern[j,m]
        end
    end
    ks
end

function insum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        s1 = zero(eltype(x))
        for j in axes(kern,1), m in axes(kern,2)
            s1 += x[i+j, i+m] * kern[j,m]
        end
        ks += s1
    end
    ks
end

function in2sum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        for j in axes(kern,1)
            s0 = zero(eltype(x))
            for m in axes(kern,2)
                s0 += x[i+j, i+m] * kern[j,m]
            end
            ks += s0
        end
    end
    ks
end

function ininsum(x, kern)
    ks = zero(eltype(x))
    @turbo for i=2:size(x,1)-1
        s1 = zero(eltype(x))
        for j in axes(kern,1)
            s0 = zero(eltype(x))
            for m in axes(kern,2)
                s0 += x[i+j, i+m] * kern[j,m]
            end
            s1 += s0
        end
        ks += s1
    end
    ks
end

println("outer ", noturbosum(x, kern))
println("outer ", outersum(x, kern))
println("in    ", insum(x, kern))
println("in2   ", in2sum(x, kern))
println("inin  ", ininsum(x, kern))

display(@btime outersum($x, $kern))
display(@btime insum($x, $kern))
display(@btime in2sum($x, $kern))
display(@btime ininsum($x, $kern))

noturbo 49.06428455119616
outer   46.612923559459894
in      46.447873809237635
in2     46.447873809237635
inin    0.0
  3.379 μs (0 allocations: 0 bytes)
46.612923559459894
  302.369 ns (0 allocations: 0 bytes)
46.447873809237635
  168.696 ns (0 allocations: 0 bytes)
46.447873809237635
  23.303 ns (0 allocations: 0 bytes)
0.0

I probably can't dedicate time to fixing CartesianIndices, but I would love instructions/guidance on doing SIMD with dual numbers (I noticed the https://github.com/JuliaSIMD/SIMDDualNumbers.jl package and have been trying to grok it). For the functions I'm writing, I've found that manually using Dual numbers from ForwardDiff.jl, rather than going through its gradient/derivative interface, is much more performant, but I'm not sure how to use them with SIMD/LV, etc., and would love guidance!

chriselrod commented 2 years ago

The differences in the total sums are probably due to fastmath, but they can be surprisingly large sometimes.

Differences that big are definitely bugs. The bug in the first three was straightforward to fix, so I did. ininsum is trickier, so I'll leave that one to the rewrite as well. Adding @fastmath to noturbosum, I get:

julia> display(@btime noturbosum_nofm($x, $kern)) # no `@fastmath`, no `@turbo`
  965.667 ns (0 allocations: 0 bytes)
225.36596301915785

julia> display(@btime noturbosum($x, $kern)) # @fastmath, no @turbo
  991.838 ns (0 allocations: 0 bytes)
225.36596301915785

julia> display(@btime outersum($x, $kern))
  224.695 ns (0 allocations: 0 bytes)
225.365963019158

julia> display(@btime insum($x, $kern))
  378.662 ns (0 allocations: 0 bytes)
225.36596301915802

julia> display(@btime in2sum($x, $kern))
  273.545 ns (0 allocations: 0 bytes)
225.36596301915804

julia> display(@btime ininsum($x, $kern))
  125.409 ns (6 allocations: 304 bytes)
0.0

For the functions I'm writing, I've found that manually using Dual numbers from ForwardDiff.jl, rather than going through its gradient/derivative interface, is much more performant, but I'm not sure how to use them with SIMD/LV, etc., and would love guidance!

You should be able to solve multiple ForwardDiff.gradients in parallel. You can see some examples here: https://discourse.julialang.org/t/making-a-fast-evaluation-of-a-function-and-its-gradient/68884

Something I'd like to get around to eventually is supporting the use of VectorizationBase.Vecs for dual-number partials, to make sure they SIMD.
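In the meantime, one pattern that should SIMD today is to keep the values and the partials in separate arrays (structure of arrays), so the @turbo loop body only ever sees plain floats. A sketch with an illustrative f(x) = x^2 and its hand-written chain rule (none of these names come from your code):

using LoopVectorization

# v[i] holds the value of input i, d[i] the partial ∂v[i]/∂θ
function sum_f_and_derivative(v, d)
    s  = zero(eltype(v))   # accumulates Σ f(v[i])
    ds = zero(eltype(v))   # accumulates Σ f'(v[i]) * d[i]
    @turbo for i in eachindex(v)
        y  = v[i] * v[i]        # f(x) = x^2
        dy = 2 * v[i] * d[i]    # chain rule, written out by hand
        s  += y
        ds += dy
    end
    s, ds
end

sum_f_and_derivative(rand(100), ones(100))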

svretina commented 9 months ago

Sorry to revive this, but is it possible to give a hint about what is causing the first error message?

ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace: [etc]

I have a function with a single for loop which I want to call in a nested way:

@inline function integrate_avx(args) 
    @turbo for i in 1:length(x)
        # stuff
        s += w[i] * (f(xm) + f(xp)) # a reduction
    end
    return s
end

Then I want to write my 2D function based on the 1D one, namely:

function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        res = integrate_avx(g1, xmin[2], xmax[2], x, w, h)
        return res
    end
    g2(x1::T) where {T} = f1(x1)
    res = integrate_avx(g2, xmin[1], xmax[1], x, w, h)
    return res
end

This gives me

ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
  [1] fieldcount(t::Any)
    @ Base ./reflection.jl:895
  [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Core.Box})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:12
  [3] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{ScalarWave.TanhSinhQuadrature.var"#f1#104"{var"#77#78", SVector{2, Float64}, SVector{2, Float64}, Vector{Float64}, Vector{Float64}, Float64}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [4] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{ScalarWave.TanhSinhQuadrature.var"#g2#106"{ScalarWave.TanhSinhQuadrature.var"#f1#104"{var"#77#78", SVector{…}, SVector{…}, Vector{…}, Vector{…}, Float64}}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [5] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Tuple{LayoutPointers.GroupedStridedPointers{…}, Float64, Float64, ScalarWave.TanhSinhQuadrature.var"#g2#106"{…}, ScalarWave.TanhSinhQuadrature.var"#g2#106"{…}, DataType}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [6] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Tuple{Tuple{…}, Tuple{…}}})
    @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:19
  [7] #s223#66
    @ ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:34 [inlined]
  [8] var"#s223#66"(T::Any, ::Any, r::Any)
    @ LoopVectorization ./none:0
  [9] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
    @ Core ./boot.jl:602
 [10] integrate_avx
    @ ~/Codes/PhD/ScalarWave/src/TanhSinhQuadrature.jl:143 [inlined]
 [11] integrate_avx(f::var"#77#78", xmin::SVector{2, Float64}, xmax::SVector{2, Float64}, x::Vector{Float64}, w::Vector{Float64}, h::Float64)
    @ ScalarWave.TanhSinhQuadrature ~/Codes/PhD/ScalarWave/src/TanhSinhQuadrature.jl:173
 [12] top-level scope
    @ REPL[231]:1
 [13] top-level scope
    @ ~/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1428
Some type information was truncated. Use `show(err)` to see complete types.
chriselrod commented 9 months ago
ERROR: ArgumentError: type does not have a definite number of fields
Stacktrace:
  [1] fieldcount(t::Any)
    @ Base ./reflection.jl:895
  [2] _append_fields!(t::Expr, body::Expr, sym::Symbol, ::Type{Core.Box})

You have a Core.Box in your closure. Get rid of it.
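For illustration only (not your code): a captured variable that is reassigned after the closure is created gets wrapped in a Core.Box, which @turbo cannot handle:

function boxed(xs)
    res = 0.0
    g = i -> xs[i] + res   # g captures res...
    res = g(1)             # ...and res is reassigned afterwards, so it is stored as a Core.Box
    return res
end

function unboxed(xs)
    r0 = 0.0
    g = i -> xs[i] + r0    # r0 is never reassigned, so it is captured as a plain Float64
    return g(1)
end

In your case, res is assigned both in the outer method and inside the f1 closure, which is the same captured-and-reassigned pattern; returning the integrate_avx result directly avoids it.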

svretina commented 9 months ago

Thanks for your answer! Indeed, writing the closure a bit differently eliminates the above error. But the "outer" call is again not suitable for SIMD?

@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
    x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return integrate_avx(x1 -> f2(x1), xmin[1], xmax[1], x, w, h) # <- this doesn't pass the check_args
end
┌ Warning: #= /home/svretina/Codes/PhD/FastTanhSinhQuadrature.jl/src/FastTanhSinhQuadrature.jl:158 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ FastTanhSinhQuadrature ~/.julia/packages/LoopVectorization/7gWfp/src/condense_loopset.jl:1148
chriselrod commented 9 months ago

As an aside,

function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        res = integrate_avx(g1, xmin[2], xmax[2], x, w, h)
        return res
    end
    g2(x1::T) where {T} = f1(x1)
    res = integrate_avx(g2, xmin[1], xmax[1], x, w, h)
    return res
end

looks like it is going to put res in a Core.Box. I would simplify it to

function integrate_avx(f::Function, xmin::SVector{2,T}, xmax::SVector{2,T},
                       x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real}
    function f1(x1::T) where {T<:Real}
        g1(y::T) where {T} = f(x1, y)
        return integrate_avx(g1, xmin[2], xmax[2], x, w, h)
    end
    g2(x1::T) where {T} = f1(x1)
    return integrate_avx(g2, xmin[1], xmax[1], x, w, h)
end

to avoid that.

What are the types? Do you have a reproducer I can copy/paste?

You could try deving the package to modify the check functions to print the types when returning false, so you know what the particular problematic example was. https://github.com/JuliaSIMD/LoopVectorization.jl/blob/3fbe248d3d7c2d4fc7c55dd85890b8cfec83be1d/src/condense_loopset.jl#L933-L977
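A quicker first check (assuming the interface hasn't changed) is to call LoopVectorization.check_args directly on the arrays you pass into the loop; it returns false for array arguments @turbo would fall back on. It won't catch problems with the function called in the loop body, but it quickly rules the arrays in or out:

julia> using LoopVectorization

julia> LoopVectorization.check_args(rand(4), rand(4))   # plain Vectors are supported
true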

svretina commented 9 months ago

The code is from here, but here is an MWE:

using StaticArrays
using LoopVectorization

@inline function ordinate(t::T) where {T<:Real}
    return @fastmath tanh(T(π) / 2 * sinh(t))
end
@inline function weight(t::T) where {T<:Real}
    @fastmath tmp = cosh(T(π) / 2 * sinh(t))
    return @fastmath ((T(π) / 2) * cosh(t)) / (tmp * tmp)
end

@inline function inv_ordinate(t::T) where {T<:Real}
    return @fastmath asinh(log((one(T) + t) / (one(T) - t)) / T(π))
end

function tanhsinh(::Type{T}, n::Int) where {T<:AbstractFloat}
    tmax = inv_ordinate(prevfloat(one(T)))
    h = tmax / n
    t = h:h:tmax
    x = ordinate.(t)
    w = weight.(t)
    N = length(x)
    if n < 100
        return SVector{N,T}(x), SVector{N,T}(w), h
    else
        return x, w, h
    end
end

# no simd 
@inline function integrate(f::S, xmin::T, xmax::T, x::AbstractVector{T},
    w::AbstractVector{T}, h::T) where {T<:Real,S}
    @fastmath @inbounds begin
        Δx = (xmax - xmin) / 2
        x₀ = (xmax + xmin) / 2
        s = weight(zero(T)) * f(x₀)
        for i in 1:length(x)
            xp = x₀ + Δx * x[i]
            xm = x₀ - Δx * x[i]
            if xm > xmin
                s += w[i] * f(xm)
            end
            if xp < xmax
                s += w[i] * f(xp)
            end
        end
    end
    return Δx * h * s
end

# function using loopvectorization 
@inline function integrate_avx(f::S, xmin::T, xmax::T, x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @fastmath Δx = (xmax - xmin) / 2
    @fastmath x₀ = (xmax + xmin) / 2
    @fastmath s = weight(zero(T)) * f(x₀)
    @turbo for i in 1:length(x)
        Δxxi = Δx * x[i]
        xp = x₀ + Δxxi
        xm = x₀ - Δxxi
        s += w[i] * (f(xm) + f(xp))
    end
    return @fastmath Δx * h * s
end

## 2D
# here I only use the avx version for the inner call;
# if you call integrate_avx in the outer call (in the return statement),
# the warning occurs
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
    x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return @inbounds integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return @inbounds integrate(x1 -> f2(x1), xmin[1], xmax[1], x, w, h)
end

x,w,h = tanhsinh(Float64, 40)
f(x,y) = x*y
lower = SVector{2,Float64}(0.0,0.0)
upper = SVector{2, Float64}(1.0,1.0)
integrate(f, lower, upper, x,w,h) # 0.2499999999999992
chriselrod commented 9 months ago

It works for me. I don't get the warning, and I see @turbo is getting used.

svretina commented 9 months ago

did you change the second call to

## 2D
# unlike above, integrate_avx is now also used for the outer call
# (in the return statement); this is the version that triggers the warning
@inline function integrate(f::S, xmin::SVector{2,T}, xmax::SVector{2,T},
    x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {T<:Real,S}
    @inline function f2(x1::T) where {T<:Real}
        return @inbounds integrate_avx(y -> f(x1, y), xmin[2], xmax[2], x, w, h)
    end
    return @inbounds integrate_avx(x1 -> f2(x1), xmin[1], xmax[1], x, w, h)
end
chriselrod commented 9 months ago

No. Doing that, I do see the warning. You can't nest @turbos; only one layer can SIMD. You could write out the loops manually and place a single @turbo on the outermost loop, and it'll decide what to do with all of the loops.
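For the MWE above, that would look roughly like the sketch below (names follow the MWE, which already loads StaticArrays and LoopVectorization; the boundary checks and the centre-point weight(zero(T)) * f(x₀) term from the scalar version are omitted, so this only illustrates the loop structure rather than being a drop-in replacement):

function integrate2d_turbo(f::F, xmin::SVector{2,T}, xmax::SVector{2,T},
                           x::AbstractVector{T}, w::AbstractVector{T}, h::T) where {F,T<:Real}
    Δx1 = (xmax[1] - xmin[1]) / 2; x01 = (xmax[1] + xmin[1]) / 2
    Δx2 = (xmax[2] - xmin[2]) / 2; x02 = (xmax[2] + xmin[2]) / 2
    s = zero(T)
    @turbo for i in 1:length(x)        # a single @turbo on the outermost loop
        for j in 1:length(x)           # the macro decides what to do with the inner loop
            xp = x01 + Δx1 * x[i]
            xm = x01 - Δx1 * x[i]
            yp = x02 + Δx2 * x[j]
            ym = x02 - Δx2 * x[j]
            s += w[i] * w[j] * (f(xm, ym) + f(xm, yp) + f(xp, ym) + f(xp, yp))
        end
    end
    return Δx1 * Δx2 * h * h * s
end

Whether this actually beats nesting the 1D calls with @turbo only on the inner one (as in the MWE) is worth benchmarking.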

svretina commented 9 months ago

Thanks a lot for the help!