Open baggepinnen opened 4 years ago

I can't figure out what's wrong with the expression below. I've tried replacing `CartesianIndices` with `eachindex`, but the error remains. The error appears to come from the line `c += d*(exp(-d/β + lu[I] + lv[J]))`, and it goes away if that line is removed. Full error: (stack trace omitted).

A little odd that removing the line `c += d*(exp(-d/β + lu[I] + lv[J]))` makes the error go away.
The problematic lines are:

```julia
t = (I-J).I
d = (t[1]^2+t[2]^2)
```

It works if I remove them:
```julia
julia> function ot_cost1(u,K,v,β=1)
           c = 0.0
           lu = log.(u)
           lv = log.(v); d = 1.0
           @avx for I in CartesianIndices(u)
               for J in CartesianIndices(v)
                   # t = (I-J).I
                   # d = (t[1]^2+t[2]^2)
                   c += d*(exp(-d/β + lu[I] + lv[J]))
               end
           end
           c
       end
ot_cost1 (generic function with 2 methods)

julia> u = rand(100,100); v = rand(100,100);

julia> ot_cost1(u, nothing, v)
9.319208900261257e6
```
```julia
julia> function ot_cost2(u,K,v,β=1)
           c = 0.0
           lu = log.(u)
           lv = log.(v); d = 1.0
           for I in CartesianIndices(u)
               for J in CartesianIndices(v)
                   # t = (I-J).I
                   # d = (t[1]^2+t[2]^2)
                   c += d*(exp(-d/β + lu[I] + lv[J]))
               end
           end
           c
       end
ot_cost2 (generic function with 2 methods)

julia> ot_cost2(u, nothing, v) # no @avx
9.319208900262414e6

julia> ot_cost1(u, nothing, v) ≈ ans
true
```
The reason these lines are problematic is that `@avx` expands `CartesianIndices{N}` into `N` loops, so `I` and `J` each get replaced with multiple loop indices. Every operation aside from indexing is also expanded. So, for example, `I - J` turns into

```julia
op1 = I_1 - J_1
op2 = I_2 - J_2
```

The pieces stay split until they show up in an index. Meaning, if `I` and `J` are `CartesianIndex{2}`s, this:

```julia
B[I] = exp(A[I + J + 3])
```

would turn into

```julia
op1 = I_1 + J_1
op2 = I_2 + J_2
op3 = op1 + 3
op4 = op2 + 3
B[I_1, I_2] = exp(A[op3, op4])
```
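For intuition, here is a sketch (my paraphrase, not LoopVectorization's actual generated code) of how the loop itself is split when `u` is a matrix:

```julia
# For a 2-d array u, a loop like
for I in CartesianIndices(u)
    # body uses I
end
# behaves under @avx roughly as if it were written
for I_2 in axes(u, 2), I_1 in axes(u, 1)
    # body sees I_1 and I_2 as separate loop indices
end
```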
So (without modifying LoopVectorization), an awful hack to work around this would be to define an `AbstractRange` type whose `vload` (which `getindex` gets translated into) does what we want. For now, it has to be an `AbstractRange`, since those are passed through without LoopVectorization trying to take a pointer to them.
```julia
using SIMDPirates

# Dummy "range" whose loads compute the squared Euclidean norm of the index.
struct SumAbs2 <: AbstractRange{Float64} end

# Scalar index, last dimension: convert and square.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:Integer})
    i1 = Float64(I[1]); vabs2(i1)
end
# Vector index, last dimension: square a vector of consecutive indices.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:_MM{W}}) where {W}
    vabs2(vconvert(SVec{W,Float64}, SIMDPirates.svrange(I[1])))
end
# Scalar index, leading dimensions: square and recurse on the tail.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:Integer,Vararg})
    i1 = Float64(I[1]); vabs2(i1) + vload(SumAbs2(), Base.tail(I))
end
# Vector index, leading dimensions: square and recurse on the tail.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:_MM{W},Vararg}) where {W}
    vadd(vabs2(vconvert(SVec{W,Float64}, SIMDPirates.svrange(I[1]))), vload(SumAbs2(), Base.tail(I)))
end
# Drop any trailing argument (e.g. a mask) and fall back to the two-arg methods.
@inline SIMDPirates.vload(::SumAbs2, I::Tuple, ::Any) = vload(SumAbs2(), I)
```
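As a quick sanity check of the scalar methods (my example, not from the issue; outside `@avx`, `vload` can be called on a plain integer tuple):

```julia
julia> vload(SumAbs2(), (3, 4))  # 3^2 + 4^2
25.0
```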
I convert to floating point immediately because the conversion has to happen at some point, and many floating-point operations (multiplication in particular) are much faster. `Float64` represents integers exactly up to 2^53, so you're probably not going to lose any accuracy.
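A quick check of that exactness boundary (my example):

```julia
julia> Float64(2^53) == 2^53          # 2^53 = 9007199254740992 is representable
true

julia> Float64(2^53 + 1) == 2^53 + 1  # 2^53 + 1 rounds back down to 2^53
false
```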
Now, on LoopVectorization 0.6.25:
```julia
function ot_cost(u,K,v,β=1)
    c = 0.0
    lu = log.(u)
    lv = log.(v)
    sa2 = SumAbs2()
    @avx for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            d = sa2[I - J]
            c += d*(exp(-d/β + lu[I] + lv[J]))
        end
    end
    c
end

function ot_cost_noavx(u,K,v,β=1)
    c = 0.0
    lu = log.(u)
    lv = log.(v)
    for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            t = (I-J).I
            d = (t[1]^2+t[2]^2)
            c += d*(exp(-d/β + lu[I] + lv[J]))
        end
    end
    c
end
```
This yields:

```julia
julia> u = rand(100,100); v = rand(100,100);

julia> ot_cost_noavx(u, nothing, v) ≈ ot_cost(u, nothing, v)
true
```
However, there's a lot of performance left on the table. When I initially benchmarked this, the `@avx` version was slower on my computer. That was because, on Linux, SLEEFPirates tries to use GLIBC's `exp` implementation (in `libmvec.so`). This implementation is normally very fast, but it slows to a crawl on tiny (subnormal?) inputs. If you're on a modern Linux system with a recent version of GLIBC, you can specifically avoid their `exp` implementation and use an `llvmcall`-based implementation instead via:
```julia
using SIMDPirates: vexp

function ot_cost(u,K,v,β=1)
    c = 0.0
    lu = log.(u)
    lv = log.(v)
    sa2 = SumAbs2()
    @avx for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            d = sa2[I - J]
            c += d*(vexp(-d/β + lu[I] + lv[J]))
        end
    end
    c
end

Base.set_zero_subnormals(true)
```
I also call `set_zero_subnormals(true)` above. Now I get (with AVX512):
```julia
julia> @btime ot_cost($u, nothing, $v)
  104.007 ms (4 allocations: 156.41 KiB)
7616.182853919283

julia> @btime ot_cost_noavx($u, nothing, $v)
  746.496 ms (4 allocations: 156.41 KiB)
7616.182853915762
```
If you're curious which version is more accurate, it seems the `@avx` version is closer here (SIMD reductions accumulate in multiple lanes, which tends to reduce rounding error relative to a single sequential accumulator):

```julia
julia> ot_cost_noavx(big.(u), nothing, big.(v))
7616.182853920148494020607720345210078382764313274363557253094398782082383279237

julia> Float64(ans)
7616.182853920149
```
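To put numbers on the comparison, one could compute each version's absolute error against the `BigFloat` reference (a sketch; outputs not shown):

```julia
ref = ot_cost_noavx(big.(u), nothing, big.(v))   # high-precision reference
err_avx   = abs(Float64(ot_cost(u, nothing, v) - ref))
err_noavx = abs(Float64(ot_cost_noavx(u, nothing, v) - ref))
```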
Another piece of low-hanging fruit is that we can add `@avx` to the `log` calls:
```julia
julia> function ot_cost(u,K,v,β=1)
           c = 0.0
           lu = @avx log.(u)
           lv = @avx log.(v)
           sa2 = SumAbs2()
           @avx for I in CartesianIndices(u)
               for J in CartesianIndices(v)
                   d = sa2[I - J]
                   c += d*(vexp(-d/β + lu[I] + lv[J]))
               end
           end
           c
       end
ot_cost (generic function with 2 methods)

julia> @btime ot_cost($u, nothing, $v)
  99.545 ms (4 allocations: 156.41 KiB)
7616.182853919284
```
That didn't do much, because we're only taking the log of 2 * 100^2 = 20,000 numbers, versus the main loop, which covers 100^4 = 100,000,000 iterations. A last idea for speeding it up is that multiplication is faster than division:
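The variant benchmarked next isn't shown; presumably it replaces the division by `β` with multiplication by a hoisted reciprocal, something like this sketch (the name `βinv` is mine):

```julia
function ot_cost(u,K,v,β=1)
    c = 0.0
    lu = @avx log.(u)
    lv = @avx log.(v)
    sa2 = SumAbs2()
    βinv = inv(β)  # one division up front instead of one per iteration
    @avx for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            d = sa2[I - J]
            c += d*(vexp(-d*βinv + lu[I] + lv[J]))
        end
    end
    c
end
```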
```julia
julia> @btime ot_cost($u, nothing, $v)
  100.655 ms (4 allocations: 156.41 KiB)
7616.182853919284
```
Of course, division is still a lot faster than `exp`, so I guess that didn't matter.

Have any suggestions for a better API for working with `CartesianIndices` directly? The ideal would be that anything that works without `@avx` also works with it; maybe we can come up with something simple.