Open baggepinnen opened 4 years ago

I can't figure out what's wrong with the expression below. I've tried replacing `CartesianIndices` with `eachindex`, but the error remains. The error appears to come from the line `c += d*(exp(-d/β + lu[I] + lv[J]))`, and it goes away if that line is removed. Full error: (stack trace omitted).

A little odd that removing the line `c += d*(exp(-d/β + lu[I] + lv[J]))` makes the error go away.
The problematic lines are:

```julia
t = (I-J).I
d = (t[1]^2+t[2]^2)
```

It works if I remove them:
```julia
julia> function ot_cost1(u,K,v,β=1)
           c = 0.0
           lu = log.(u)
           lv = log.(v); d = 1.0
           @avx for I in CartesianIndices(u)
               for J in CartesianIndices(v)
                   # t = (I-J).I
                   # d = (t[1]^2+t[2]^2)
                   c += d*(exp(-d/β + lu[I] + lv[J]))
               end
           end
           c
       end
ot_cost1 (generic function with 2 methods)

julia> u = rand(100,100); v = rand(100,100);

julia> ot_cost1(u, nothing, v)
9.319208900261257e6
```
```julia
julia> function ot_cost2(u,K,v,β=1)
           c = 0.0
           lu = log.(u)
           lv = log.(v); d = 1.0
           for I in CartesianIndices(u)
               for J in CartesianIndices(v)
                   # t = (I-J).I
                   # d = (t[1]^2+t[2]^2)
                   c += d*(exp(-d/β + lu[I] + lv[J]))
               end
           end
           c
       end
ot_cost2 (generic function with 2 methods)

julia> ot_cost2(u, nothing, v) # no @avx
9.319208900262414e6

julia> ot_cost1(u, nothing, v) ≈ ans
true
```
The reason these lines are problematic is that `@avx` expands `CartesianIndices{N}` into `N` loops, so `I` and `J` each get replaced with multiple loop indices. Every operation aside from indexing is also expanded. So, for example, `I - J` turns into

```julia
op1 = I_1 - J_1
op2 = I_2 - J_2
```

The pieces stay split until they show up in an index. Meaning, if `I` and `J` are `CartesianIndex{2}`s, this:

```julia
B[I] = exp(A[I + J + 3])
```

would turn into

```julia
op1 = I_1 + J_1
op2 = I_2 + J_2
op3 = op1 + 3
op4 = op2 + 3
B[I_1, I_2] = exp(A[op3, op4])
```
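For intuition, here is a sketch (my paraphrase, not LoopVectorization's actual generated code) of how the loop itself is split when `u` is a matrix:

```julia
# For a 2-d array u, a loop like
for I in CartesianIndices(u)
    # body uses I
end
# behaves under @avx roughly as if it were written
for I_2 in axes(u, 2), I_1 in axes(u, 1)
    # body sees I_1 and I_2 as separate loop indices
end
```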
So (without modifying LoopVectorization), an awful hack to work around this would be to define an `AbstractRange` type whose `vload` (which `getindex` gets translated into) does what we want. For now, it has to be an `AbstractRange`, since those are passed through without LoopVectorization trying to take a pointer to them.
```julia
using SIMDPirates

# Dummy "range" whose loads compute the squared Euclidean norm of the index.
struct SumAbs2 <: AbstractRange{Float64} end

# Scalar index, last dimension: convert and square.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:Integer})
    i1 = Float64(I[1]); vabs2(i1)
end
# Vector index, last dimension: square a vector of consecutive indices.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:_MM{W}}) where {W}
    vabs2(vconvert(SVec{W,Float64}, SIMDPirates.svrange(I[1])))
end
# Scalar index, leading dimensions: square and recurse on the tail.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:Integer,Vararg})
    i1 = Float64(I[1]); vabs2(i1) + vload(SumAbs2(), Base.tail(I))
end
# Vector index, leading dimensions: square and recurse on the tail.
@inline function SIMDPirates.vload(::SumAbs2, I::Tuple{<:_MM{W},Vararg}) where {W}
    vadd(vabs2(vconvert(SVec{W,Float64}, SIMDPirates.svrange(I[1]))), vload(SumAbs2(), Base.tail(I)))
end
# Drop any trailing argument (e.g. a mask) and fall back to the two-arg methods.
@inline SIMDPirates.vload(::SumAbs2, I::Tuple, ::Any) = vload(SumAbs2(), I)
```
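As a quick sanity check of the scalar methods (my example, not from the issue; outside `@avx`, `vload` can be called on a plain integer tuple):

```julia
julia> vload(SumAbs2(), (3, 4))  # 3^2 + 4^2
25.0
```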
I convert to floating point immediately because the conversion has to happen at some point, and many floating-point operations (multiplication in particular) are much faster. `Float64` represents integers exactly up to 2^53, so you're probably not going to lose any accuracy.
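A quick check of that exactness boundary (my example):

```julia
julia> Float64(2^53) == 2^53          # 2^53 = 9007199254740992 is representable
true

julia> Float64(2^53 + 1) == 2^53 + 1  # 2^53 + 1 rounds back down to 2^53
false
```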
Now, on LoopVectorization 0.6.25:
```julia
function ot_cost(u,K,v,β=1)
    c = 0.0
    lu = log.(u)
    lv = log.(v)
    sa2 = SumAbs2()
    @avx for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            d = sa2[I - J]
            c += d*(exp(-d/β + lu[I] + lv[J]))
        end
    end
    c
end

function ot_cost_noavx(u,K,v,β=1)
    c = 0.0
    lu = log.(u)
    lv = log.(v)
    for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            t = (I-J).I
            d = (t[1]^2+t[2]^2)
            c += d*(exp(-d/β + lu[I] + lv[J]))
        end
    end
    c
end
```
This yields:

```julia
julia> u = rand(100,100); v = rand(100,100);

julia> ot_cost_noavx(u, nothing, v) ≈ ot_cost(u, nothing, v)
true
```
However, there's a lot of performance left on the table. When I initially benchmarked this, the `@avx` version was slower on my computer. That was because, on Linux, SLEEFPirates tries to use GLIBC's `exp` implementation (in `libmvec.so`). This implementation is normally very fast, but it slows to a crawl on tiny (subnormal?) inputs. If you're on a modern Linux system with a recent version of GLIBC, you can specifically avoid their `exp` implementation and use an `llvmcall`-based implementation instead via:
```julia
using SIMDPirates: vexp

function ot_cost(u,K,v,β=1)
    c = 0.0
    lu = log.(u)
    lv = log.(v)
    sa2 = SumAbs2()
    @avx for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            d = sa2[I - J]
            c += d*(vexp(-d/β + lu[I] + lv[J]))
        end
    end
    c
end

Base.set_zero_subnormals(true)
```
I also call `set_zero_subnormals(true)` above. Now I get (with AVX512):
```julia
julia> @btime ot_cost($u, nothing, $v)
  104.007 ms (4 allocations: 156.41 KiB)
7616.182853919283

julia> @btime ot_cost_noavx($u, nothing, $v)
  746.496 ms (4 allocations: 156.41 KiB)
7616.182853915762
```
If you're curious which version is more accurate, it seems the `@avx` version is closer here (SIMD reductions accumulate in multiple lanes, which tends to reduce rounding error relative to a single sequential accumulator):

```julia
julia> ot_cost_noavx(big.(u), nothing, big.(v))
7616.182853920148494020607720345210078382764313274363557253094398782082383279237

julia> Float64(ans)
7616.182853920149
```
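To put numbers on the comparison, one could compute each version's absolute error against the `BigFloat` reference (a sketch; outputs not shown):

```julia
ref = ot_cost_noavx(big.(u), nothing, big.(v))   # high-precision reference
err_avx   = abs(Float64(ot_cost(u, nothing, v) - ref))
err_noavx = abs(Float64(ot_cost_noavx(u, nothing, v) - ref))
```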
Another piece of low-hanging fruit is that we can add `@avx` to the `log` calls:
```julia
julia> function ot_cost(u,K,v,β=1)
           c = 0.0
           lu = @avx log.(u)
           lv = @avx log.(v)
           sa2 = SumAbs2()
           @avx for I in CartesianIndices(u)
               for J in CartesianIndices(v)
                   d = sa2[I - J]
                   c += d*(vexp(-d/β + lu[I] + lv[J]))
               end
           end
           c
       end
ot_cost (generic function with 2 methods)

julia> @btime ot_cost($u, nothing, $v)
  99.545 ms (4 allocations: 156.41 KiB)
7616.182853919284
```
That didn't do much, because we're only taking the log of 2 * 100^2 = 20,000 numbers, versus the main loop, which covers 100^4 = 100,000,000 iterations. A last idea for speeding it up is that multiplication is faster than division:
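The variant benchmarked next isn't shown; presumably it replaces the division by `β` with multiplication by a hoisted reciprocal, something like this sketch (the name `βinv` is mine):

```julia
function ot_cost(u,K,v,β=1)
    c = 0.0
    lu = @avx log.(u)
    lv = @avx log.(v)
    sa2 = SumAbs2()
    βinv = inv(β)  # one division up front instead of one per iteration
    @avx for I in CartesianIndices(u)
        for J in CartesianIndices(v)
            d = sa2[I - J]
            c += d*(vexp(-d*βinv + lu[I] + lv[J]))
        end
    end
    c
end
```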
```julia
julia> @btime ot_cost($u, nothing, $v)
  100.655 ms (4 allocations: 156.41 KiB)
7616.182853919284
```
Of course, division is still a lot faster than `exp`, so I guess that didn't matter.

Have any suggestions for a better API for working with `CartesianIndices` directly? The ideal would be that anything that works without `@avx` also works with it; maybe we can come up with something simple.