Open fatteneder opened 1 year ago
My suggested workaround is
function testy(N, MDM, rhs, fx)
NN = N^2
rng = 1:NN
v_rhs = reshape(view(rhs, rng), (N,N))
v_fx = reshape(view(fx, rng), (N,N))
x = 1.0
@turbo for j = 1:N
for i = 1:N
rhsij = 0.0
for k = 1:N
# rhsij += MDM[i,k] * v_fx[k+(N-1)*j]
rhsij += MDM[i,k] * v_fx[k,j]
end
v_rhs[i,j] = rhsij * x
end
end
end
I haven't tested it, but I assume doing rhsij * x
instead will fix it.
Similarly, defining a new variable should work too:
function testy(N, MDM, rhs, fx)
NN = N^2
rng = 1:NN
v_rhs = reshape(view(rhs, rng), (N,N))
v_fx = reshape(view(fx, rng), (N,N))
x = 1.0
@turbo for j = 1:N
for i = 1:N
rhsij = 0.0
for k = 1:N
# rhsij += MDM[i,k] * v_fx[k+(N-1)*j]
rhsij += MDM[i,k] * v_fx[k,j]
end
rhsijx = rhsij * x
v_rhs[i,j] = rhsijx
end
end
end
Indeed, both versions work.
Is this a user problem, a current limitation of LV or just a bug?
I'd say it is just a bug, but I am personally unlikely to spend any time fixing it. My goal is still to replace LoopVectorization.jl eventually with a library that doesn't have this bug, but it is a slow process.
Its fine with me, since your suggested workaround does the job quite well.
Btw I encountered another issue when extending the above MWE by a second inner loop. Consider
using LoopVectorization
function testy(N, MDM, rhs, fx)
NN = N^2
rng = 1:NN
v_rhs = reshape(view(rhs, rng), (N,N))
v_fx = reshape(view(fx, rng), (N,N))
x = 1.0
@turbo for j = 1:N
for i = 1:N
rhsij = 0.0
for k = 1:N
rhsij += MDM[i,k] * v_fx[k,j]
end
for l = 1:N # changing this to k gives silently wrong results
rhsij += MDM[i,l] * v_fx[l,j]
end
v_rhs[i,j] = rhsij * x
end
end
end
N = 20
MDM = reshape(Float64.(collect(1:N^2)), (N,N))
rhs = deepcopy(MDM)
fx = deepcopy(MDM)
testy(N, MDM, rhs, fx)
display(sum(rhs))
Depending on whether you use k
for the index of both both inner loops or k
and l
for each the result differs.
I think the problem is independent of whether the bodies of the two loops agree, because in my real application I got my code working again after introducing the l
and the loop bodies are different there.
I wonder if this falls under this limitation from the README?
We expect that any time you use the @turbo macro with a given block of code that you:
...
-4. Are not using multiple loops at the same level in nested loops.
Btw I encountered another issue when extending the above MWE by a second inner loop. Consider
LoopVectorization doesn't support that. LoopModels will.
LoopVectorization.jl requires a loop chain from outer to inner, i.e. no trees. LoopModels supports trees and forests.
So I shouldn't rely on the code to work, although when using different indices k,l
, it gives me the correct results (for the values I tested)?
So I shouldn't rely on the code to work, although when using different indices
k,l
, it gives me the correct results (for the values I tested)?
I am surprised it happened to work at all. Maybe if you try different (larger) sizes, it'll give wrong results? Without checking, I'd have naively expected it to nest the 4 loops. It really should throw on that.
Maybe if you try different (larger) sizes, it'll give wrong results?
Indeed, for N >= 25
it starts to give wrong results compared to an implementation without @turbo
, interesting.
Luckily, in my application I can split the extra loop off into a separate mul!
and still get some benefit from using @turbo
.
Indeed, for N >= 25 it starts to give wrong results compared to an implementation without @turbo, interesting.
I was thinking it nested all the loops, meaning it made a chain of loops 4 deep, instead of a tree with 2 leaves. But that it managed to produce correct results anyway, due to unrolling aggressively enough that at least one loop didn't repeat when the trip count was low enough, so you happened to get correct answers.
Firstly, many thanks for this amazing package!
I managed to cook it down to the following MWE
gives
Changing one of the following makes it work though
@turbo
(obviously)@turbo
onto thei
loopN
to 7 or smallerrhsij += MDM[i,k] * v_fx[k+(N-1)*j]
rhsij *= x
from within thei
loopMy setup is
and using
LoopVectorization v0.12.165
.