JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License
743 stars 66 forks source link

"Reduction not found." error with `ifelse` #243

Open SkyWorld117 opened 3 years ago

SkyWorld117 commented 3 years ago
julia> using LoopVectorization

julia> m = [1,2,3,4,5,6,7,8]
8-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6
 7
 8

julia> s = -Inf
-Inf

julia> @avx for i in eachindex(m)
           s = ifelse(m[i]>=s, m[i], s)
       end
ERROR: LoadError: "Reduction not found."
Stacktrace:
 [1] reduction_zero at C:\Users\89639\.julia\packages\LoopVectorization\O8WW6\src\modeling\costs.jl:469 [inlined]
 [2] add_reduction_update_parent!(::Array{LoopVectorization.Operation,1}, ::Array{Symbol,1}, ::Array{Symbol,1}, ::LoopVectorization.LoopSet, ::LoopVectorization.Operation, ::LoopVectorization.Instruction, ::Int64, ::Int64) at C:\Users\89639\.julia\packages\LoopVectorization\O8WW6\src\parse\add_compute.jl:188
in expression starting at REPL[4]:1
chriselrod commented 3 years ago

Yeah, I should add ifelse to the known reductions. Reductions have to be known to correctly initialize 0s for separate accumulators.

To SIMD a reduction, you need to do multiple reductions in parallel. Sometimes, that's possible because there are multiple in parallel you can do, e.g. you can do multiple j iterations in parallel:

for j in axes(m,2)
    s = -Inf
    for i in axes(m,1)
        s = ifelse(m[i]>=s, m[i,j], s)
    end
    x[j] = s
end

This also won't work at the moment, but there is no reason it shouldn't work even for unknown reductions: just mark the i loop as non-reorderable, and SIMD the j loop. I'll do that in the LV rewrite.

However, given just one loop:

for i in eachindex(m)
    s = ifelse(m[i]>=s, m[i], s)
end

doing multiple in parallel requires knowing what 0 is with respect to the reduction. With ifelse, 0 is whatever s was initialized as.

With +, it is 0.0. Someone could write

s = 3.4
for i in eachindex(m)
    s += x[i]
end

and LoopVectorization will create a bunch of accumulators initialized to 0.0, sum up x in parallel, combine these accumulators into a single scalar, and then finally add 3.4. My point here is that it has to know 0.0 is "zero", because if it initialized each accumulator to 3.4, once it combines the accumulators, the answer will be wrong (i.e., it'll be too large by 3.4*(number of accumulators - 1)).

This is why there is a "Reduction not found" error/why LoopVectorization needs to know what the reduction is. I'll add ifelse soonᵀᴹ.