JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License
728 stars 65 forks source link

Custom A*X .+ b returning wrong results #379

Open bartvanerp opened 2 years ago

bartvanerp commented 2 years ago

First of all, great package! I was trying to implement a slight alteration to the matrix multiplication example from the Readme.md:

function custom_gemm!(C::Matrix{T}, A::Matrix{T}, B::Matrix{T}, b::Vector{T}) where { T }
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = b[m]
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

where I only changed the original line Cmn = zero(eltype(C)) to Cmn = b[m]. Basically I want this function to return A*X .+ b. Although this change seems trivial in my eyes, the implementation gives:

julia> using LoopVectorization

julia> Y, A, X, b = randn(2,2), randn(2,2), randn(2,2), ones(2)

julia> A*X .+ b
2×2 Matrix{Float64}:
 1.44203   0.534377
 0.931009  1.14601

julia> custom_gemm!(Y, A, X, b)
2×2 Matrix{Float64}:
 1.0  1.0
 1.0  1.0

It seems that the inner for-loop is ignored and Cmn is only set to b[m]. If I slightly change the above function the code works again:

function custom_gemm2!(C::Matrix{T}, A::Matrix{T}, B::Matrix{T}, b::Vector{T}) where { T }
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(T) 
        Cmn += b[m]
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

julia> custom_gemm2!(Y, A, X, b)
2×2 Matrix{Float64}:
 1.44203   0.534377
 0.931009  1.14601

I am not sure what is causing this issue, but I can imagine that a lot of people might accidentally run into this.

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 12

LoopVectorization v0.12.101

chriselrod commented 2 years ago

I am not sure what is causing this issue

Basically, LoopVectorization parses the expression in a single pass. If you write

for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(T) 

then it records that Cmn depends on m and n. If you write

for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = b[m]

then it records that Cmn depends on m, because it doesn't know yet you're going to sum over it.

In hindsight, single-pass parsing that also deduces the structure/behavior of the loop was not a good idea. It'd be better to first build an analyzable internal representation, and then use this to derive meaning.

Long term, I am rewriting LoopVectorization, which will address that issue (among others).

Short term, the simplest fix would probably be to add an extra field to the Operation struct that indicates the loop level it was initialized at, that the reduction code can later use. Should be a simple fix.