JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

@avx in subspace matrix multiplication is much slower than StaticArrays #169

Open Roger-luo opened 3 years ago

Roger-luo commented 3 years ago

I was trying to use @avx in the following subspace matrix multiplication code, but to my surprise, LoopVectorization is significantly slower than StaticArrays. I'm not sure if I'm doing it correctly.

The subspace matrix multiplication can be defined with the following Julia code:

function subspace_smm!(st::AbstractMatrix, indices, U::AbstractMatrix, subspace)
    @inbounds @views for k in subspace
        idx = k .+ indices
        for b in 1:size(st, 2)
            st[idx, b] = U * st[idx, b]
        end
    end
    return st
end

where subspace and indices are sets of integers that describe the subspace, and U is a matrix in the subspace; the matrix multiplication happens inside the subspace. An example of how to use the above function:

using StaticArrays
using BenchmarkTools

n = 3
st = rand(ComplexF64, 1 << 8, 100)
y = similar(st, (1 << n, 100))
U = rand(ComplexF64, 1 << n, 1 << n)
SU = @SMatrix rand(ComplexF64, 1 << n, 1 << n)
indices = [1, 2, 5, 6, 9, 10, 13, 14]
subspace = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242]

julia> @benchmark subspace_smm!($(copy(st)), $indices, $SU, $subspace)
BenchmarkTools.Trial: 
  memory estimate:  4.50 KiB
  allocs estimate:  32
  --------------
  minimum time:     89.259 μs (0.00% GC)
  median time:      115.608 μs (0.00% GC)
  mean time:        115.988 μs (0.10% GC)
  maximum time:     1.334 ms (91.20% GC)
  --------------
  samples:          10000
  evals/sample:     1

Now, if we implement the above matrix multiplication with plain for loops and apply @avx, it can be written as follows:

function subspace_mm!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @avx for k in subspace
        for b in 1:size(st, 2)
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end

            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end

However, this becomes significantly slower than the previous naive version using StaticArrays:

julia> y = similar(st, (1 << n, 100))

julia> @benchmark subspace_mm!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     203.286 μs (0.00% GC)
  median time:      213.306 μs (0.00% GC)
  mean time:        213.945 μs (0.00% GC)
  maximum time:     245.486 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

y above is temporary memory to store the result of each subspace matrix multiplication, so we can assign it back to st. I'm testing this on a Ryzen:

julia> versioninfo()
Julia Version 1.6.0-DEV.1699
Commit 8a5ac94e96* (2020-12-08 03:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver2)
Environment:
  JULIA_EDITOR = "/home/roger/.vscode-server/bin/e5a624b788d92b8d34d1392e4c4d9789406efe8f/bin/code"
  JULIA_NUM_THREADS = 

I'm not sure if it would be different on a machine with AVX512 available, but I guess it's probably not an AVX512 issue, since StaticArrays is much faster on the same machine?

Roger-luo commented 3 years ago

And for reference, if we replace the @avx with just @inbounds and let Julia handle everything, it is a bit faster than @avx:

julia> @benchmark BQCESubroutine.subspace_mm!($(copy(st)), $indices, $SU, $temp_st, $subspace)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     185.677 μs (0.00% GC)
  median time:      198.357 μs (0.00% GC)
  mean time:        198.696 μs (0.00% GC)
  maximum time:     217.576 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
chriselrod commented 3 years ago

This problem actually hits 5 current limitations in LoopVectorization, which is quite impressive for such simple-looking loops.

First of all,

julia> typeof.((st, indices, SU, y, subspace))
(Matrix{ComplexF64}, Vector{Int64}, SMatrix{8, 8, ComplexF64, 64}, Matrix{ComplexF64}, Vector{Int64})

julia> LoopVectorization.check_args(st, indices, SU, y, subspace)
false

This means that LoopVectorization doesn't support the types of arguments you're using. Instead of throwing an error, it applies @inbounds @fastmath and hopes for the best. It returns false because of two current limitations:

  1. Only supporting primitive element types: Union{Bool, Base.HWReal}. This will probably be solved when the code parsing/LoopSet construction is made to take advantage of the compiler's tools, so that it can run optimization passes until operations on ComplexF64 are deconstructed into operations on individual Float64.
  2. StaticArrays.SArrays aren't supported; StaticArrays.MArrays are (see the snippet after this list). The problem with the former is that I can't get a pointer. I've played around with things like Ref(A::SArray), e.g. in the documentation, where I could speed up SMatrix multiplication following that approach, but it still lagged well behind just starting with MArrays in the first place, because the compiler will explicitly copy all the memory for every Ref, even when it is stack allocated, the non-mutated memory is marked as invariant, and limited lifetimes are marked; I don't know why that doesn't fix it. I've also considered trying to wrap them in a struct to emulate the behavior of a StridedPointer, i.e. create a SIMD vector via repeated indexing. That will probably work better in many cases for regular vector loads, but I haven't been able to get the compiler to emit things like masked loads this way. Of these limitations, this would be the easiest to fix, but as performance would still be lacking compared to the alternatives, I haven't put the time in. If you want to maximize performance, sticking with the mutable arrays is better. As I demo here, you can make mutable arrays safe, stack allocated, and still pass them through non-inlined function boundaries. Having to use PaddedMatrices.@gc_preserve may not be the most ergonomic thing (and PaddedMatrices itself is lacking many features compared to StaticArrays, i.e. basically everything except matmul at the moment [and it also doesn't support complex numbers], but I'll at least add things like linear algebra functionality once I address some LoopVectorization limitations).
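
A quick way to see that second limitation (a minimal REPL sketch, assuming the StaticArrays setup from above):

julia> using StaticArrays, LoopVectorization

julia> S = @SMatrix rand(4, 4);

julia> LoopVectorization.check_args(S) # immutable, so no pointer can be taken
false

julia> LoopVectorization.check_args(MMatrix(S)) # mutable, so supported
true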

LoopVectorization.check_args looks for these limitations, and returns true/false depending on whether the arrays are compatible. If false, it'll use a fallback loop which is just the original loop with @inbounds @fastmath.

Which actually makes this a little surprising:

and for reference, if we replace the @avx with just @inbounds, and let Julia handle everything, it is a bit faster than @avx

Because that suggests @inbounds @fastmath is slower than just @inbounds on your machine. I see roughly equivalent times:

function subspace_mm!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @avx for k in subspace
        for b in 1:size(st, 2)
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end

            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end
function subspace_mm_inbounds!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @inbounds for k in subspace
        for b in 1:size(st, 2)
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end

            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end
function subspace_mm_inbounds_fastmath!(st::AbstractMatrix, indices, U::AbstractMatrix{T}, y::AbstractMatrix, subspace) where T
    @inbounds @fastmath for k in subspace
        for b in 1:size(st, 2)
            for i in 1:size(U, 1)
                y[i, b] = zero(T)
                for j in 1:size(U, 2)
                    y[i, b] += U[i, j] * st[k + indices[j], b]
                end
            end

            for i in 1:size(U, 1)
                st[k + indices[i], b] = y[i, b]
            end
        end
    end
    return st
end

This yields

julia> y = similar(st, (1 << n, 100));

julia> @benchmark subspace_mm!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     216.051 μs (0.00% GC)
  median time:      216.350 μs (0.00% GC)
  mean time:        216.578 μs (0.00% GC)
  maximum time:     360.086 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mm_inbounds!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     215.090 μs (0.00% GC)
  median time:      216.418 μs (0.00% GC)
  mean time:        216.404 μs (0.00% GC)
  maximum time:     351.244 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mm_inbounds_fastmath!($(copy(st)), $indices, $SU, $y, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     215.981 μs (0.00% GC)
  median time:      217.641 μs (0.00% GC)
  mean time:        217.920 μs (0.00% GC)
  maximum time:     352.142 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

For reference, I get this on the first benchmark:

julia> @benchmark subspace_smm!($(copy(st)), $indices, $SU, $subspace)
BenchmarkTools.Trial:
  memory estimate:  4.50 KiB
  allocs estimate:  32
  --------------
  minimum time:     114.603 μs (0.00% GC)
  median time:      116.125 μs (0.00% GC)
  mean time:        116.681 μs (0.06% GC)
  maximum time:     880.390 μs (79.72% GC)
  --------------
  samples:          10000
  evals/sample:     1

So, how can we work around those two limitations? Split the complex arrays into the real and imaginary components, and replace SMatrix with MMatrix. For example:

using StaticArrays, LoopVectorization, BenchmarkTools

n = 3
st = rand(ComplexF64, 1 << 8, 100);
y = similar(st, (1 << n, 100));
U = rand(ComplexF64, 1 << n, 1 << n);
SU = @SMatrix rand(ComplexF64, 1 << n, 1 << n);
indices = [1, 2, 5, 6, 9, 10, 13, 14];
subspace = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242];

function subspace_smm!(st::AbstractMatrix, indices, U::AbstractMatrix, subspace)
    @inbounds @views for k in subspace
        idx = k .+ indices
        for b in 1:size(st, 2)
            st[idx, b] = U * st[idx, b]
        end
    end
    return st
end

function subspace_mm!(
    st_re::AbstractMatrix, st_im::AbstractMatrix,
    indices, U_re::AbstractMatrix{T}, U_im::AbstractMatrix{T}, subspace
) where {T <: Base.HWReal}
    for k in subspace
        @avx unroll=(2,8) for b in axes(st_re, 2)
            for i in axes(U_re, 1)
                y_re= zero(T)
                y_im = zero(T)
                for j in axes(U_re, 2)
                    U_re_i_j = U_re[i,j]
                    U_im_i_j = U_im[i,j]
                    st_re_j_b = st_re[k + indices[j], b]
                    st_im_j_b = st_im[k + indices[j], b]
                    y_re += U_re_i_j * st_re_j_b - U_im_i_j * st_im_j_b
                    y_im += U_re_i_j * st_im_j_b + U_im_i_j * st_re_j_b
                end
                st_re[k + indices[i], b] = y_re
                st_im[k + indices[i], b] = y_im
            end
        end
    end
    return (st_re, st_im)
end

st_re = transpose(real.(transpose(st)));
st_im = transpose(imag.(transpose(st)));
MU_re = MMatrix(real.(SU));
MU_im = MMatrix(imag.(SU));

st2 = subspace_smm!((copy(st)), indices, SU, subspace);
st_re2, st_im2 = subspace_mm!((deepcopy(st_re)), (deepcopy(st_im)), indices, MU_re, MU_im,  subspace);
st_re2 ≈ real.(st2)
st_im2 ≈ imag.(st2)

sindices = SVector(indices...);
@benchmark subspace_smm!($(copy(st)), $sindices, $SU, $subspace)

@benchmark subspace_mm!($(deepcopy(st_re)), $(deepcopy(st_im)), $indices, $MU_re, $MU_im,  $subspace)

Note all those transposes, and the deepcopys in the benchmark in place of copy (copy will create a regular Array, transposing the memory layout, while deepcopy will make a Transpose{T,Array{T,2}} just like the parent; see the illustration below). This is because I want a row-major memory layout. I want row-major because, with column-major storage, st_re[k + indices[i], b] would require accessing discontiguous memory: successive ks are not necessarily 1 apart, and indices face the same problem. Either would force using gather and scatter operations to load/store elements, which are much slower than the corresponding contiguous loads/stores. If we transpose the underlying memory (make the array row-major), then successive bs are indeed contiguous.
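
To illustrate the copy/deepcopy distinction (a hypothetical REPL session, not part of the benchmark):

julia> using LinearAlgebra

julia> At = transpose(rand(4, 4));

julia> typeof(copy(At)) # materializes a plain column-major Matrix
Matrix{Float64} (alias for Array{Float64, 2})

julia> typeof(deepcopy(At)) # keeps the Transpose wrapper, preserving the row-major layout
Transpose{Float64, Matrix{Float64}}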

Before showing the benchmarks, the other 3 limitations:

  3. Currently, LoopVectorization only supports "perfect" loop nests, i.e. nesting-doll-like loops: each loop has only 1 or 0 loops inside it. The original @avx code had two for i... loops within the for b... loop. For some reason I never bothered to make LoopVectorization check for this, so it'll result in undefined behavior. So in the above code, I collapsed those loops together and got rid of the need for y. I've started (slowly) working on a rewrite of LoopVectorization that will address this, allowing there to be multiple loops at each level of the nest. It will also be able to fuse these loops as an optimization.
  4. It currently doesn't track dependencies between loop iterations, and assumes they can be reordered arbitrarily. I'm also addressing that in the rewrite, but it's something to watch out for here, because we're loading from and storing to st_re. I think even after monitoring dependencies, it'll make really aggressive assumptions about indices, i.e. that it can load/store from multiple in parallel without collisions. The answers were right after I moved @avx inside the k loop, instead of outside it, and also manually set unroll=(2,8) to get it to fully unroll i by 8:
    
julia> U_re = MU_re; U_im = MU_im;

julia> ls = LoopVectorization.@avx_debug for b in axes(st_re, 2)
           for i in axes(U_re, 1)
               y_re = zero(T)
               y_im = zero(T)
               for j in axes(U_re, 2)
                   U_re_i_j = U_re[i,j]
                   U_im_i_j = U_im[i,j]
                   st_re_j_b = st_re[k + indices[j], b]
                   st_im_j_b = st_im[k + indices[j], b]
                   y_re += U_re_i_j * st_re_j_b - U_im_i_j * st_im_j_b
                   y_im += U_re_i_j * st_im_j_b + U_im_i_j * st_re_j_b
               end
               st_re[k + indices[i], b] = y_re
               st_im[k + indices[i], b] = y_im
           end
       end;
OPS = Tuple{:numericconstant, Symbol("##zero#823"), LoopVectorization.OperationStruct(...), ...} # long dump of the parsed LoopSet (OperationStructs, ArrayRefStructs, loop bounds, and StridedPointer arguments) elided

julia> LoopVectorization.choose_order(ls) # chooses to unroll b by 2 and i by 4, additionally vectorizing b
([:b, :i, :j], :b, :i, :b, 2, 4)

It chose to unroll `b` by 2 and `i` by 4 on my computer, but unrolling `i` by 4 caused incorrect answers, so I set `unroll=(2,8)` instead. This is all architecture dependent, meaning `unroll=(2,8)` could mean something different on another computer; you may want to double check on your own. That makes it a truly awful hack; I just wanted to post something demoing (a) correct results and (b) a speedup.
More importantly, this is a terrible hack in that we're relying on LoopVectorization's current failure to track dependencies between loop iterations. If we remove `@avx`, we'll get the wrong answer: LLVM won't unroll `i` by 8 (i.e., it will unroll by 1), so only 1/8 of the answers come out correct. We're counting on this limitation to set a large unroll factor and get behavior matching matmul on blocks.

  5. It currently doesn't like `for k in subspace`, either. The fix of course is just `for _k in eachindex(subspace); k = subspace[_k]` (sketched after this list). I'll address that in the rewrite to make the transform automatic. The current version already does similar tricks for `CartesianIndices{N}`, substituting `N` loops for the one loop, but the associated code assumes `ArrayInterface.static_first(r):ArrayInterface.static_last(r)` is approximately idempotent, which obviously isn't the case for vectors, and it makes this assumption in the macro rather than the `@generated` body, meaning the assumption is made before any awareness of types. So it'd take a bit more work to fix, making it easier to lump in with rewriting the rest of the code anyway.
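
A minimal sketch of the workaround from item 5, iterating a unit range and indexing into subspace manually, so the loop @avx sees is range-based (the body is the same as in the functions above):

for _k in eachindex(subspace)
    k = subspace[_k]
    # ... same loop body as above, using k ...
end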

Finally, a version without those transposed wrappers:
function subspace_mm_t!(
    st_re::AbstractMatrix, st_im::AbstractMatrix,
    indices, U_re::AbstractMatrix{T}, U_im::AbstractMatrix{T}, subspace
) where {T <: Base.HWReal}
    for k in subspace
        @avx unroll=(2,8) for b in axes(st_re, 1)
            for i in axes(U_re, 1)
                y_re= zero(T)
                y_im = zero(T)
                for j in axes(U_re, 2)
                    U_re_i_j = U_re[i,j]
                    U_im_i_j = U_im[i,j]
                    st_re_j_b = st_re[b, k + indices[j]]
                    st_im_j_b = st_im[b, k + indices[j]]
                    y_re += U_re_i_j * st_re_j_b - U_im_i_j * st_im_j_b
                    y_im += U_re_i_j * st_im_j_b + U_im_i_j * st_re_j_b
                end
                st_re[b, k + indices[i]] = y_re
                st_im[b, k + indices[i]] = y_im
            end
        end
    end
    return (st_re, st_im)
end

st_re_t = real.(transpose(st));
st_im_t = imag.(transpose(st));

st_re_t2, st_im_t2 = subspace_mm_t!((copy(st_re_t)), (copy(st_im_t)), indices, MU_re, MU_im,  subspace);
st_re_t2 ≈ real.(st2)'
st_im_t2 ≈ imag.(st2)'

@benchmark subspace_mm_t!($(copy(st_re_t)), $(copy(st_im_t)), $indices, $MU_re, $MU_im,  $subspace)

Finally, I get

julia> st2 = subspace_smm!((copy(st)), indices, SU, subspace);

julia> st_re2, st_im2 = subspace_mm!((deepcopy(st_re)), (deepcopy(st_im)), indices, MU_re, MU_im,  subspace);

julia> st_re2 ≈ real.(st2)
true

julia> st_im2 ≈ imag.(st2)
true

julia> sindices = SVector(indices...);

julia> @benchmark subspace_smm!($(copy(st)), $sindices, $SU, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     110.086 μs (0.00% GC)
  median time:      110.462 μs (0.00% GC)
  mean time:        110.517 μs (0.00% GC)
  maximum time:     252.352 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mm!($(deepcopy(st_re)), $(deepcopy(st_im)), $indices, $MU_re, $MU_im,  $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.655 μs (0.00% GC)
  median time:      33.060 μs (0.00% GC)
  mean time:        33.108 μs (0.00% GC)
  maximum time:     98.802 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> st_re_t2, st_im_t2 = subspace_mm_t!((copy(st_re_t)), (copy(st_im_t)), indices, MU_re, MU_im,  subspace);

julia> st_re_t2 ≈ real.(st2)'
true

julia> st_im_t2 ≈ imag.(st2)'
true

julia> @benchmark subspace_mm_t!($(copy(st_re_t)), $(copy(st_im_t)), $indices, $MU_re, $MU_im,  $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.613 μs (0.00% GC)
  median time:      31.952 μs (0.00% GC)
  mean time:        31.992 μs (0.00% GC)
  maximum time:     87.974 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Another solution:

using LoopVectorization, PaddedMatrices
function AmulB!(Cre, Cim, Are, Aim, Bre, Bim)
    @avx for n in axes(Cre,2), m in axes(Cre,1)
        Cre_mn = zero(eltype(Cre))
        Cim_mn = zero(eltype(Cim))
        for k in axes(Bre,1)
            Cre_mn += Are[m,k] * Bre[k,n] - Aim[m,k] * Bim[k,n]
            Cim_mn += Are[m,k] * Bim[k,n] + Aim[m,k] * Bre[k,n]
        end
        Cre[m,n] = Cre_mn
        Cim[m,n] = Cim_mn
    end
end

function subspace_smm_mul!(st_re::AbstractMatrix{T}, st_im::AbstractMatrix, indices, U_re::AbstractMatrix, U_im, subspace) where {T}
    A_re = StrideArray{T}(undef, (PaddedMatrices.size(st_re, StaticInt(1)), PaddedMatrices.static_length(indices)))
    A_im = similar(A_re)
    C_re = similar(A_re)
    C_im = similar(A_im)
    for k in subspace
        idx = k .+ indices
        @avx for i in eachindex(idx), j in axes(A_re,1)
            A_re[j,i] = st_re[j,idx[i]]
            A_im[j,i] = st_im[j,idx[i]]
        end
        @gc_preserve AmulB!(C_re, C_im, A_re, A_im, U_re', U_im')
        @avx for i in eachindex(idx), j in axes(A_re,1)
            st_re[j,idx[i]] = C_re[j,i]
            st_im[j,idx[i]] = C_im[j,i] 
        end
    end
    return st_re, st_im
end

# Need a more convenient API for this
function stride_array(A, sz = PaddedMatrices.size(A))
    StrideArray(PtrArray(stridedpointer(A), sz, PaddedMatrices.ArrayInterface.dense_dims(A)), A)
end

# Make the first dimension static
s_st_re = stride_array(st_re_t, (StaticInt(size(st_re_t,1)),size(st_re_t,2)));
s_st_im = stride_array(st_im_t, (StaticInt(size(st_im_t,1)),size(st_im_t,2)));

s_indices = stride_array(indices, (StaticInt(length(indices)),));

s_st_re2, s_st_im2 = subspace_smm_mul!(copy(s_st_re), copy(s_st_im), s_indices, MU_re, MU_im, subspace)

s_st_re2 ≈ real.(st2)'
s_st_im2 ≈ imag.(st2)'

@benchmark subspace_smm_mul!($(copy(s_st_re)), $(copy(s_st_im)), $s_indices, $MU_re, $MU_im, $subspace)

This yields:

julia> s_st_re2 ≈ real.(st2)'
true

julia> s_st_im2 ≈ imag.(st2)'
true

julia> @benchmark subspace_smm_mul!($(copy(s_st_re)), $(copy(s_st_im)), $s_indices, $MU_re, $MU_im, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.063 μs (0.00% GC)
  median time:      43.460 μs (0.00% GC)
  mean time:        43.526 μs (0.00% GC)
  maximum time:     97.375 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

This has the advantage that it doesn't rely on undefined behavior and should work when you change array sizes. It has the disadvantage that, as is, it does a lot of copying of memory back and forth.

Once I (or someone else) implement support for this:

julia> stridedpointer(view(rand(16,16), :, [1,2,5,7]))
ERROR: "Memory access for SubArray{Float64, 2, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Vector{Int64}}, false} not implemented yet."
Stacktrace:
 [1] memory_reference
   @ ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:45 [inlined]
 [2] memory_reference
   @ ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:35 [inlined]
 [3] stridedpointer(A::SubArray{Float64, 2, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Vector{Int64}}, false})
   @ VectorizationBase ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:72
 [4] top-level scope
   @ REPL[68]:1

we could get rid of Cre/Cim and just pass views of st_re/st_im, as sketched below.
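
Purely as a hypothetical sketch of what that would enable inside subspace_smm_mul! (today this path fails with the "not implemented yet" error above):

idx = k .+ indices
C_re_view = view(st_re, :, idx) # a view indexed by a Vector{Int}
C_im_view = view(st_im, :, idx)
# Once stridedpointer supports such views, this would replace the explicit
# copies into C_re/C_im:
# @gc_preserve AmulB!(C_re_view, C_im_view, A_re, A_im, U_re', U_im')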

Roger-luo commented 3 years ago

Wow, thanks for such a detailed reply! I didn't expect there'd still be a 3x speedup compared to the SA version; this is very impressive! But I'm wondering if @avx accepts reinterpret, since I won't be able to change the memory layout of st, and because this operation will be performed on st at sizes between (1 << 8, 1000) and (1 << 30, 1000), using real and imag to copy the memory could be very expensive at larger scales. (Or maybe it's possible for LoopVectorization to specialize ComplexF64 and ComplexF32 using reinterpret?)

chriselrod commented 3 years ago

The new reinterpret(reshape, ...) is great for this:

julia> A = rand(ComplexF64, 3, 4)
3×4 Matrix{ComplexF64}:
 0.255597+0.833728im  0.538941+0.632544im  0.689762+0.322061im  0.0550905+0.186901im
 0.392467+0.609713im  0.408867+0.111692im  0.574269+0.602653im   0.760552+0.385366im
 0.221208+0.544642im  0.928497+0.13649im   0.782709+0.872457im   0.460958+0.809076im

julia> reinterpret(reshape, Float64, A)
2×3×4 reinterpret(reshape, Float64, ::Matrix{ComplexF64}) with eltype Float64:
[:, :, 1] =
 0.255597  0.392467  0.221208
 0.833728  0.609713  0.544642

[:, :, 2] =
 0.538941  0.408867  0.928497
 0.632544  0.111692  0.13649

[:, :, 3] =
 0.689762  0.574269  0.782709
 0.322061  0.602653  0.872457

[:, :, 4] =
 0.0550905  0.760552  0.460958
 0.186901   0.385366  0.809076

It should work:

julia> stridedpointer(reinterpret(reshape, Float64, A))
VectorizationBase.StridedPointer{Float64, 3, 1, 0, (1, 2, 3), Tuple{StaticInt{8}, StaticInt{16}, Int64}, Tuple{StaticInt{1}, StaticInt{1}, StaticInt{1}}}(Ptr{Float64} @0x00007f7197724050, (StaticInt{8}(),
StaticInt{16}(), 48), (StaticInt{1}(), StaticInt{1}(), StaticInt{1}()))

julia> using LoopVectorization

julia> LoopVectorization.check_args(reinterpret(reshape, Float64, A))
true

Here is an implementation using the same inputs as the old version:

using StaticArrays, BenchmarkTools

n = 3
st = rand(ComplexF64, 1 << 8, 100);
y = similar(st, (1 << n, 100));
U = rand(ComplexF64, 1 << n, 1 << n);
SU = @SMatrix rand(ComplexF64, 1 << n, 1 << n);
indices = [1, 2, 5, 6, 9, 10, 13, 14];
subspace = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242];

function subspace_smm!(st::AbstractMatrix, indices, U::AbstractMatrix, subspace)
    @inbounds @views for k in subspace
        idx = k .+ indices
        for b in 1:size(st, 2)
            st[idx, b] = U * st[idx, b]
        end
    end
    return st
end

using PaddedMatrices, LoopVectorization

function subspace_smm_avx!(st::AbstractMatrix{Complex{T}}, indices::StrideArray, U::AbstractMatrix, subspace) where {T}

    C_re = StrideArray{T}(undef, (PaddedMatrices.static_length(indices), StaticInt{8}()))
    C_im = StrideArray{T}(undef, (PaddedMatrices.static_length(indices), StaticInt{8}()))
    U_re = StrideArray{T}(undef, (PaddedMatrices.size(U, StaticInt{1}()), PaddedMatrices.size(U, StaticInt{2}())))
    U_im = StrideArray{T}(undef, (PaddedMatrices.size(U, StaticInt{1}()), PaddedMatrices.size(U, StaticInt{2}())))
    Ur = reinterpret(reshape, T, U)
    @inbounds @simd ivdep for i ∈ eachindex(U_re)
        U_re[i] = Ur[1,i]
        U_im[i] = Ur[2,i]
    end
    str = reinterpret(reshape, T, st)

    Bmax = size(st,2)
    for k in subspace
        # idx = k .+ indices
        # _k = k - 1
        for _b ∈ 0:(Bmax-1) >>> 3
            b = _b << 3
            bmax = b + 8
            if bmax ≤ Bmax # full static block
                @avx for n ∈ 1:8, m ∈ axes(U, 1)
                    C_re_m_n = zero(T)
                    C_im_m_n = zero(T)
                    for i ∈ axes(U, 2)
                        str_i = k + indices[i]
                        C_re_m_n += U_re[m,i] * str[1,str_i,n+b] - U_im[m,i] * str[2,str_i,n+b]
                        C_im_m_n += U_re[m,i] * str[2,str_i,n+b] + U_im[m,i] * str[1,str_i,n+b]
                    end
                    C_re[m,n] = C_re_m_n
                    C_im[m,n] = C_im_m_n
                end
                @avx for n ∈ 1:8, m ∈ axes(U, 1)
                    str_m = k + indices[m]
                    str[1,str_m,n+b] = C_re[m,n]
                    str[2,str_m,n+b] = C_im[m,n]
                end
                # AmulB!(C_re, C_im, U_re, U_im, 
            else # dynamic block
                Nmax = 8 + Bmax - bmax
                @avx for n ∈ 1:Nmax, m ∈ axes(U, 1)
                    C_re_m_n = zero(T)
                    C_im_m_n = zero(T)
                    for i ∈ axes(U, 2)
                        str_i = k + indices[i]
                        C_re_m_n += U_re[m,i] * str[1,str_i,n+b] - U_im[m,i] * str[2,str_i,n+b]
                        C_im_m_n += U_re[m,i] * str[2,str_i,n+b] + U_im[m,i] * str[1,str_i,n+b]
                    end
                    C_re[m,n] = C_re_m_n
                    C_im[m,n] = C_im_m_n
                end
                @avx for n ∈ 1:Nmax, m ∈ axes(U, 1)
                    str_m = k + indices[m]
                    str[1,str_m,n+b] = C_re[m,n]
                    str[2,str_m,n+b] = C_im[m,n]
                end
            end
        end
    end
    return st
end

static_indices = SVector(indices...);
stride_indices = StrideArray(MVector(static_indices));

st2 = subspace_smm!((copy(st)), static_indices, SU, subspace);
st3 = subspace_smm_avx!((copy(st)), stride_indices, SU, subspace);
st2 ≈ st3

@benchmark subspace_smm!($(copy(st)), $static_indices, $SU, $subspace)
@benchmark subspace_smm_avx!($(copy(st)), $stride_indices, $SU, $subspace)

Results (on a different computer than last time):

julia> st2 = subspace_smm!((copy(st)), static_indices, SU, subspace);

julia> st3 = subspace_smm_avx!((copy(st)), stride_indices, SU, subspace);

julia> st2 ≈ st3
true

julia> @benchmark subspace_smm!($(copy(st)), $static_indices, $SU, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     120.998 μs (0.00% GC)
  median time:      124.350 μs (0.00% GC)
  mean time:        127.713 μs (0.00% GC)
  maximum time:     207.639 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_smm_avx!($(copy(st)), $stride_indices, $SU, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.980 μs (0.00% GC)
  median time:      38.166 μs (0.00% GC)
  mean time:        39.356 μs (0.00% GC)
  maximum time:     78.841 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

From a performance perspective, our goal is to make the most out of our vector units. That means we now want to SIMD across U. We pay the price of going from an array of structs (U) to separate arrays (U_re, U_im) once, up front. While loading vectors from U, we're only loading scalars from str, so we can load directly, with no need to change the memory layout.

We do need a separate array to copy memory into, however, because neither A nor B is allowed to alias C in C = A * B. So we create a destination C_{re/im} with a layout that matches U_{re/im} for efficient writing, and then copy it over to st when done.

Also, we iterate over blocks of b rather than 1 at a time. That's because loads from U are independent of b, so by unrolling it we get more reuse. AVX512 would probably benefit from even larger block sizes, but AVX2 probably won't really benefit from more than 2 (based on the registers used/available in the inner loops). LoopVectorization will subdivide those blocks appropriately, so making them larger should be safe. The blocking arithmetic is worked through below.
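
To make the blocking arithmetic concrete, here it is worked through for the block size of 8 used above, with an illustrative Bmax = 100 (these names mirror the loop in subspace_smm_avx!):

Bmax = 100                   # size(st, 2)
for _b in 0:(Bmax - 1) >>> 3 # _b runs over 0:12, i.e. 13 blocks
    b = _b << 3              # block offsets 0, 8, 16, ..., 96
    bmax = b + 8
    if bmax ≤ Bmax
        # full static block: columns n ∈ 1:8, shifted by b
    else
        Nmax = 8 + Bmax - bmax # final partial block: 8 + 100 - 104 = 4 columns
    end
end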

I wanted a statically sized upper limit, so that we could stack allocate all our temporaries.

Roger-luo commented 3 years ago

Wow, this is very impressive. I further benchmarked your implementation against a manual loop version, and fully expanded 4x4 and 8x8 versions. @avx is slower in the 2x2 and 4x4 cases (which I guess requires some tuning of the block size, but these two cases are fairly easy to implement directly by unrolling the loop), but faster when U is 8x8 and larger. I had to change the implementation a bit since I need to generate the subspace indices on the fly (the bsubspace and bcomspace functions); I'll post a repo link later with the utilities for creating the subspace object, so one can reproduce the benchmarks I list here.

Implementation:

```julia
function subspace_mul!(st::AbstractMatrix, comspace, U::AbstractMatrix{T}, subspace) where T
    y = similar(st, (size(U, 1), ))
    indices = [b + 1 for b in comspace]::Vector{Int}
    idx = similar(indices)
    @inbounds for k in subspace, b in 1:size(st, 2)
        for i in 1:size(U, 1)
            idx[i] = k + indices[i]
        end
        for i in 1:size(U, 1)
            y[i] = zero(T)
            for j in 1:size(U, 2)
                y[i] += U[i, j] * st[idx[j], b]
            end
        end
        for i in 1:size(U, 1)
            st[idx[i], b] = y[i]
        end
    end
    return st
end

function subspace_mul_avx!(st::AbstractMatrix, comspace, U::AbstractMatrix{Complex{T}}, subspace) where T
    indices = StrideArray{Int}(undef, (length(comspace), ))
    @simd ivdep for i in 1:length(comspace)
        indices[i] = comspace[i] + 1
    end

    C_re = StrideArray{T}(undef, (PaddedMatrices.static_length(indices), StaticInt{32}()))
    C_im = StrideArray{T}(undef, (PaddedMatrices.static_length(indices), StaticInt{32}()))
    U_re = StrideArray{T}(undef, (PaddedMatrices.size(U, StaticInt{1}()), PaddedMatrices.size(U, StaticInt{2}())))
    U_im = StrideArray{T}(undef, (PaddedMatrices.size(U, StaticInt{1}()), PaddedMatrices.size(U, StaticInt{2}())))
    Ur = reinterpret(reshape, T, U)
    @inbounds @simd ivdep for i ∈ eachindex(U_re)
        U_re[i] = Ur[1,i]
        U_im[i] = Ur[2,i]
    end
    str = reinterpret(reshape, T, st)

    Bmax = size(st,2)
    for k in subspace
        # idx = k .+ indices
        # _k = k - 1
        for _b ∈ 0:(Bmax-1) >>> 5
            b = _b << 5
            bmax = b + 32
            if bmax ≤ Bmax # full static block
                @avx for n ∈ 1:32, m ∈ axes(U, 1)
                    C_re_m_n = zero(T)
                    C_im_m_n = zero(T)
                    for i ∈ axes(U, 2)
                        str_i = k + indices[i]
                        C_re_m_n += U_re[m,i] * str[1,str_i,n+b] - U_im[m,i] * str[2,str_i,n+b]
                        C_im_m_n += U_re[m,i] * str[2,str_i,n+b] + U_im[m,i] * str[1,str_i,n+b]
                    end
                    C_re[m,n] = C_re_m_n
                    C_im[m,n] = C_im_m_n
                end
                @avx for n ∈ 1:32, m ∈ axes(U, 1)
                    str_m = k + indices[m]
                    str[1,str_m,n+b] = C_re[m,n]
                    str[2,str_m,n+b] = C_im[m,n]
                end
                # AmulB!(C_re, C_im, U_re, U_im,
            else # dynamic block
                Nmax = 32 + Bmax - bmax
                @avx for n ∈ 1:Nmax, m ∈ axes(U, 1)
                    C_re_m_n = zero(T)
                    C_im_m_n = zero(T)
                    for i ∈ axes(U, 2)
                        str_i = k + indices[i]
                        C_re_m_n += U_re[m,i] * str[1,str_i,n+b] - U_im[m,i] * str[2,str_i,n+b]
                        C_im_m_n += U_re[m,i] * str[2,str_i,n+b] + U_im[m,i] * str[1,str_i,n+b]
                    end
                    C_re[m,n] = C_re_m_n
                    C_im[m,n] = C_im_m_n
                end
                @avx for n ∈ 1:Nmax, m ∈ axes(U, 1)
                    str_m = k + indices[m]
                    str[1,str_m,n+b] = C_re[m,n]
                    str[2,str_m,n+b] = C_im[m,n]
                end
            end
        end
    end
    return st
end

function subspace_mul4x4!(st::AbstractMatrix{T}, comspace, U::AbstractMatrix, subspace, offset=0) where T
    Base.Cartesian.@nextract 4 indices i -> comspace[i] + 1
    Base.Cartesian.@nextract 4 U i -> begin
        Base.Cartesian.@nextract 4 U_i j -> U[i, j]
    end
    @inbounds for k in subspace
        Base.Cartesian.@nextract 4 idx i -> k + indices_i + offset
        for b in 1:size(st, 2)
            Base.Cartesian.@nexprs 4 i -> begin
                y_i = zero(T)
                Base.Cartesian.@nexprs 4 j -> begin
                    y_i += U_i_j * st[idx_j, b]
                end
            end
            Base.Cartesian.@nexprs 4 i -> begin
                st[idx_i, b] = y_i
            end
        end
    end
    return st
end

function subspace_mul8x8!(st::AbstractMatrix{T}, comspace, U::AbstractMatrix, subspace, offset=0) where T
    Base.Cartesian.@nextract 8 indices i -> comspace[i] + 1
    Base.Cartesian.@nextract 8 U i -> begin
        Base.Cartesian.@nextract 8 U_i j -> U[i, j]
    end
    @inbounds for k in subspace
        Base.Cartesian.@nextract 8 idx i -> k + indices_i + offset
        for b in 1:size(st, 2)
            Base.Cartesian.@nexprs 8 i -> begin
                y_i = zero(T)
                Base.Cartesian.@nexprs 8 j -> begin
                    y_i += U_i_j * st[idx_j, b]
                end
            end
            Base.Cartesian.@nexprs 8 i -> begin
                st[idx_i, b] = y_i
            end
        end
    end
    return st
end
```

I set the block size to 32 here, since I found it works better than 8 on my machine. But I'm not sure how one should tune this parameter.

Benchmark:

4x4 benchmark:

```julia
n = 8
locs = Locations((1, 3))
st = rand(ComplexF64, 1 << n, 100)
U = rand(ComplexF64, 1 << 2, 1 << 2)
subspace = bsubspace(n, locs)
comspace = bcomspace(n, locs)

julia> @benchmark subspace_mul4x4!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.049 μs (0.00% GC)
  median time:      45.199 μs (0.00% GC)
  mean time:        45.260 μs (0.00% GC)
  maximum time:     98.788 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mul!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  368 bytes
  allocs estimate:  3
  --------------
  minimum time:     145.357 μs (0.00% GC)
  median time:      169.142 μs (0.00% GC)
  mean time:        172.254 μs (0.00% GC)
  maximum time:     218.237 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mul_avx!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  2.80 KiB
  allocs estimate:  5
  --------------
  minimum time:     91.348 μs (0.00% GC)
  median time:      93.869 μs (0.00% GC)
  mean time:        94.103 μs (0.00% GC)
  maximum time:     107.998 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

8x8 benchmark:

```julia
n = 8
locs = Locations((1, 3, 4))
st = rand(ComplexF64, 1 << n, 100)
U = rand(ComplexF64, 1 << 3, 1 << 3)
subspace = bsubspace(n, locs)
comspace = bcomspace(n, locs)

julia> @benchmark subspace_mul8x8!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     100.238 μs (0.00% GC)
  median time:      102.818 μs (0.00% GC)
  mean time:        102.973 μs (0.00% GC)
  maximum time:     129.718 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mul!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  496 bytes
  allocs estimate:  3
  --------------
  minimum time:     203.286 μs (0.00% GC)
  median time:      234.566 μs (0.00% GC)
  mean time:        226.253 μs (0.00% GC)
  maximum time:     274.735 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mul_avx!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  5.61 KiB
  allocs estimate:  5
  --------------
  minimum time:     72.709 μs (0.00% GC)
  median time:      74.358 μs (0.00% GC)
  mean time:        74.602 μs (0.00% GC)
  maximum time:     213.575 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

16x16 benchmark:

```julia
n = 8
locs = Locations((1, 3, 4, 5))
st = rand(ComplexF64, 1 << n, 100)
U = rand(ComplexF64, 1 << 4, 1 << 4)
subspace = bsubspace(n, locs)
comspace = bcomspace(n, locs)

julia> @benchmark subspace_mul!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  752 bytes
  allocs estimate:  3
  --------------
  minimum time:     421.922 μs (0.00% GC)
  median time:      438.516 μs (0.00% GC)
  mean time:        448.164 μs (0.00% GC)
  maximum time:     703.526 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark subspace_mul_avx!($(copy(st)), $comspace, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  12.70 KiB
  allocs estimate:  5
  --------------
  minimum time:     101.148 μs (0.00% GC)
  median time:      104.228 μs (0.00% GC)
  mean time:        104.850 μs (0.43% GC)
  maximum time:     2.390 ms (94.51% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

But I still have a few questions about the above implementation:

  1. I tried to use transpose(rand(100, 1<<8)) instead of rand(1<<8, 100), based on the previous discussion; however, stridedpointer complains:
ERROR: ArgumentError: Parent must be strided.
Stacktrace:
  [1] |>(x::ArgumentError, f::typeof(throw))
    @ Base ./operators.jl:839
  [2] strides(a::Base.ReinterpretArray{Float64, 3, ComplexF64, LinearAlgebra.Transpose{ComplexF64, Matrix{ComplexF64}}, true})
    @ Base ./reinterpretarray.jl:157
  [3] strides
    @ ~/.julia/packages/ArrayInterface/q5Rwc/src/stridelayout.jl:287 [inlined]
  [4] bytestrides
    @ ~/.julia/packages/VectorizationBase/oXigE/src/strided_pointers/stridedpointers.jl:21 [inlined]
  [5] stridedpointer
    @ ~/.julia/packages/VectorizationBase/oXigE/src/strided_pointers/stridedpointers.jl:72 [inlined]

Is it beneficial to be able to use the more contiguous layout in the last implementation? If it is, how can I use it with StrideArray?

  2. In my benchmark, it seems Julia fails to allocate the temporaries on the stack. I tried to remove the subspace and comspace and use constant indices like in the previous benchmarks, but that didn't help. I'm running this on Julia 1.6; could this be because of the updates in Julia 1.6? @chriselrod, which Julia version were you using?
chriselrod commented 3 years ago

In my benchmark, it seems Julia fails to allocate the temporaries on the stack. I tried to remove the subspace and comspace and use constant indices like in the previous benchmarks, but that didn't help. I'm running this on Julia 1.6; could this be because of the updates in Julia 1.6? @chriselrod, which Julia version were you using?

I was using Julia 1.6. For the arrays to be stack allocated (in Julia -- Fortran can stack allocate dynamically sized arrays), the size has to be known at compile time. So all the dimensions of these temporaries:

    indices = StrideArray{Int}(undef, (length(comspace), ))
    @simd ivdep for i in 1:length(comspace)
        indices[i] = comspace[i] + 1
    end

    C_re = StrideArray{T}(undef, (PaddedMatrices.static_length(indices), StaticInt{32}()))
    C_im = StrideArray{T}(undef, (PaddedMatrices.static_length(indices), StaticInt{32}()))
    U_re = StrideArray{T}(undef, (PaddedMatrices.size(U, StaticInt{1}()), PaddedMatrices.size(U, StaticInt{2}())))
    U_im = StrideArray{T}(undef, (PaddedMatrices.size(U, StaticInt{1}()), PaddedMatrices.size(U, StaticInt{2}())))

must be known at compile time. If a dimension is an integer instead of a StaticInt:

julia> typeof(StrideArray{Float64}(undef, (4,5)).data)
Vector{Float64} (alias for Array{Float64, 1})

julia> typeof(StrideArray{Float64}(undef, (StaticInt(4),5)).data)
Vector{Float64} (alias for Array{Float64, 1})

julia> typeof(StrideArray{Float64}(undef, (StaticInt(4),StaticInt(5))).data)
PaddedMatrices.MemoryBuffer{20, Float64}

then the memory of a StrideArray is backed by a regular Vector instead of a mutable struct. Mutable structs are stack allocated when they don't escape. PaddedMatrices.@gc_preserve uses GC.@preserve on the mutable struct and then strips the memory backing from all StrideArrays being passed to the following call, so that's the recommended approach for avoiding heap allocations when you need to pass them as function arguments but want to keep them stack allocated.
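A minimal sketch of that pattern (assuming the PaddedMatrices API described above; sum_stack is an illustrative name):

```julia
using PaddedMatrices  # provides StrideArray, StaticInt, and @gc_preserve here

function sum_stack()
    # every dimension is a StaticInt, so the buffer is a MemoryBuffer
    # (a mutable struct) that can live on the stack if it doesn't escape
    A = StrideArray{Float64}(undef, (StaticInt(4), StaticInt(4)))
    A .= 1.0
    # preserve the buffer and hand `sum` a pointer-based view, so A itself
    # doesn't escape and can stay stack allocated
    PaddedMatrices.@gc_preserve sum(A)
end
```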

Because U is square, and the length of indices equals the number of rows/columns, I'd probably either pick one of these two to be statically sized, or pass in a Val or StaticInt argument, and then be sure to use this static size everywhere, i.e. for all the initializations.
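A sketch of the Val approach (illustrative names; the point is the function barrier that turns the dynamic size into a compile-time constant):

```julia
using LoopVectorization

# one dynamic dispatch here freezes N for the inner method
scale_diag!(st, U) = _scale_diag!(st, U, Val(size(U, 1)))

function _scale_diag!(st, U, ::Val{N}) where {N}
    @avx for b in axes(st, 2), i in 1:N  # 1:N is statically sized in this method
        st[i, b] = U[i, i] * st[i, b]
    end
    return st
end
```

e.g. scale_diag!(rand(8, 100), rand(8, 8)) specializes the inner method on N = 8.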

You could also use

julia> PaddedMatrices.ArrayInterface.indices((C,C,C,C,A), (StaticInt(1),StaticInt(2),StaticInt(2),StaticInt(1),StaticInt(2)))
Base.Slice(StaticInt{1}():StaticInt{1}():StaticInt{8}())

julia> PaddedMatrices.static_length(ans)
StaticInt{8}()

indices here takes a tuple of arrays, and then a tuple of which dimension of each you want to iterate across. It checks that all the corresponding axes are equal, and it also promotes dynamic -> static sizes, so that if any of the axes happen to be statically sized, it returns a static range. You could of course implement simple checks yourself. You may also want to skip the equality checks if you already know the axes are equal. Maybe we should wrap the equality checks in @boundscheck.
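A small usage sketch (assuming indices is in scope after loading LoopVectorization, as in the snippet further below):

```julia
using LoopVectorization

function matvec!(y, A, x)
    # `indices` checks the paired axes match and returns a static range
    # whenever any of the arrays involved is statically sized
    @avx for i in indices((y, A), (1, 1))
        acc = zero(eltype(y))
        for j in indices((A, x), (2, 1))
            acc += A[i, j] * x[j]
        end
        y[i] = acc
    end
    return y
end
```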

julia> using PaddedMatrices, StaticArrays

julia> A = @StrideArray rand(8,8);

julia> B = @MArray rand(8,8);

julia> C = rand(8,8);

julia> PaddedMatrices.size.((A, B, C))
((StaticInt{8}(), StaticInt{8}()), (StaticInt{8}(), StaticInt{8}()), (8, 8))

julia> PaddedMatrices.size.((A, B, C), StaticInt(1))
(StaticInt{8}(), StaticInt{8}(), 8)

julia> PaddedMatrices.size.((A, B, C), StaticInt(2))
(StaticInt{8}(), StaticInt{8}(), 8)

julia> PaddedMatrices.size === PaddedMatrices.ArrayInterface.size # size is defined in `ArrayInterface`
true

julia> PaddedMatrices.size === Base.size
false

That also means that because your U was dynamically sized:

@avx for n ∈ 1:32, m ∈ axes(U, 1)
    C_re_m_n = zero(T)
    C_im_m_n = zero(T)
    for i ∈ axes(U, 2)

both the m ∈ axes(U, 1) and the i ∈ axes(U, 2) loops were dynamically sized.

This probably makes a sizeable difference at small sizes. E.g., the dynamically sized arrays were much slower than the fixed size arrays at 4x4 in the PaddedMatrices README. In the first chart, the green line (jmul!) was dynamic, while the orange and blue (PtrArray and StrideArray) were static. The other two charts have outdated names, but DynamicMatmul was dynamic and FixedSizeArray, PtrArray, and PaddedArray were static. PaddedMatrices does introduce a small amount of overhead on top of raw @avx, so the difference at 4x4 would be a bit smaller if raw @avx is all you use.

So my suggestions are:

  1. Make sure all the temporaries are statically sized and stack allocated.
  2. Make all the loops statically sized when possible. Assuming you've done 1., you could then simply make the appropriate change in all six occurrences:
    @avx for n ∈ 1:32, m ∈ axes(U_re, 1)
    C_re_m_n = zero(T)
    C_im_m_n = zero(T)
    for i ∈ axes(U_re, 2)

This may have also influenced the block sizes you found to be optimal. But, honestly, I only tried 8. So maybe bigger is better.

I have to change the implementation a bit since I need to generate the subspace indices on the fly (the subspace and comspace functions); I'll post a repo link later with the utils for creating the subspace object, so one can reproduce the benchmarks I list here.

Okay, I'll try it once you provide those functions.

Roger-luo commented 3 years ago

Thanks! It took me a while to wrap things up, but I just posted them in a repo - it will be the next-gen simulator backend for Yao.jl and perhaps will also be used by OMEinsum: https://github.com/Roger-luo/BQCESubroutine.jl

I realize this could be quite useful in various areas, and since it's not something implemented by BLAS, it can be quite a nice showcase for LoopVectorization.

You can find the top-level routine broutine!'s benchmarks here: https://github.com/Roger-luo/BQCESubroutine.jl/blob/master/benchmarks/instruct.jl

Regarding this issue, you can find the benchmark scripts here: https://github.com/Roger-luo/BQCESubroutine.jl/blob/master/benchmarks/subspace_mm.jl where the bcomspace and bsubspace functions are provided by the BQCESubroutine.Utilities module.

I'm not sure if it's the best way, but I currently force a recompilation on the size of U here: https://github.com/Roger-luo/BQCESubroutine.jl/blob/master/src/mul.jl#L22

I also changed the interface convention of st when it's a matrix - the b dimension is now always the first dimension and is contiguous, so I changed your implementation accordingly as well. This also makes the StaticArrays version faster.

But I find the @avx version of subspace_mul_generic! is still slower than the manually unrolled version when U is small (4x4). On the other hand, when U is of size 8x8 it seems to show no improvement over the SA version (since the SA version is now faster after the layout change). I'm not sure if there is still room to optimize its performance, but every other size seems to be pretty good now.

Roger-luo commented 3 years ago

@chriselrod so I just fixed the implementation in BQCESubroutine (you need to use the master branch). However, I realize that with the current implementation I can only multithread the outermost loop, since @avx is called inside it. There is a case where the first dimension is larger, so to make use of all the threads one may also want to thread the second loop (the B loop: https://github.com/QuantumBFS/BQCESubroutine.jl/blob/master/src/mul.jl#L146). I'm wondering if I could somehow use @avxt in the latest LoopVectorization for this? I'm not sure what has been improved or whether any of the limitations listed above have been resolved.


E.g., I see that in the convlayer example one can use @avxt on multiple levels of for loops.
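For reference, a toy sketch of @avxt on a two-level loop nest (not the BQCESubroutine kernel; just the shape of the usage):

```julia
using LoopVectorization

# @avxt is the threaded variant of @avx; applied to the whole nest, it can
# distribute work across both loops rather than only the outermost one
function axpby2d!(Y, X, a)
    @avxt for j in axes(X, 2), i in axes(X, 1)
        Y[i, j] = a * X[i, j] + Y[i, j]
    end
    return Y
end
```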

Roger-luo commented 3 years ago

I tried to remove the reinterpret and use a 3-dim StrideArray directly, then reinterpret it later outside (to make the multithreaded code simpler), so I get the following implementation. However, this is slower compared to the original one, and somehow, even though I moved the temporary array C out and made every dim static, this function still allocates...

using StrideArrays
using ArrayInterface
using LoopVectorization
using YaoLocations
using BenchmarkTools
using BQCESubroutine
using BQCESubroutine.Utilities
using BQCESubroutine.Utilities: BitSubspace
using ThreadingUtilities

function subspace_mul_generic!(S::AbstractStrideArray{<:Any, <:Any, T, 3}, indices, C, U::AbstractStrideArray{<:Any, <:Any, T, 3}, subspace, offset=0) where {T <: Base.HWReal}
    Bmax = size(S, 2)
    for s in 1:length(subspace)
        k = subspace[s]
        # idx = k .+ indices
        # _k = k - 1
        for _b ∈ 0:(Bmax-1) >>> 3
            b =    _b << 3;
            bmax = b + 8
            if bmax ≤ Bmax # full static block
                @avx for n ∈ 1:8, m ∈ axes(U, 2)
                    C_re_m_n = zero(T)
                    C_im_m_n = zero(T)
                    for i ∈ axes(U, 3)
                        j = k + indices[i] + offset
                        C_re_m_n += U[1,m,i] * S[1,n+b,j] - U[2,m,i] * S[2,n+b,j]
                        C_im_m_n += U[1,m,i] * S[2,n+b,j] + U[2,m,i] * S[1,n+b,j]
                    end
                    C[1,m,n] = C_re_m_n
                    C[2,m,n] = C_im_m_n
                end
                @avx for n ∈ 1:8, m ∈ axes(U, 2)
                    j = k + indices[m] + offset
                    S[1,n+b,j] = C[1,m,n]
                    S[2,n+b,j] = C[2,m,n]
                end
                # AmulB!(C_re, C_im, U_re, U_im, 
            else # dynamic block
                Nmax = 8 + Bmax - bmax
                @avx for n ∈ 1:Nmax, m ∈ axes(U, 2)
                    C_re_m_n = zero(T)
                    C_im_m_n = zero(T)
                    for i ∈ axes(U, 3)
                        j = k + indices[i] + offset
                        C_re_m_n += U[1,m,i] * S[1,n+b,j] - U[2,m,i] * S[2,n+b,j]
                        C_im_m_n += U[1,m,i] * S[2,n+b,j] + U[2,m,i] * S[1,n+b,j]
                    end
                    C[1,m,n] = C_re_m_n
                    C[2,m,n] = C_im_m_n
                end
                @avx for n ∈ 1:Nmax, m ∈ axes(U, 2)
                    j = k + indices[m] + offset
                    S[1,n+b,j] = C[1,m,n]
                    S[2,n+b,j] = C[2,m,n]
                end
            end
        end
    end
    return S
end

S = @StrideArray rand(Float64, 2, 100, 1<<20)
U = @StrideArray rand(Float64, 2, 8, 8)
C = StrideArray{Float64}(undef, (StaticInt{2}(), ArrayInterface.size(U, 2), StaticInt{8}()))
locs = Locations((1, 3, 5))
subspace = bsubspace(20, locs)
comspace = bcomspace(20, locs)
indices = StrideArray{Int}(undef, (StaticInt{length(comspace)}(), ))
@simd ivdep for i in eachindex(indices)
    indices[i] = comspace[i] + 1
end

@benchmark subspace_mul_generic!($(copy(S)), indices, C, U, subspace)
chriselrod commented 3 years ago

and somehow, even though I moved the temporary array C out and made every dim static, this function still allocates...

julia> @benchmark subspace_mul_generic!($(copy(S)), indices, C, U, subspace)
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     196.039 ms (0.00% GC)
  median time:      199.733 ms (0.00% GC)
  mean time:        200.059 ms (0.00% GC)
  maximum time:     206.663 ms (0.00% GC)
  --------------
  samples:          25
  evals/sample:     1

julia> @benchmark subspace_mul_generic!($(copy(S)), $indices, $C, $U, $subspace)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     196.955 ms (0.00% GC)
  median time:      200.818 ms (0.00% GC)
  mean time:        201.524 ms (0.00% GC)
  maximum time:     208.475 ms (0.00% GC)
  --------------
  samples:          25
  evals/sample:     1

Sometimes not interpolating doesn't allocate. Sometimes it does.

Do you have a recent version splitting U to compare with?

I was extremely concerned when I noticed it now takes 200 milliseconds, while the earlier reported times were all around 200 microseconds, but it seems we're now doing 4096 times the work:

julia> length(subspace)
131072

julia> subspace_old = [0, 2, 16, 18, 32, 34, 48, 50, 64, 66, 80, 82, 96, 98, 112, 114, 128, 130, 144, 146, 160, 162, 176, 178, 192, 194, 208, 210, 224, 226, 240, 242];

julia> length(subspace_old)
32

julia> length(subspace) / ans
4096.0

LoopVectorization seems to be doing the right thing here based on

julia> ls = let k = 1, offset = 1, b = 1
           LoopVectorization.@avx_debug for n ∈ 1:8, m ∈ axes(U, 2)
                           C_re_m_n = zero(T)
                           C_im_m_n = zero(T)
                           for i ∈ axes(U, 3)
                               j = k + indices[i] + offset
                               C_re_m_n += U[1,m,i] * S[1,n+b,j] - U[2,m,i] * S[2,n+b,j]
                               C_im_m_n += U[1,m,i] * S[2,n+b,j] + U[2,m,i] * S[1,n+b,j]
                           end
                           C[1,m,n] = C_re_m_n
                           C[2,m,n] = C_im_m_n
                      end; end

at least on my computer. But there could be codegen problems in VectorizationBase.

Note that it still has to shuffle on loads from U and stores to C.

LoopVectorization is going to do the wrong thing here, though:

                @avx for n ∈ 1:8, m ∈ axes(U, 2)
                    j = k + indices[m] + offset
                    S[1,n+b,j] = C[1,m,n]
                    S[2,n+b,j] = C[2,m,n]
                end

in that it'll shuffle on loads from C and shuffle again on stores to S. The shuffling isn't necessary. There's a small chance LLVM is smart enough to see through this and remove the shuffles, but I wouldn't bet on it.

My suggested fix here is to make C be M x 2 x N so that you no longer need to shuffle on the stores to/loads from C, instead only shuffling on the store to S.
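A sketch of the write-back loop under that layout (writeback! is an illustrative name; same loop structure as above):

```julia
using LoopVectorization

# with C stored M × 2 × N (real/imag on the middle axis), the loads from C
# are contiguous along m; only the store into S still needs to shuffle
function writeback!(S, C, indices, k, b, offset)
    @avx for n ∈ axes(C, 3), m ∈ axes(C, 1)
        j = k + indices[m] + offset
        S[1, n + b, j] = C[m, 1, n]
        S[2, n + b, j] = C[m, 2, n]
    end
    return S
end
```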

Because we're still shuffling on the loads from U, it may also still be worth splitting U into U_re and U_im, or, like my suggestion for C, making it a 3-d array with the second axis being real vs. imaginary instead of the first.

Shuffling is much faster than the gather/scatter it used to do, but it's still slower than contiguous loads and stores. So the question is at what point we reuse loads from U often enough that creating the temporary becomes profitable.

chriselrod commented 3 years ago

@Roger-luo If you have the FMA instruction set, this should be fast:

julia> @time using StaticArrays
  2.432350 seconds (3.14 M allocations: 218.145 MiB, 2.46% gc time)

julia> @time using StrideArrays, LoopVectorization
  9.346095 seconds (7.05 M allocations: 407.005 MiB, 4.09% gc time, 13.52% compilation time)

julia> function cmatmul_array_v2!(Cc::AbstractMatrix{Complex{T}}, Ac::AbstractMatrix{Complex{T}}, Bc::AbstractMatrix{Complex{T}}) where {T}
         C = reinterpret(Float64, Cc);
         A = reinterpret(Float64, Ac);
         B = reinterpret(reshape, Float64, Bc);
         @turbo vectorize=2 for n ∈ indices((C,B),(2,3)), m ∈ indices((C,A),1)
             Cmn = zero(T)
             for k ∈ indices((A,B),(2,2))
                 Amk = A[m,k]
                 Aperm = vpermilps177(Amk)
                 Cmn = vfmaddsub(Amk, B[1,k,n], vfmaddsub(Aperm, B[2,k,n], Cmn))
             end
             C[m,n] = Cmn
         end
         return Cc
     end
cmatmul_array_v2! (generic function with 1 method)

julia> As4x4 = @SMatrix rand(Complex{Float64},4,4);

julia> Bs4x4 = @SMatrix rand(Complex{Float64},4,4);

julia> Ax4 = StrideArray(MArray(As4x4));

julia> Bx4 = StrideArray(MArray(Bs4x4));

julia> Cx4 = similar(Ax4);

julia> cmatmul_array_v2!(Cx4, Ax4, Bx4)
4×4 StrideArraysCore.StaticStrideArray{Tuple{StaticInt{4}, StaticInt{4}}, (true, true), ComplexF64, 2, 1, 0, (1, 2), Tuple{StaticInt{16}, StaticInt{64}}, Tuple{StaticInt{1}, StaticInt{1}}, 16}:
 -0.206183+1.25759im    0.085368+1.3882im    -0.187353+1.02114im   -0.0220778+2.66172im
  0.167889+1.26523im    0.762005+1.76983im   0.0499955+0.905538im    0.476644+2.13508im
  0.443334+1.56142im    0.712381+2.25556im   0.0856212+1.62122im   -0.0968986+3.11118im
 -0.268315+0.769231im  -0.348635+0.903132im  -0.294804+0.568038im   -0.041956+0.731687im

julia> As4x4 * Bs4x4
4×4 SMatrix{4, 4, ComplexF64, 16} with indices SOneTo(4)×SOneTo(4):
 -0.206183+1.25759im    0.085368+1.3882im    -0.187353+1.02114im   -0.0220778+2.66172im
  0.167889+1.26523im    0.762005+1.76983im   0.0499955+0.905538im    0.476644+2.13508im
  0.443334+1.56142im    0.712381+2.25556im   0.0856212+1.62122im   -0.0968986+3.11118im
 -0.268315+0.769231im  -0.348635+0.903132im  -0.294804+0.568038im   -0.041956+0.731687im

julia> @benchmark $(Ref(As4x4))[] * $(Ref(Bs4x4))[]
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  124.272 ns … 227.732 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     125.772 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   127.029 ns ±   6.736 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▄▆▇█▇▆▅▄▂▅▁▄   ▂▂▂▃▃▃▃▂▁▂                                    ▂
  ██████████████▇██████████████▅▅▅▃▃▁▄▃▅▄▅▅▃▄▄▄▁▄▅▅▃▅▄▄▄▅▃▁▃▁▃▄ █
  124 ns        Histogram: log(frequency) by time        142 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark cmatmul_array_v2!($Cx4, $Ax4, $Bx4)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  25.990 ns … 47.732 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.046 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.376 ns ±  2.249 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▁▃▁▃▁▁▁▃▁▁▆▆▄▄▁▃▄▃▁▁▄▄▅▄▄▄▅▄▃▄▄▃▁▃▃▁▄▁▃▃▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▃▃█ █
  26 ns        Histogram: log(frequency) by time      43.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.6.2-pre.59
Commit 134c34397c (2021-07-01 10:27 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
chriselrod commented 3 years ago

Zulip discussion/explanation here.

AVX512 benchmarks:

julia> @benchmark $(Ref(As4x4))[] * $(Ref(Bs4x4))[]
BenchmarkTools.Trial: 10000 samples with 989 evaluations.
 Range (min … max):  49.047 ns … 92.030 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.218 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.332 ns ±  1.035 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▄▇██▆▄▁   ▂▃▄▃▁                              ▁▁▁         ▂
  ▅▇████████▆▇▇█████▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▅▇▇██████▇▄▆▇▆▇▆ █
  49 ns        Histogram: log(frequency) by time      50.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark cmatmul_array_v2!($Cx4, $Ax4, $Bx4)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.068 ns … 42.729 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.106 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.128 ns ±  0.544 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                  ▁▂▄▅▅▇ ▇█▇▆▅▄▃▂▁
  ▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▂▂▂▂▃▃▄▄▅▇███████▁█████████▇▆▅▄▃▃▃▃▂▂ ▄
  7.07 ns        Histogram: frequency by time        7.12 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.6.2-pre.2
Commit ff1827d117* (2021-05-02 02:37 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
chriselrod commented 3 years ago

Note that this implementation is however much slower on CPUs that don't have the FMA instruction set (i.e., specifically the FMA instruction set, as that provides the vfmaddsub instruction). I'll eventually get LoopVectorization to do this automatically so that code can still be written in an architecture-generic manner.

Roger-luo commented 3 years ago

Thanks! This is amazing! Is there a way to detect the FMA instruction set as a global option during precompilation then? I think I could also just put this in a @static macro so that on FMA machines we get maximal performance.

chriselrod commented 3 years ago

Thanks! This is amazing! Is there a way to detect the FMA instruction set as a global option during precompilation then? I think I could also just put this in a @static macro so that on FMA machines we get maximal performance.

The approach I've been using is

using VectorizationBase
function foo(x, y, z)
   foo_maybefma(x, y, z, VectorizationBase.has_feature(Val(:x86_64_fma)))
end
function foo_maybefma(x, y, z, ::True)
    # implementation requiring FMA
end
function foo_maybefma(x, y, z, ::False)
    # implementation not requiring FMA
end

When you run using VectorizationBase, it checks whether every has_feature method is correct, and then @eval-updates any that are wrong. The motivation for this approach is that the CPU that precompiles might not be the CPU that runs the code later. The two most common examples are: 1) distributing system images, and 2) a Beowulf cluster with a network filesystem.

True and False are types defined in Static.jl, so dispatch/methods are all still resolved at compile time; if compiled into a system image, these methods should get invalidated if any has_feature they depend on gets redefined.
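For instance, a minimal sketch of that compile-time selection (fused is an illustrative name; True/False are reachable after using VectorizationBase, per the snippet above):

```julia
using VectorizationBase

fused(x, y, z, ::True)  = muladd(x, y, z)  # path used when FMA is available
fused(x, y, z, ::False) = x * y + z        # generic fallback
# has_feature returns True() or False(), so the branch is resolved when the
# method specializes, not at run time
fused(x, y, z) = fused(x, y, z, VectorizationBase.has_feature(Val(:x86_64_fma)))
```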