JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

Allow unrolling an arbitrary number of loops (or at least up to 4) and improve load elimination #73

Open chriselrod opened 4 years ago

chriselrod commented 4 years ago

I've simplified the lowering code as a first step, but it still expects input in terms of unrolled and tiled. This should be replaced with a vector of integers indicating by how much each loop is unrolled (1 means not unrolled).
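
Something like the following shape, purely as a hypothetical illustration of the intended representation (the names are placeholders, not actual LoopVectorization internals):

# One unroll factor per loop, in the chosen loop order; 1 means "not unrolled".
loop_order     = [:j, :i, :jk, :ik]
unroll_factors = [4, 1, 1, 3]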

The more complicated step will be generalizing the constrained optimization problem within solve_tilesize to an arbitrary number of dimensions.

I think I'll have it handle unrolling a single loop as a special case. The solve_tilesize function tries to occupy all the registers, which wouldn't really be appropriate for a single loop.

It also needs the ability to recognize more kinds of loads that can be eliminated. In particular, it should be able to realize that if an array is accessed with both A[i] and A[i+1], unrolling i by 2 only requires a single extra load, not 2 extra loads, because the A[i+1] loaded for one unrolled iteration is the same element as the A[i] of the next.
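
For concreteness, here is a hand-written sketch of the kind of reuse meant here (illustrative only, not LoopVectorization output; the function names are made up):

function sum_adjacent(A)                  # reference: 2 loads per iteration
    s = zero(eltype(A))
    @inbounds for i in firstindex(A):lastindex(A)-1
        s += A[i] + A[i+1]
    end
    s
end

function sum_adjacent_unroll2(A)          # unrolled by 2: 3 loads per 2 iterations
    s = zero(eltype(A))
    i = firstindex(A)
    @inbounds while i + 2 <= lastindex(A)
        a0, a1, a2 = A[i], A[i+1], A[i+2] # A[i+1] is loaded once and reused
        s += a0 + a1                      # body for iteration i
        s += a1 + a2                      # body for iteration i+1
        i += 2
    end
    @inbounds while i + 1 <= lastindex(A) # scalar remainder
        s += A[i] + A[i+1]
        i += 1
    end
    s
end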

The motivating example was inspired by @timholy, although before LoopVectorization sees major benefits over the current performance, it will have to be able to recognize this kind of load reuse.

Most of the code here is also in the offset array tests.

using LoopVectorization, OffsetArrays, BenchmarkTools, Test
function old2d!(out::AbstractMatrix, A::AbstractMatrix, kern)
    rng1k, rng2k = axes(kern)
    rng1,  rng2  = axes(out)
    for j in rng2, i in rng1
        tmp = zero(eltype(out))
        for jk in rng2k, ik in rng1k
            tmp += A[i+ik,j+jk]*kern[ik,jk]
        end
        out[i,j] = tmp
    end
    out
end
function avx2douter!(out::AbstractMatrix, A::AbstractMatrix, kern)
    rng1k, rng2k = axes(kern)
    rng1,  rng2  = axes(out)
    @avx for j in rng2, i in rng1
        tmp = zero(eltype(out))
        for jk in rng2k, ik in rng1k
            tmp += A[i+ik,j+jk]*kern[ik,jk]
        end
        out[i,j] = tmp
    end
    out
end

T = Float32
A = rand(T, 100, 100);
kern = OffsetArray(rand(T, 3, 3), -1:1, -1:1);
out1 = OffsetArray(similar(A, size(A).-2), 1, 1);   # stay away from the edges of A
out2 = similar(out1); out3 = similar(out1);

@avx yields a substantial improvement:

julia> @benchmark old2d!($out1, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     64.591 μs (0.00% GC)
  median time:      64.655 μs (0.00% GC)
  mean time:        65.363 μs (0.00% GC)
  maximum time:     88.856 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2douter!($out2, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.429 μs (0.00% GC)
  median time:      3.451 μs (0.00% GC)
  mean time:        3.458 μs (0.00% GC)
  maximum time:     8.495 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @test out1 ≈ out2
Test Passed

That's a substantial improvement, but naturally we wonder whether we can do better. Let's find out what it's doing.

julia> q2d = :(for j in rng2, i in rng1
       tmp = zero(eltype(out))
       for jk in rng2k, ik in rng1k
       tmp += A[i+ik,j+jk]*kern[ik,jk]
       end
       out[i,j] = tmp
       end);

julia> lsq2d = LoopVectorization.LoopSet(q2d); LoopVectorization.choose_order(lsq2d)
([:j, :i, :jk, :ik], :ik, :j, :i, 4, 4)

This means that it orders the loops [:j, :i, :jk, :ik] (which happens to be the same order in which they were written, but this is not true in general). The next two symbols indicate that the ik and j loops are unrolled (by 4 each), while i is "vectorized". These numbers are machine-dependent, so if you don't have AVX512 you are likely to see different unroll factors (i.e., something other than 4 and 4).

The most obvious problem we may notice here is that the ik loop is only 3 iterations long, so unrolling by 4 is excessive. We can improve that with static size information:

using LoopVectorization.VectorizationBase: StaticUnitRange
struct SizedOffsetMatrix{T,LR,UR,LC,UC} <: AbstractMatrix{T}
    data::Matrix{T}
end
Base.axes(::SizedOffsetMatrix{T,LR,UR,LC,UC}) where {T,LR,UR,LC,UC} = (StaticUnitRange{LR,UR}(),StaticUnitRange{LC,UC}())
@generated function LoopVectorization.stridedpointer(A::SizedOffsetMatrix{T,LR,UR,LC,UC}) where {T,LR,UR,LC,UC}
    quote
        $(Expr(:meta,:inline))
        LoopVectorization.OffsetStridedPointer(
            LoopVectorization.StaticStridedPointer{$T,Tuple{1,$(UR-LR+1)}}(pointer(A.data)),
            ($(LR-2), $(LC-2))
        )
    end
end
Base.getindex(A::SizedOffsetMatrix, i, j) = LoopVectorization.vload(LoopVectorization.stridedpointer(A), (i,j)) # only needed to print
skern = SizedOffsetMatrix{T,-1,1,-1,1}(parent(kern));

Getting show to work off the AbstractArray interface would take some more methods.
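
A sketch of the kind of method that's missing (assuming the SizedOffsetMatrix above; there may be others, depending on how Base's printing exercises the interface):

Base.size(::SizedOffsetMatrix{T,LR,UR,LC,UC}) where {T,LR,UR,LC,UC} = (UR - LR + 1, UC - LC + 1)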

To see that this works, we'll have to inspect a LoopSet object created with type information (not from an expression). The @avx_debug macro lets us do this, but requires all the symbols in the loop to be defined (otherwise, which object's types will it be reading?)

julia> rng1,  rng2  = CartesianIndices(out1).indices;

julia> rng1k, rng2k = axes(skern);

julia> ls2dstatic = LoopVectorization.@avx_debug for j in rng2, i in rng1
       tmp = zero(eltype(out))
       for jk in rng2k, ik in rng1k
       tmp += A[i+ik,j+jk]*skern[ik,jk]
       end
       out1[i,j] = tmp
       end;
OPS = Tuple{:numericconstant,Symbol("##zero#1278"),LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01),:i,:i,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.loopvalue, 0x00, 0x02),:ik,:ik,LoopVectorization.OperationStruct(0x0000000000000004, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.loopvalue, 0x00, 0x03),:LoopVectorization,:+,LoopVectorization.OperationStruct(0x0000000000000024, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04),:j,:j,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.loopvalue, 0x00, 0x05),:jk,:jk,LoopVectorization.OperationStruct(0x0000000000000003, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.loopvalue, 0x00, 0x06),:LoopVectorization,:+,LoopVectorization.OperationStruct(0x0000000000000013, 0x0000000000000000, 0x0000000000000000, 0x0000000000000506, LoopVectorization.compute, 0x00, 0x07),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000002413, 0x0000000000000000, 0x0000000000000000, 0x0000000000000407, LoopVectorization.memload, 0x01, 0x08),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000043, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x09),:numericconstant,Symbol("##reductzero#1290"),LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000043, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x0a),:LoopVectorization,:vfmadd,LoopVectorization.OperationStruct(0x0000000000002413, 0x0000000000000043, 0x0000000000000000, 0x000000000008090a, LoopVectorization.compute, 0x00, 0x0a),:LoopVectorization,:reduce_to_add,LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000043, 0x0000000000000000, 0x0000000000000b01, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x000000000000000c, LoopVectorization.memstore, 0x03, 0x0b)}
ARF = Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000202, 0x0000000000000407, 0xffffffffffffffde),LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000403, 0xffffffffffffff80),LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000201, 0xffffffffffffff8f)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(1, LoopVectorization.IntOrFloat),(10, LoopVectorization.IntOrFloat)},Tuple{}}
LB = Tuple{UnitRange{Int64},UnitRange{Int64},StaticUnitRange{-1,1},StaticUnitRange{-1,1}}
vargs = (VectorizationBase.PackedStridedPointer{Float32,1}(Ptr{Float32} @0x000055dde45035c0, (100,)), LoopVectorization.OffsetStridedPointer{Float32,2,VectorizationBase.StaticStridedPointer{Float32,Tuple{1,3}}}(VectorizationBase.StaticStridedPointer{Float32,Tuple{1,3}}(Ptr{Float32} @0x00007fd441216e50), (-3, -3)), LoopVectorization.OffsetStridedPointer{Float32,2,VectorizationBase.PackedStridedPointer{Float32,1}}(VectorizationBase.PackedStridedPointer{Float32,1}(Ptr{Float32} @0x000055dde34cd0c0, (98,)), (0, 0)))

julia> LoopVectorization.choose_order(ls2dstatic)
([Symbol("##n#1299"), Symbol("##n#1302"), Symbol("##n#1303"), Symbol("##n#1304")], Symbol("##n#1304"), Symbol("##n#1299"), Symbol("##n#1302"), 3, 4)

It gensyms loop names, but we can see that Symbol("##n#1304") corresponds to the ik loop as before; it is now unrolled by a factor of 3, matching the length of that loop.

LoopVectorization's optimizer uses this static size information, but much of the lowering does not: it will still generate code to handle other sizes. Addressing that has been a low priority, because I've found LLVM to be very good at eliminating all the code handling other sizes.

This yields a sizeable performance improvement:

julia> @benchmark avx2douter!($out3, $A, $skern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.713 μs (0.00% GC)
  median time:      2.726 μs (0.00% GC)
  mean time:        2.731 μs (0.00% GC)
  maximum time:     5.395 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @test out1 ≈ out3
Test Passed

Yet I think that, ideally, it ought to be much more aggressive.

It's important to emphasize that on most modern CPUs, the throughput of unaligned loads is half that of fma instructions. This means that if we naively follow how the function is written, loading from both A and kern for each fma, we will spend up to 4 times longer loading memory than computing arithmetic. Loads and arithmetic don't really interfere with each other, so the time is governed largely by memory loading; anything we can do to reduce loads will help.
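
Back-of-the-envelope arithmetic behind the "up to 4 times" figure (illustrative numbers, assuming 2 fmas per cycle but only 1 unaligned vector load per cycle; actual figures vary by microarchitecture):

fmas_per_cycle  = 2
loads_per_cycle = 1                                # half the fma throughput
loads_per_fma   = 2                                # naive body: one load from A, one from kern
cycles_per_fma  = 1 / fmas_per_cycle               # 0.5
cycles_of_loads = loads_per_fma / loads_per_cycle  # 2.0
cycles_of_loads / cycles_per_fma                   # 4.0: up to 4x more time loading than computing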

Given static size information, it should completely unroll the small -1:1 loops, and then additionally unroll the other loops to save on redundant loads. Then we know for sure it is not reloading kern, and the additional unrolling of the i and j loops yields savings because many of the loads from A are shared across the unrolled iterations.

function avx2dunrolled!(out::AbstractMatrix, A::AbstractMatrix, kern::SizedOffsetMatrix{T,-1,1,-1,1}) where {T}
    rng1,  rng2  = axes(out)
    Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> kern_ik_jk = kern[ik-2,jk-2]            
    @avx for j in rng2, i in rng1
        tmp_0 = zero(eltype(out))
        Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> tmp_{ik+(jk-1)*3} = A[i+(ik-2),j+(jk-2)] * kern_ik_jk + tmp_{ik+(jk-1)*3-1}
        out[i,j] = tmp_9
    end
    out
end
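
For intuition, here is roughly what the jk = 1 slice of the two @nexprs expressions above expands to, written out by hand (a sketch, not the literal macroexpansion; the wrapper function is just a container for the statements):

function conv_column1_body(A, kern_1_1, kern_2_1, kern_3_1, i, j, tmp_0)
    tmp_1 = A[i-1, j-1] * kern_1_1 + tmp_0
    tmp_2 = A[i,   j-1] * kern_2_1 + tmp_1
    tmp_3 = A[i+1, j-1] * kern_3_1 + tmp_2
    tmp_3
end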

Benchmarking this unrolled version yields

julia> @benchmark avx2dunrolled!($out3, $A, $skern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.990 μs (0.00% GC)
  median time:      1.997 μs (0.00% GC)
  mean time:        2.001 μs (0.00% GC)
  maximum time:     3.033 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @test out1 ≈ out3
Test Passed

Another big improvement. But LoopVectorization doesn't yet realize how aggressive it can get, because it doesn't recognize that many of the loads alias. LLVM does, so it should eliminate the redundant loads and give us the performance benefit, even if we cheat to make LoopVectorization more aggressive.

function avx2dunrolled4x4!(out::AbstractMatrix, A::AbstractMatrix, kern::SizedOffsetMatrix{T,-1,1,-1,1}) where {T}
    rng1,  rng2  = axes(out)
    Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> kern_ik_jk = kern[ik-2,jk-2]
    @avx tile=(4,4) for j in rng2, i in rng1
        tmp_0 = zero(eltype(out))
        Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> tmp_{ik+(jk-1)*3} = A[i+(ik-2),j+(jk-2)] * kern_ik_jk + tmp_{ik+(jk-1)*3-1}
        out[i,j] = tmp_9
    end
    out
end

I'm specifying tile=(4,4) here, and performance looks quite good. Without AVX512, smaller numbers are likely to give you better performance. The generated assembly shows lots of chunks that look like this:

        vfmadd213ps     %zmm18, %zmm26, %zmm27 # zmm27 = (zmm26 * zmm27) + zmm18
        vfmadd213ps     %zmm18, %zmm28, %zmm21 # zmm21 = (zmm28 * zmm21) + zmm18
        vfmadd231ps     56(%r8,%rcx), %zmm10, %zmm1 # zmm1 = (zmm10 * mem) + zmm1
        vfmadd231ps     %zmm10, %zmm8, %zmm25 # zmm25 = (zmm8 * zmm10) + zmm25
        vfmadd231ps     %zmm10, %zmm19, %zmm27 # zmm27 = (zmm19 * zmm10) + zmm27
        vfmadd231ps     %zmm10, %zmm23, %zmm21 # zmm21 = (zmm23 * zmm10) + zmm21
        vfmadd231ps     60(%r8,%rcx), %zmm11, %zmm1 # zmm1 = (zmm11 * mem) + zmm1
        vfmadd231ps     %zmm11, %zmm24, %zmm25 # zmm25 = (zmm24 * zmm11) + zmm25
        vfmadd231ps     %zmm11, %zmm20, %zmm27 # zmm27 = (zmm20 * zmm11) + zmm27
        vfmadd231ps     %zmm11, %zmm22, %zmm21 # zmm21 = (zmm22 * zmm11) + zmm21
        vfmadd231ps     %zmm12, %zmm26, %zmm25 # zmm25 = (zmm26 * zmm12) + zmm25
        vfmadd231ps     %zmm12, %zmm28, %zmm27 # zmm27 = (zmm28 * zmm12) + zmm27
        vfmadd213ps     %zmm1, %zmm12, %zmm2 # zmm2 = (zmm12 * zmm2) + zmm1
        vmovups 52(%r8,%rbp), %zmm1
        vmovups 56(%r8,%rbp), %zmm29
        vmovups 60(%r8,%rbp), %zmm30
        vfmadd231ps     %zmm12, %zmm1, %zmm21 # zmm21 = (zmm1 * zmm12) + zmm21
        vfmadd231ps     %zmm13, %zmm19, %zmm25 # zmm25 = (zmm19 * zmm13) + zmm25
        vfmadd213ps     %zmm2, %zmm13, %zmm8 # zmm8 = (zmm13 * zmm8) + zmm2
        vfmadd231ps     %zmm13, %zmm23, %zmm27 # zmm27 = (zmm23 * zmm13) + zmm27
        vfmadd231ps     %zmm13, %zmm29, %zmm21 # zmm21 = (zmm29 * zmm13) + zmm21
        vfmadd231ps     %zmm14, %zmm20, %zmm25 # zmm25 = (zmm20 * zmm14) + zmm25
        vfmadd231ps     %zmm14, %zmm22, %zmm27 # zmm27 = (zmm22 * zmm14) + zmm27
        vfmadd231ps     %zmm14, %zmm30, %zmm21 # zmm21 = (zmm30 * zmm14) + zmm21
        vfmadd213ps     %zmm8, %zmm14, %zmm24 # zmm24 = (zmm14 * zmm24) + zmm8
        vfmadd213ps     %zmm24, %zmm15, %zmm26 # zmm26 = (zmm15 * zmm26) + zmm24
        vfmadd213ps     %zmm25, %zmm15, %zmm28 # zmm28 = (zmm15 * zmm28) + zmm25
        vfmadd213ps     %zmm27, %zmm15, %zmm1 # zmm1 = (zmm15 * zmm1) + zmm27
        vfmadd231ps     52(%r8,%rbx), %zmm15, %zmm21 # zmm21 = (zmm15 * mem) + zmm21
        vfmadd213ps     %zmm28, %zmm16, %zmm23 # zmm23 = (zmm16 * zmm23) + zmm28
        vfmadd213ps     %zmm26, %zmm16, %zmm19 # zmm19 = (zmm16 * zmm19) + zmm26
        vfmadd213ps     %zmm1, %zmm16, %zmm29 # zmm29 = (zmm16 * zmm29) + zmm1
        vfmadd231ps     56(%r8,%rbx), %zmm16, %zmm21 # zmm21 = (zmm16 * mem) + zmm21
        vfmadd213ps     %zmm23, %zmm17, %zmm22 # zmm22 = (zmm17 * zmm22) + zmm23
        vfmadd213ps     %zmm29, %zmm17, %zmm30 # zmm30 = (zmm17 * zmm30) + zmm29
        vfmadd231ps     60(%r8,%rbx), %zmm17, %zmm21 # zmm21 = (zmm17 * mem) + zmm21
        vfmadd213ps     %zmm19, %zmm17, %zmm20 # zmm20 = (zmm17 * zmm20) + zmm19

Mountains of vectorized arithmetic. (Note, however, that many of the vfmadd___ps instructions access memory.)

julia> @benchmark avx2dunrolled4x4!($out3, $A, $skern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.649 μs (0.00% GC)
  median time:      1.667 μs (0.00% GC)
  mean time:        1.670 μs (0.00% GC)
  maximum time:     2.788 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @test out1 ≈ out3
Test Passed

Finally, we can align the loads and stores by spacing the columns of A and out at multiples of 64 bytes (64 bytes is 16 Float32s):

julia> Av = view(similar(A, 112, 100), 1:100, :) .= A;

julia> outv = OffsetArray(view(similar(A, 112, 98), 1:98, :), 1, 1);

julia> @benchmark avx2dunrolled4x4!($outv, $Av, $skern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.543 μs (0.00% GC)
  median time:      1.553 μs (0.00% GC)
  mean time:        1.555 μs (0.00% GC)
  maximum time:     2.779 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @test out1 ≈ outv
Test Passed

That's more than 40x faster than the original version without LoopVectorization! But it'd be great if we could get this performance automatically (or at least automatically given size information), without needing both the Base.Cartesian unrolling and manually specified additional unrolling behavior on top of it.

timholy commented 4 years ago

Dang! A real tour de force. I had only just started thinking along similar lines, but had gotten nowhere close to where you've arrived. I think ImageFiltering could easily have blocks that look like

    if all(s->s <= 9, size(kern))
        wrappedkernel = SizedOffsetArray{params...}(kern)
        _imfilter!(out, img, wrappedkernel)
    else
        _imfilter!(out, img, kern)
    end

There's going to be dynamic dispatch for the Sized case but that's probably not a problem in practice unless people are filtering really tiny arrays. We could even pull out very common cases (e.g., axes are (-1:1, -1:1)) and do what is essentially manual union-splitting.

Is it worth asking whether it makes sense to just provide static size information about the first dimension of the kernel? That would make it easy to have a block like

    if axes(kern, 1) === Base.IdentityUnitRange(-1:1)
        _imfilter!(out, img, StaticSize1{-1,1}(kern))
    elseif axes(kern, 1) === Base.IdentityUnitRange(0:2)
        _imfilter!(out, img, StaticSize1{0,2}(kern))
    elseif axes(kern, 1) === Base.IdentityUnitRange(-2:0)
        _imfilter!(out, img, StaticSize1{-2,0}(kern))
    elseif axes(kern, 1) === Base.IdentityUnitRange(-2:2)
        _imfilter!(out, img, StaticSize1{-2,2}(kern))
    elseif ...
    else ...
    end

and it would never need dynamic dispatch.

My suspicion is that, given your demo above, this particular use of dynamic dispatch is not worth worrying about and we should provide static information for any reasonably-small kernel. We're talking microseconds vs nanoseconds.

chriselrod commented 4 years ago

I added benchmarks over a size range of 2:256 for out, using double precision. The unrolled @avx function was just like avx2dunrolled! above, meaning I did not add a tile= argument.

Regular/Base Julia actually performed equally well, as did all the other compilers, with the exception of ifort. I guess none of them did anything fancy (i.e., blocking).

Because, for now, the base Julia version is equally fast when completely unrolled, compiles much more quickly, and does not require any dependencies, it's worth considering. For reference, the benchmarked function was (source):

function filter2dunrolled!(out::AbstractMatrix, A::AbstractMatrix, kern::SizedOffsetMatrix{T,-1,1,-1,1}) where {T}
    rng1,  rng2  = axes(out)
    Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> kern_ik_jk = kern[ik-2,jk-2]
    @inbounds for j in rng2
        @simd ivdep for i in rng1
            tmp_0 = zero(eltype(out))
            Base.Cartesian.@nexprs 3 jk -> Base.Cartesian.@nexprs 3 ik -> tmp_{ik+(jk-1)*3} =  Base.FastMath.add_fast(Base.FastMath.mul_fast(A[i+(ik-2),j+(jk-2)], kern_ik_jk), tmp_{ik+(jk-1)*3-1})
            out[i,j] = tmp_9
        end
    end
    out
end

So an @generated function _imfilter! could see a lot of improvement as-is for all the small kernel sizes. Once I make the changes proposed in this issue, @avx should improve to be more in line with the tile= versions. Or, just using it now and specifying the tile= argument should yield better performance.
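
For illustration, a hypothetical sketch of what such a generated function could look like, building the fully unrolled body from the static axes of the SizedOffsetMatrix defined earlier (the name _imfilter_unrolled! and the overall structure are assumptions on my part, not the ImageFiltering API):

@generated function _imfilter_unrolled!(out::AbstractMatrix, A::AbstractMatrix,
                                        kern::SizedOffsetMatrix{T,LR,UR,LC,UC}) where {T,LR,UR,LC,UC}
    loads = Expr(:block)                            # hoist kern_1_1, ..., kern_3_3 outside the loops
    body  = Expr(:block, :(tmp_0 = zero(eltype(out))))
    n = 0
    for jk in LC:UC, ik in LR:UR
        kij = Symbol(:kern_, ik - LR + 1, :_, jk - LC + 1)
        push!(loads.args, :($kij = kern[$ik, $jk]))
        tprev = Symbol(:tmp_, n); n += 1; tcur = Symbol(:tmp_, n)
        push!(body.args, :($tcur = Base.FastMath.add_fast(Base.FastMath.mul_fast(A[i + $ik, j + $jk], $kij), $tprev)))
    end
    tlast = Symbol(:tmp_, n)
    quote
        rng1, rng2 = axes(out)
        $loads
        @inbounds for j in rng2
            @simd ivdep for i in rng1
                $body
                out[i, j] = $tlast
            end
        end
        out
    end
end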

I like the idea of manual union splitting here. I think it would be best to pass both axis parameters, because we want small kernels to be completely unrolled. Although, as you note, a nanoseconds-scale penalty for a dynamic dispatch that earns microseconds of savings is well worth it.

Compile times could be an issue if kernel sizes change often.

Also, for some reason, the unrolled versions are all about half a microsecond slower today (and old2d! is now 170 microseconds instead of 70), and tile=(2,2) is now the fastest (at 2.2 microseconds, or 2 when aligning columns). That's a pretty serious regression. I haven't been able to figure out why, and I haven't been able to pin down any differences, although I think I'm seeing a lot more leaq and movq instructions now than I did before.

timholy commented 4 years ago

Very interesting! I get a 2.5x performance improvement from imfilter just by switching to a StaticArray, but this gives yet another 10x improvement. Since it's such a simple optimization I might steal it for ImageFiltering now (and credit the PR to you).

chriselrod commented 4 years ago

I made a couple changes and updated the benchmarks.

I'm now using a hack to mark the arrays as not aliasing one another. This works by defining an LLVM function that returns a ptr with the noalias attribute and making sure that it is not inlined -- if it were, LLVM would eliminate it and discard the attribute. However, this means the hack introduces overhead: one non-inlined function call per array, even though the function is a no-op. In the case of dot products, which are very fast, this actually takes a significant bite out of performance. Maybe one day I'll look into what it would take to add an @restrict macro, where function foo(@restrict(a), b) #= do things =# end would add the noalias attribute to a, like the restrict keyword in C. I'd use it in the _avx_! definition, since this library makes those assumptions already.

But in the case of most of the other benchmarks, a handful of nanoseconds of overhead gets lost in the noise. For the unrolled image filtering benchmark, LLVM is now again eliminating redundant loads, restoring performance.

LaurentPlagne commented 4 years ago

I follow this work and I am totally amazed by it! A dumb question: why is a "hack" required for the restrict information? Can't it be passed explicitly to LLVM? (LLVM is used to compile C, which does have the restrict keyword.)

chriselrod commented 4 years ago

Thankfully, I found a way to remove the hack this morning. I am rerunning all the benchmarks. (When they're done, I'll issue new releases of VectorizationBase followed by SIMDPirates, and then push those benchmark results and latest changes to LoopVectorization to master.)

LLVM lets you add noalias as a function parameter attribute, which is what restrict in C does. That would be exactly what I want, but I'm pretty sure it would require a PR to Julia to implement. (Having an @restrict macro would be cool!) Currently, I cannot add attributes to Julia function parameters. I also cannot define my Julia function inside an LLVM call, and while I can call a ccallable Julia function from llvmcall, it won't inline, meaning it won't recompile with the noalias information. Plus, the ccallable requirement means there's only a single method, which means we'd need to figure out a new way of doing things.

LLVM also allows the noalias attribute to be added to function return values. This is what I did with the hack. Unfortunately, if the call is inlined, the attribute gets lost, meaning I need noinline.

What I'm using now is:

  1. noalias and alias.scope metadata. These don't seem to work with masked operations.
  2. Reordering operations to be a little better about placing stores last. load(A,i); store(B,j); load(A,i) will stop the second load from being eliminated, because the store may have changed the contents. If I instead have load(A,i); load(A,i); store(B,j), LLVM will eliminate the second load automatically (assuming there isn't another store hiding in between them); see the sketch below.
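
Here is a minimal sketch of point 2 (hand-written, not LoopVectorization output; the function names are made up):

# With the store in the middle, LLVM must assume B may alias A, so it keeps both loads:
function middle_store!(B, A, i, j)
    @inbounds begin
        x = A[i]      # first load of A[i]
        B[j] = x      # store that might alias A
        y = A[i]      # cannot be eliminated: the store may have changed A[i]
        x + y
    end
end

# With both loads issued before the store, the second load is trivially redundant
# and LLVM can fold it into the first:
function store_last!(B, A, i, j)
    @inbounds begin
        x = A[i]      # load once
        y = A[i]      # same address, no intervening store
        B[j] = x
        x + y
    end
end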

chriselrod commented 4 years ago

Latest filtering benchmarks: LoopVectorization is now as fast with a dynamically sized kernel matrix as the others are with statically sized. It's much faster when it also has a statically sized kernel.

chriselrod commented 4 years ago

LoopVectorization now achieves roughly 45-50 double precision GFLOPS on my home computer with dynamically sized kernels over the size range of about N = 40 x 40 through 220 x 220. All of the alternatives I benchmarked only got up to around 40 GFLOPS with statically sized kernels, even with manual unrolling (in which case LoopVectorization hit the 55 to 60 GFLOPS range).
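
For reference, these GFLOPS figures follow from counting one multiply and one add per kernel element per output element (a sketch of the accounting; the gflops helper is made up, and the actual benchmark harness isn't shown here):

gflops(t_seconds, out, kern) = 2 * length(out) * length(kern) / (t_seconds * 1e9)

gflops(1.543e-6, zeros(Float32, 98, 98), zeros(Float32, 3, 3))  # ≈ 112: the aligned Float32 run above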

I also used CartesianIndices with Julia when not manually unrolling:

function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @avx for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end

timholy commented 4 years ago

Sorry I have been unresponsive, I have been fairly tied up with helping manage the university's COVID-19 shutdown program and a spate of student advising.

This is unbelievably impressive!! I have to start playing with this pronto...which means it still might be a few days, but this is tremendously exciting.

chriselrod commented 4 years ago

A lot of universities are abruptly changing to online-only classes, which must be quite a transition.

> This is unbelievably impressive!! I have to start playing with this pronto...which means it still might be a few days, but this is tremendously exciting.

One of the benefits of having this as Julia code we can hack on is that we can fiddle with things until they're fast.

I haven't addressed this issue in too disciplined a way; instead I hacked on some of the code to produce the desired behavior. Thus it may need some extra fiddling to also get the desired performance with AVX(2) instead of AVX512 (which is what the computer I ran the benchmarks on has).

Some things you can try fiddling with are the maximum unrolling factor and the convolution cost factor.

The convolution cost factor tries to recognize a pair of loops where, if you unroll both of them, it can CSE loads. The correct thing to do would be to eliminate the costs of the eliminated loads -- no reciprocal throughput or register pressure. Actually modeling this correctly would require a little restructuring, and changing the solve_tilesize code to handle costs that don't simply scale in direct proportion to the unroll factors (U and T). The maximum unroll factor of 6 helped, because when it was set to 8, it would choose to unroll the convolved column loops by 1 and 8, which doesn't actually allow any loads to be eliminated. Capping it at 6 resulted in factors of 3 and 6, which does allow loads to be eliminated.

Because the convolution cost factor just decreases the load costs by some factor when both column loops are unrolled (even if "unrolled" by just 1, because the actual factors aren't determined yet when it is applied), rather than by the number of loads actually eliminated, it can choose eventual unroll factors that don't eliminate many loads at all. You may want to try different values in place of the 4 in the maxTbase = maxUbase = VectorizationBase.REGISTER_COUNT == 32 ? 6 : 4 line. I'd be inclined to keep the 6 for architectures with 32 registers per core. Using 6 regardless of the register count may work (in fact, that's what it was in v0.6.22, but I changed it while writing this comment). Note that whenever you change this, a few choose_order tests will fail. That's not a big deal; update them (the tests are there largely so I'm aware of things changing and know I should benchmark to look for possible regressions).
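
A back-of-the-envelope way to see why factors of 1 and 8 eliminate nothing while 3 and 6 do, counting only the distinct columns of A touched (a simplification of the real cost model; the helper names are made up):

# Unrolling the two convolved column loops by (U1, U2) touches columns j+jk for
# U1 consecutive j and U2 consecutive jk: U1 + U2 - 1 distinct columns versus
# U1 * U2 naive loads.
distinct_loads(U1, U2) = U1 + U2 - 1
naive_loads(U1, U2)    = U1 * U2

distinct_loads(1, 8), naive_loads(1, 8)   # (8, 8):  nothing eliminated
distinct_loads(3, 6), naive_loads(3, 6)   # (8, 18): 10 of 18 loads eliminated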

The function convolution_cost_factor uses isoptranslation to look for convolving loops that could allow loads to be eliminated. As an approximation to eliminating loads, it reduces their cost, both in reciprocal throughput and in register pressure. However, LLVM likes to unroll short, statically sized loops, which would of course inflate the register pressure beyond what solve_tilesize is estimating, leading to suboptimal decisions. So if some other loop that isn't already being unrolled by LoopVectorization is short enough that LLVM is likely to unroll it itself, convolution_cost_factor will not deflate the register pressure (but it still decreases the reciprocal throughput).

If the hacks start getting more complicated than the correct approach, it'd be high time to be correct.

timholy commented 4 years ago

> A lot of universities are abruptly changing to online-only classes, which must be quite a transition.

The far harder part in my world has been shutting down all but essential experimental research, forming committees to come up with standards for what constitutes essential and then rendering verdicts on requests to continue, and finally supporting people whose careers are being disrupted. All incredibly necessary under the circumstances, but it's a lot of work to stop all this on a dime. I am grateful that so much of my life is computational in nature, but still the majority of people who work with me do bench research.