JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

Support for 'triangular' loops #77

Open MasonProtter opened 4 years ago

MasonProtter commented 4 years ago

Suppose I have a function

using LinearAlgebra: UpperTriangular  # needed for the return value

function pairwise(a)
    ret = Matrix{eltype(a)}(undef, length(a), length(a))
    for i in 1:(length(a)-1)
        for j in (i+1):length(a)
            ret[i, j] = a[i] * a[j]
        end
    end
    UpperTriangular(ret)
end

Is there a way to represent this loop over the 'upper triangular indices' of ret with LoopVectorization.jl?

chriselrod commented 4 years ago

I plan to add support for triangular loops. When I do, I want things like Cholesky decompositions and solving triangular systems of equations to work, because these operations are important in statistics.

I'll have to think a bit about what it will take, and from there, about how difficult it will be to implement.

chriselrod commented 4 years ago

When I do, I want things like Cholesky decompositions and solving triangular systems of equations to work, because these operations are important in statistics.

These require modeling dependencies on previous iterations, which is an orthogonal issue.

Just focusing on triangular loops:

In some cases, like A / L or A / U (where L and U are lower and upper triangular, respectively), vectorization is easy aside from the self-referential (previous-iteration) dependencies (worse in the A / L case, where they force you to iterate backwards), because you'd vectorize down the rows of A.

In cases like your example, vectorization meets the triangle. For example, if it is operating block-wise, it'll need to mask all the stores on the diagonal blocks. What's the best way to handle that?

1. Don't mask, and warn users that the contents beyond the triangle are undefined. The disadvantage is that some users may want to store two triangular/symmetric matrices together; they would then need an architecture-dependent-sized gap between them, e.g., an N x (N + 8) matrix to hold two symmetric/triangular matrices (represented by the lower and upper triangles) with single precision and AVX(2), or with double precision and AVX512, instead of just N x (N + 1). The advantage is that not bothering with masks makes this aspect easier to implement, and not creating masks is faster than creating them.
2. Recalculate the masks on each iteration, e.g., for a given j, create a mask via j > i. The advantage is that we avoid the problems of solution 3, but the disadvantage is that creating masks on every iteration over a diagonal block could be slow. Still, that work is O(N), vs. the generally O(N^2)-or-worse of such matrix operations.
3. Unroll j by W, and apply W masks on each loop iteration. This lets us "licm" the mask generation, but forces us to unroll by a possibly large W, and would finally require keeping better track of element types so that LoopVectorization can be aware of W while generating code. When W is 4 this is a fine approach, but when W == 8 or W == 16 it seems excessive.
4. Try to be stupidly clever about it, optionally unrolling by some fraction of W instead, via:

julia> using VectorizationBase

julia> mask_2 = Mask{8}(0x03) # first two elements on, remaining 6 off
Mask{8,Bool}<1, 1, 0, 0, 0, 0, 0, 0>

julia> mask_2 ⊻= 0x3c
Mask{8,Bool}<1, 1, 1, 1, 1, 1, 0, 0>

julia> mask_2 ⊻= 0x3c
Mask{8,Bool}<1, 1, 0, 0, 0, 0, 0, 0>

julia> mask_2 ⊻= 0x3c
Mask{8,Bool}<1, 1, 1, 1, 1, 1, 0, 0>

julia> mask_2 ⊻= 0x3c
Mask{8,Bool}<1, 1, 0, 0, 0, 0, 0, 0>

twiddling from one mask to another for each diagonal block. E.g., if W were 8, we'd have 4 base masks, covering the cases 1(+4), 2(+4), 3(+4), and 4(+4), using xor to flip between the two versions on each iteration. This might be faster than option 2. On AVX512 there are 8 mask registers, so 4 could hold the store masks and the other 4 the xor masks that twiddle them. That's only good for W up to 8, with the advantage of letting us unroll by only 4.

If anyone wants to manually write out a few examples of possible ways of doing things so we can benchmark them on different platforms, that'd be helpful in answering what we should be doing.

Personally, I favor 1 and 2, with 3 or 4 as possible future extensions to use optionally when W is a reasonable unroll factor.
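As a rough illustration of option 2, here is a scalar sketch of how a per-column store mask for a diagonal block could be derived from the j > i comparison. The function name and lane-counting scheme are my own for illustration, not LoopVectorization internals:

```julia
# Sketch of option 2, assuming W <= 8 lanes and 0-based lane indexing.
# For a diagonal block whose first row is i0, lane l holds row i0 + l;
# a store to column j is allowed only where i0 + l < j (strict upper
# triangle), so the number of active low bits is clamp(j - i0, 0, W).
store_mask(j, i0, W) = UInt8((UInt(1) << clamp(j - i0, 0, W)) - 1)

store_mask(3, 1, 8)  # rows 1 and 2 may store into column 3 -> 0x03
store_mask(1, 1, 8)  # nothing strictly above the diagonal yet -> 0x00
store_mask(9, 1, 8)  # column past the block: all 8 lanes store -> 0xff
```

Generating one such mask per column is the O(N) cost mentioned above; options 3 and 4 amortize that cost at the price of a larger unroll.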

mdav2 commented 4 years ago

EDIT: Oh, first off, excellent work on this package. Its performance gain per unit effort is really high!

This is something I am looking forward to as well.

I've found success performing tensor contractions with explicit for loops, using @threads combined with @avx, coming within about a factor of two of a GEMM-based approach (TensorOperations.jl). In my domain (quantum chemistry) there are quite frequently symmetries within tensors that can be exploited, which require triangular loops. I have a hunch that speed could be comparable to or exceed that of GEMM if we could use all of the symmetries properly. Here is a snippet showing what I mean.

    @inbounds @fastmath Threads.@threads for b in rvir
        @avx for a in b:nvir
            for j in rocc
                for i in rocc
                    temp = zero(eltype(tiJaB_i))
                    for n in rocc
                        for m in rocc
                            temp += tiJaB_i[m,n,a,b]*Wmnij[m,n,i,j]
                        end
                    end
                    tiJaB_d_temp3[i,j,a,b] = temp
                    tiJaB_d_temp3[j,i,b,a] = temp
                end
            end
        end
    end

The lines

                    tiJaB_d_temp3[i,j,a,b] = temp
                    tiJaB_d_temp3[j,i,b,a] = temp

and the @avx for a in b:nvir make use of the symmetry T[i,j,a,b] = T[j,i,b,a]. However, redundant assignments are still being performed for the i and j pair of loops, whereas a and b have been reduced to only unique pairs.
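For a sense of scale, the remaining redundancy is easy to quantify: with the a/b loops already triangular, the full i/j loops do roughly twice the unique work. A quick sketch (the size below is an illustrative placeholder, not taken from the actual problem):

```julia
# Pairs visited by full vs. triangular loops over an index of size nocc.
nocc = 40
all_pairs    = nocc^2                 # every (i, j)
unique_pairs = binomial(nocc + 1, 2)  # i >= j only, diagonal included
redundancy   = all_pairs / unique_pairs  # approaches 2 as nocc grows
```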

chriselrod commented 4 years ago

At best, @inbounds and @fastmath are redundant with @avx.

Have you tried something like this to take advantage of symmetry:

Threads.@threads for b in rvir
    for a in b:nvir
        for j in rocc
            @avx for i in j:last(rocc)
                temp = zero(eltype(tiJaB_i))
                for n in rocc
                    for m in rocc
                        temp += tiJaB_i[m,n,a,b]*Wmnij[m,n,i,j]
                    end
                end
                tiJaB_d_temp3[i,j,a,b] = temp
                tiJaB_d_temp3[j,i,b,a] = temp
            end
        end
    end
end

or maybe

Threads.@threads for b in rvir
    for j in rocc
        @avx for a in b:nvir
            for i in j:last(rocc)
                temp = zero(eltype(tiJaB_i))
                for n in rocc
                    for m in rocc
                        temp += tiJaB_i[m,n,a,b]*Wmnij[m,n,i,j]
                    end
                end
                tiJaB_d_temp3[i,j,a,b] = temp
                tiJaB_d_temp3[j,i,b,a] = temp
            end
        end
    end
end

If so, what's the relative performance?

Also, how large are these arrays? It's easy for 4d arrays to get very large, and for memory concerns to dominate the problem (unless memory is carefully managed). But if the individual axes aren't as large, that may alleviate some of the need for blocking.

If possible, could you share the GFLOPS of the computation vs. what's theoretically possible? BLAS and BLAS-like operations should be able to hit 90%+, so that's an indicator of how far behind we are.

mdav2 commented 4 years ago

Sure, I'll do those comparisons and get back to you. I've never benchmarked something in terms of GFLOPS, do you have a recommendation as to how to do that? I suppose I could just compute the number of operations from the size of the arrays and divide by the time required for the contraction.

And these arrays are up to maybe 40x40x150x150 or something in that ballpark. Generally for production computations they will be stored on disk and read off in 2D or 3D slices. Either way, I'll be needing to do a contraction across some of the remaining indices, so this kind of investigation will still be worthwhile.

chriselrod commented 4 years ago

I suppose I could just compute the number of operations from the size of the arrays and divide by the time required for the contraction.

Yes, that's the idea. Something like this (if we don't count the operations that symmetry makes unnecessary):

2 * length(rocc)^2 * binomial(length(rocc) + 1, 2) * binomial(....

It isn't compatible with LoopVectorization yet, but I'd definitely take a look at GFlops.jl.

You'll want to compare it with

julia> FMA_factor = LoopVectorization.VectorizationBase.FMA3 ? 2 : 1 # or FMA3 + 1
2

julia> vector_width = LoopVectorization.VectorizationBase.pick_vector_width(Float64) # or whatever the data type
8

julia> fmas_per_clock = 2
2

julia> CPU_freq = 4.1 #look this up; made complicated by downclocking vs boosting. 
4.1

julia> CPU_freq * FMA_factor * vector_width * fmas_per_clock
131.2

If you can get into the BIOS, that may be your best bet for the CPU frequency.

How long does it take to read 3.6*10^7 Float64 from disk relative to performing the computations? Seems like that'd likely be a major contributor to overall runtime.

mdav2 commented 4 years ago

Definitely a major contributor. I'll be mostly focusing on smaller cases that can fit in memory.

mdav2 commented 4 years ago

For both of those modifications, I get the following error:

ERROR: LoadError: LoadError: LoadError: LoadError: MethodError: Cannot `convert` an object of type Nothing to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  ...
Stacktrace:
 [1] getloopid(::LoopVectorization.LoopSet, ::Symbol) at /home/mmd01986/.julia/packages/LoopVectorization/oYkvZ/src/graphs.jl:276
 [2] getloop at /home/mmd01986/.julia/packages/LoopVectorization/oYkvZ/src/graphs.jl:279 [inlined]

Any ideas what could be causing that?

chriselrod commented 4 years ago

My bad. Fixed on master. Also, I probably misunderstood. Those loops are triangular with respect to i and j, but I don't know if that's actually correct for your problem, and I did not write the extra stores to assign where i < j.