JuliaLang / julia

The Julia Programming Language
https://julialang.org/

WIP: Extending `@simd` to support lexical forward dependences #8072

Closed. ArchRobison closed this issue 9 years ago.

ArchRobison commented 10 years ago

`@simd` currently supports only loops with completely independent iterations. However, classic vectorization also handles loops with "forward lexical dependencies", which remain vectorizable. I'd like to extend `@simd` to cover such cases. (I'm currently engaged with the LLVM community on the necessary extension to LLVM.) Doing so requires some care in retaining lexical ordering information for memory accesses.

Question: when we generate LLVM IR from a Julia loop, do we generate it in lexical order, or can some early transformation reorder memory accesses that might have a dependence?

Motivation for Allowing Forward Lexical Dependencies

The rest of this note is motivation for why supporting forward lexical dependencies is useful. Without them, some problems require twice as many passes over the data, or twice as much space. For example, consider the following FDTD kernel:

```julia
function sweep( irange, jrange, U, Vx, Vy, A, B )
    for j in jrange
        for i in irange
            @inbounds begin
                u = U[i,j]
                # Vx and Vy are updated from the not-yet-updated U...
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-u)
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-u)
                # ...and U is updated from the freshly written Vx and Vy.
                U[i,j] = u + B[i,j]*((Vx[i,j]-Vx[i,j-1]) + (Vy[i,j]-Vy[i-1,j]))
            end
        end
    end
end
```

It is vectorizable even though the iterations are not completely independent, because the loop-carried dependencies are "forward lexical dependencies" that are preserved by vectorization. Informally, the key is guaranteeing that if lexical access X precedes lexical access Y in the loop body, then for any two iterations m and n with m < n, X in iteration m precedes Y in iteration n during execution. In the example, that guarantee is necessary to ensure that each element of Vx is written before it is read, and that each element of U is read before it is written.
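To make the guarantee concrete, here is a minimal loop with a forward lexical dependence (my own illustration, not code from the kernel above). The write X lexically precedes the read Y, and the read in iteration i depends on the write in iteration i-1:

```julia
function forward_dep!(a, c, b, n)
    for i in 2:n
        a[i] = b[i] + 1      # access X: writes a[i]
        c[i] = a[i-1] * 2    # access Y: reads a[i-1], written by X in iteration i-1
    end
end
```

With vector width 4, the body becomes two vector operations: all four writes X execute before any read Y, so the write of a[i-1] in iteration i-1 still precedes its read in iteration i. If the two statements were swapped, the dependence would be lexically backward, and the same vector schedule would read stale values.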

Back in the FDTD kernel, with the current `@simd` semantics, which require completely independent loop iterations, the j loop body has to be split into two loops, one to update Vx and Vy, the other to update U (sketched below). Doing so tends to hurt performance, since it requires two passes over the data. In some other examples, the two passes require creation of a separate temporary array.
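Here is a minimal sketch of that split form (my reconstruction for illustration; the issue does not include this code), with `@simd` assumed legal on each inner loop now that its iterations are independent:

```julia
# Split form: each inner loop has completely independent iterations,
# at the cost of a second pass over the data for every j.
function sweep_split(irange, jrange, U, Vx, Vy, A, B)
    for j in jrange
        # Pass 1: update Vx and Vy from the old U.
        @simd for i in irange
            @inbounds begin
                u = U[i,j]
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-u)
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-u)
            end
        end
        # Pass 2: update U from the freshly written Vx and Vy.
        # `+=` replaces `u + ...` because pass 1 never writes U[i,j].
        @simd for i in irange
            @inbounds U[i,j] += B[i,j]*((Vx[i,j]-Vx[i,j-1]) + (Vy[i,j]-Vy[i-1,j]))
        end
    end
end
```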

ArchRobison commented 9 years ago

["Ordering point" part updated 2014-Oct-22, influenced by this C++ paper]

After some experimentation with LLVM, discussion on the LLVM mailing list, and discussion with Intel vectorization experts, I've decided to drop this experiment for several reasons.

That said, if someone takes up the issue in the future, here is my recommended roadmap: add an explicit "ordering point" notation that can be used with `@simd` loops. The semantics of the ordering point is that an iteration does not leave the ordering point until all prior iterations reach it. The ordering point is a fiction within the compiler and has no run-time cost. Teach the tool chain not to move memory accesses over such points; i.e., the point quacks like a "signal fence".

The point notation would make it obvious to code readers (human and silicon) that there is special behavior that needs to be preserved. For example, in the FDTD kernel above, the point could be put before the expression that updates U, to make clear that there is some kind of dependence on the preceding assignments that needs to be preserved.
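To illustrate, here is how the kernel might read with a hypothetical `@ordering_point` macro (the macro and its name are invented for this sketch; nothing like it exists in Julia today):

```julia
# Hypothetical notation: @ordering_point does not exist.
# Per the roadmap above, an iteration may not pass the point until all
# prior iterations have reached it, the compiler may not move memory
# accesses across it, and compiled code pays no run-time cost for it.
function sweep_ordered(irange, jrange, U, Vx, Vy, A, B)
    for j in jrange
        @simd for i in irange
            @inbounds begin
                u = U[i,j]
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-u)
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-u)
                @ordering_point  # hypothetical: fences the accesses above from the write of U below
                U[i,j] = u + B[i,j]*((Vx[i,j]-Vx[i,j-1]) + (Vy[i,j]-Vy[i-1,j]))
            end
        end
    end
end
```

Everything above the point in iteration i would then be guaranteed to execute before everything below the point in iterations i and later, which is exactly the forward-dependence guarantee the kernel relies on.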