JuliaLang / julia

The Julia Programming Language
https://julialang.org/

WIP: Extending `@simd` to support lexical forward dependences #8072

Closed. ArchRobison closed this issue 9 years ago.

ArchRobison commented 10 years ago

`@simd` currently supports only loops with completely independent iterations. However, classic vectorization also handles loops with "forward lexical dependencies", which remain vectorizable. I'd like to extend `@simd` to cover such cases. (I'm currently engaged with the LLVM community on the necessary extension to LLVM.) Doing so requires some care in retaining lexical ordering information for memory accesses.

Question: when we generate LLVM IR from a Julia loop, do we generate it in lexical order, or can some early transformation reorder memory accesses that might have a dependence?

Motivation for Allowing Forward Lexical Dependencies

The rest of this note is motivation for why supporting forward lexical dependencies is useful. Without them, some problems require twice as many passes over the data, or twice as much space. For example, consider the following FDTD kernel:

```julia
function sweep( irange, jrange, U, Vx, Vy, A, B )
    for j in jrange
        for i in irange
            @inbounds begin
                u = U[i,j]
                # Vx and Vy are updated from the not-yet-updated U...
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-u)
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-u)
                # ...and U is updated from the freshly written Vx and Vy.
                U[i,j] = u + B[i,j]*((Vx[i,j]-Vx[i,j-1]) + (Vy[i,j]-Vy[i-1,j]))
            end
        end
    end
end
```

It is vectorizable even though the iterations are not completely independent, because the loop-carried dependencies are "forward lexical dependencies" that are preserved by vectorization. Informally, the key is guaranteeing that if lexical access X precedes lexical access Y in the loop body, then for any two iterations m and n with m < n, X in iteration m precedes Y in iteration n during execution. In the example, that guarantee is necessary to ensure that each element of Vx is written before it is read, and that each element of U is read before it is written.
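To make the guarantee concrete, here is a minimal loop with a forward lexical dependence (my own illustration, not code from the kernel above). The write X lexically precedes the read Y, and the read in iteration i depends on the write in iteration i-1:

```julia
function forward_dep!(a, c, b, n)
    for i in 2:n
        a[i] = b[i] + 1      # access X: writes a[i]
        c[i] = a[i-1] * 2    # access Y: reads a[i-1], written by X in iteration i-1
    end
end
```

With vector width 4, the body becomes two vector operations: all four writes X execute before any read Y, so the write of a[i-1] in iteration i-1 still precedes its read in iteration i. If the two statements were swapped, the dependence would be lexically backward, and the same vector schedule would read stale values.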

Back in the FDTD kernel, with the current `@simd` semantics, which require completely independent loop iterations, the j loop body has to be split into two loops, one to update Vx and Vy, the other to update U (sketched below). Doing so tends to hurt performance, since it requires two passes over the data. In some other examples, the two passes require creation of a separate temporary array.
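Here is a minimal sketch of that split form (my reconstruction for illustration; the issue does not include this code), with `@simd` assumed legal on each inner loop now that its iterations are independent:

```julia
# Split form: each inner loop has completely independent iterations,
# at the cost of a second pass over the data for every j.
function sweep_split(irange, jrange, U, Vx, Vy, A, B)
    for j in jrange
        # Pass 1: update Vx and Vy from the old U.
        @simd for i in irange
            @inbounds begin
                u = U[i,j]
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-u)
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-u)
            end
        end
        # Pass 2: update U from the freshly written Vx and Vy.
        # `+=` replaces `u + ...` because pass 1 never writes U[i,j].
        @simd for i in irange
            @inbounds U[i,j] += B[i,j]*((Vx[i,j]-Vx[i,j-1]) + (Vy[i,j]-Vy[i-1,j]))
        end
    end
end
```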

ArchRobison commented 9 years ago

["Ordering point" part updated 2014-Oct-22, influenced by this C++ paper]

After some experimentation with LLVM, discussion on the LLVM mailing list, and discussion with Intel vectorization experts, I've decided to drop this experiment for several reasons.

That said, if someone takes up the issue in the future, here is my recommended roadmap: add an explicit "ordering point" notation that can be used with `@simd` loops. The semantics of the ordering point is that an iteration does not leave the ordering point until all prior iterations reach it. The ordering point is a fiction within the compiler and has no run-time cost. Teach the tool chain not to move memory accesses over such points; i.e., the point quacks like a "signal fence".

The point notation would make it obvious to code readers (human and silicon) that there is special behavior that needs to be preserved. For example, in the FDTD kernel above, the point could be put before the expression that updates U, to make clear that there is some kind of dependence on the preceding assignments that needs to be preserved.
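To illustrate, here is how the kernel might read with a hypothetical `@ordering_point` macro (the macro and its name are invented for this sketch; nothing like it exists in Julia today):

```julia
# Hypothetical notation: @ordering_point does not exist.
# Per the roadmap above, an iteration may not pass the point until all
# prior iterations have reached it, the compiler may not move memory
# accesses across it, and compiled code pays no run-time cost for it.
function sweep_ordered(irange, jrange, U, Vx, Vy, A, B)
    for j in jrange
        @simd for i in irange
            @inbounds begin
                u = U[i,j]
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-u)
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-u)
                @ordering_point  # hypothetical: fences the accesses above from the write of U below
                U[i,j] = u + B[i,j]*((Vx[i,j]-Vx[i,j-1]) + (Vy[i,j]-Vy[i-1,j]))
            end
        end
    end
end
```

Everything above the point in iteration i would then be guaranteed to execute before everything below the point in iterations i and later, which is exactly the forward-dependence guarantee the kernel relies on.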