milthorpe opened 2 years ago
I have not yet encountered a situation in Chapel where unrolling provides any performance benefit, though admittedly from a small sample size. I would be curious to see an example where unrolling a loop in Chapel is beneficial. It would be nice for a "best HPC practices for Chapel" guide to note that unrolling loops can come at a performance cost. I would also be curious to hear from other users whether this is the general rule observed within Chapel.
This is related but not exactly an answer to your question. In our efforts to improve vectorization (with prototype integration with the Region Vectorizer) we noticed two benchmarks that showed noticeable speedup when using outer loop vectorization. The two benchmarks I remember in this category are clenshaw and mandelbrot. The clenshaw benchmark is here -- https://github.com/chapel-lang/chapel/issues/11163#issuecomment-422100955 and mandelbrot can be found at https://github.com/chapel-lang/chapel/blob/main/test/release/examples/benchmarks/shootout/mandelbrot-fast.chpl . I would expect that these same tests would benefit from manual loop unrolling. The issue is that the LLVM and C backends currently can only vectorize the innermost loop and that isn't as fast as it could be.
Edit -- come to think of it, since those are best vectorized in a way that contains inner loops running on vector-width things at the same time (aka vectorizing the outer loop), unrolling the inner loop completely might help, if it is possible, but unrolling the outer loop a bit probably wouldn't help.
Michael beat me to it, but mandelbrot was also the case that comes to mind for me where, historically, manually unrolling the loop by 2 as in this version has resulted in performance improvements over the equivalent not-unrolled case (where I think this version is currently the one that best approximates the starting point).
Checking the nightly perf graphs it looks like the unrolled-by-two version runs at 0.71s or 0.46s depending on whether hyperthreading is used, while the non-unrolled version is more like 1.0s.
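For readers unfamiliar with the technique: manually unrolling by two means processing two iterations' worth of work per loop pass. A minimal sketch (illustrative only -- this is not the mandelbrot-fast.chpl kernel, and the names are made up):

```chapel
// Hypothetical sketch: summing an array with the loop manually unrolled by 2.
// Two independent accumulators break the serial dependence chain.
// Assumes n is even for simplicity.
config const n = 1024;
var a: [0..#n] real = 1.0;

var sum0, sum1 = 0.0;
for i in 0..#n by 2 {
  sum0 += a[i];    // even-indexed elements
  sum1 += a[i+1];  // odd-indexed elements
}
const sum = sum0 + sum1;
```

Whether this helps in practice depends on whether the backend would have vectorized the simple loop anyway, which is exactly the question in this thread.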
My 'unrolled' loop above is actually a toy example that was simplified from a real loop in the miniBUDE application, where different values are assigned to each array element according to the indices in the leading dimensions. (In other words, it's not really an unrolled loop.) I'm not really suggesting anyone should invest effort into making manual loop unrolling beneficial, rather that we don't want choosing `for` versus `foreach` to be a huge performance gotcha for new programmers.
Interestingly, if the loop is 're-rolled' to a simple `for` loop over all dimensions of the domain, it also vectorizes:

```chapel
for (i,j,k) in adom { a[i,j,k] = 1.0; }
```
@milthorpe -- I am curious; do we get the outer loop to vectorize if we write the inner portion as a loop, or is the vectorization we get actually from this inner loop? I expect this `for` by itself to vectorize because of a small optimization we have for simple `for`s over vectorizable iterators. But I'd be positively surprised if writing the innermost logic as a loop makes the outer loop vectorizable, too.
If I write the inner portion as a loop like this:

```chapel
for k in adom.dim(2) {
  for (i, j) in {0..2, 0..1} {
    a[i,j,k] = 1.0;
  }
}
```
then the inner loop is vectorized, but the outer loop is not. (For the miniBUDE app, we really want the outer loop to vectorize.)
I hope I have not veered the discussion too far from the original thread. This has been insightful. I am currently struggling with how best to structure my code to take this discussion into account.
@mppf With respect to the llvm backend support for outer loop vectorization, is this a feature that is in active development and will be coming soon? Please take my ignorance of compilers into account when you read and answer the question. Should I be actively unrolling outer loops or wait a release or two and see if the compiler can handle it?
@kwaters4 -- I am hopeful that we can get some support for outer loop vectorization within a release or two. As Michael alluded in his comment, we've had some positive experience with Region Vectorizer in the past. In the very near term, I am hoping to pick up his prototype and see if we can integrate it in a more production-ready way.
Chapel has the `foreach` construct to indicate a loop whose iterations are independent and may therefore be reordered or vectorized. However, users coming from other imperative languages are likely at first to reach for the `for` loop, which specifies ordered (serial) iteration. The Chapel compiler (with its llvm backend) doesn't always vectorize `for` loops, which may lead to a performance surprise when what appears to be a simple loop runs significantly slower in Chapel than an equivalent loop in another language.

For example, the following loops in Chapel and C are intended to be equivalent:
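A minimal sketch of the kind of loop under discussion, assuming (hypothetically) several distinct array writes per iteration as in the detailed explanation below -- the array names here are illustrative, not the original benchmark code:

```chapel
// Hypothetical sketch: a serial for loop writing six distinct arrays
// per iteration. The arrays b0..b5 and size n are assumptions.
config const n = 1000;
var b0, b1, b2, b3, b4, b5: [0..#n] real;

for i in 0..#n {
  b0[i] = 0.0; b1[i] = 1.0; b2[i] = 2.0;
  b3[i] = 3.0; b4[i] = 4.0; b5[i] = 5.0;
}
```

With six separately-declared arrays, a C compiler can typically prove the stores don't alias, while (as described below) the llvm backend for Chapel conservatively assumes they might.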
However, when compiled as follows, the C loop vectorizes, whereas the Chapel loop does not (detailed explanation below):
It would be nice if the Chapel compiler could provide the necessary information to the llvm backend to allow auto-vectorization of simple loops, even when they're not explicitly marked as vectorizable via `foreach`. Alternatively, it could be worthwhile to update the documentation to recommend choosing `foreach` where the order of iterations does not matter.

Detailed explanation
The reason the Chapel loop doesn't vectorize is that the llvm LoopVectorizer assumes that any of the six memory locations written to in the body of the loop may be an alias for any of the others, so it needs to add runtime checks against this possibility in order to vectorize. The vectorizer weighs the cost of the added runtime checks (15 pairwise checks between 6 locations) against the benefit of vectorization, and decides that it's not profitable to vectorize.
The LoopVectorizer uses a configurable `VectorizeMemoryCheckThreshold`, so we can get the loop to vectorize by increasing the threshold. However, this isn't really an ideal fix: llvm is still adding the runtime memory checks, so Chapel will still be less efficient than C in this case. I suspect that by providing the correct aliasing hints to llvm, the checks can be eliminated.
If the Chapel loop is changed from `for` to `foreach`, the loop vectorizes correctly. Interestingly, if the loop is 're-rolled' to a simple `for` loop over all dimensions of the domain, it also vectorizes.
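For reference, the `for`-to-`foreach` change is just a one-keyword swap; a minimal sketch (the array and bounds here are illustrative):

```chapel
// Sketch: foreach asserts that iterations are order-independent,
// permitting the compiler to reorder or vectorize them.
config const n = 1000;
var a: [0..#n] real;

foreach i in 0..#n {
  a[i] = i : real;
}
```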