chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Improve auto-vectorization of `for` loops #20841

Open milthorpe opened 2 years ago

milthorpe commented 2 years ago

Chapel has the foreach construct to indicate a loop where the iterations are independent and may therefore be reordered or vectorized. However, users coming from other imperative languages are likely at first to reach for the for loop, which specifies ordered (serial) iteration. The Chapel compiler (with its llvm backend) doesn't always vectorize for loops, which may lead to a performance surprise when what appears to be a simple loop runs significantly slower for Chapel than it does for an equivalent loop in another language.

For example, the following loops in Chapel and C are intended to be equivalent:

// Chapel
proc assign3D(a: [?adom] real(32)) {
  for k in adom.dim(2) {
    a[0,0,k] = 1.0;
    a[0,1,k] = 1.0;
    a[1,0,k] = 1.0;
    a[1,1,k] = 1.0;
    a[2,0,k] = 1.0;
    a[2,1,k] = 1.0;
  }
}
// C
void assign3D(float a[3][2][N]) {
  for (size_t k=0; k<N; k++) {
    a[0][0][k] = 1.0;
    a[0][1][k] = 1.0;
    a[1][0][k] = 1.0;
    a[1][1][k] = 1.0;
    a[2][0][k] = 1.0;
    a[2][1][k] = 1.0;
  }
}

However, when compiled as follows, the C loop vectorizes, whereas the Chapel loop does not (detailed explanation below):

$ chpl --fast -g assign.chpl --mllvm "--pass-remarks=loop-vectorize --pass-remarks-analysis=loop-vectorize"
...
remark: assign.chpl:2:0: loop not vectorized: cannot prove it is safe to reorder memory operations

$ clang -g -Ofast -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize assign.c
assign.c:5:3: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
  for (size_t k=0; k<N; k++) {
  ^
assign.c:5:3: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]

It would be nice if the Chapel compiler could provide the necessary information to the llvm backend to allow auto-vectorization of simple loops, even when they're not explicitly marked as vectorizable via foreach.

Alternatively, it could be worthwhile to update the documentation to recommend choosing foreach where the order of iterations does not matter.
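For reference, the foreach version of the example above is a one-keyword change (a sketch; everything else is identical to the for version):

```chapel
// Chapel: same loop, but marked order-independent with foreach
proc assign3D(a: [?adom] real(32)) {
  foreach k in adom.dim(2) {
    a[0,0,k] = 1.0;
    a[0,1,k] = 1.0;
    a[1,0,k] = 1.0;
    a[1,1,k] = 1.0;
    a[2,0,k] = 1.0;
    a[2,1,k] = 1.0;
  }
}
```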

Detailed explanation

The reason the Chapel loop doesn't vectorize is that the llvm LoopVectorizer assumes that any of the six memory locations written to in the body of the loop may be an alias for any of the others, so it needs to add runtime checks against this possibility in order to vectorize. The vectorizer weighs the cost of the added runtime checks (15 pairwise checks between 6 locations) against the benefit of vectorization, and decides that it's not profitable to vectorize.

The LoopVectorizer uses a configurable VectorizeMemoryCheckThreshold, so we can get the loop to vectorize by increasing the threshold:

$ chpl --fast -g assign.chpl --mllvm "--pass-remarks=loop-vectorize --pass-remarks-analysis=loop-vectorize --runtime-memory-check-threshold=15"
...
remark: assign.chpl:2:0: vectorized loop (vectorization width: 8, interleaved count: 4)

However, this isn't really an ideal fix - llvm is still adding the runtime memory checks, so Chapel will still be less efficient than C in this case. I suspect that by providing the correct aliasing hints to llvm, the checks can be eliminated.

If the Chapel loop is changed from for to foreach, the loop vectorizes correctly.

Interestingly, if the loop is 're-rolled' to a simple for loop over all dimensions of the domain, it also vectorizes.

for (i,j,k) in adom {
  a[i,j,k] = 1.0;
}
kwaters4 commented 1 year ago

I have not yet encountered a situation in Chapel where unrolling provides any performance benefit, though admittedly from a small sample size. I would be curious to see an example where unrolling a loop in Chapel is beneficial. It would be nice if a "best HPC practices for Chapel" guide noted that unrolling loops can come at a performance cost. I would also be curious to hear from other users whether this is the general rule observed within Chapel.

mppf commented 1 year ago

This is related but not exactly an answer to your question. In our efforts to improve vectorization (with prototype integration with the Region Vectorizer) we noticed two benchmarks that showed noticeable speedup when using outer loop vectorization. The two benchmarks I remember in this category are clenshaw and mandelbrot. The clenshaw benchmark is here -- https://github.com/chapel-lang/chapel/issues/11163#issuecomment-422100955 and mandelbrot can be found at https://github.com/chapel-lang/chapel/blob/main/test/release/examples/benchmarks/shootout/mandelbrot-fast.chpl . I would expect that these same tests would benefit from manual loop unrolling. The issue is that the LLVM and C backends currently can only vectorize the innermost loop, and that isn't as fast as it could be.

Edit -- come to think of it, since those are best vectorized in a way that has the inner loops operating on a vector-width of elements at the same time (aka vectorizing the outer loop), unrolling the inner loop completely might help, if that's possible, but unrolling the outer loop a bit probably wouldn't help.

bradcray commented 1 year ago

Michael beat me to it, but mandelbrot is also the case that comes to mind for me where, historically, manually unrolling the loop by 2 (as in this version) has resulted in performance improvements over the equivalent not-unrolled case (where I think this version is currently the one that best approximates the starting point).

Checking the nightly perf graphs, it looks like the unrolled-by-two version runs at 0.71s or 0.46s depending on whether hyperthreading is used, while the non-unrolled version is more like 1.0s.

milthorpe commented 1 year ago

My 'unrolled' loop above is actually a toy example that was simplified from a real loop in the miniBUDE application, where different values are assigned to each array element according to the indices in the leading dimensions. (In other words, it's not really an unrolled loop.) I'm not really suggesting anyone should invest effort into making manual loop unrolling beneficial, rather that we don't want choosing for versus foreach to be a huge performance gotcha for new programmers.

e-kayrakli commented 1 year ago

Interestingly, if the loop is 're-rolled' to a simple for loop over all dimensions of the domain, it also vectorizes.

for (i,j,k) in adom {
  a[i,j,k] = 1.0;
}

@milthorpe -- I am curious; do we get the outer loop to vectorize if we write the inner portion as a loop, or does the vectorization we get actually come from this inner loop? I expect this for by itself to vectorize because of a small optimization we have for simple fors over vectorizable iterators. But I'd be positively surprised if writing the innermost logic as a loop made the outer loop vectorizable, too.

milthorpe commented 1 year ago

If I write the inner portion as a loop like this:

  for k in adom.dim(2) {
    for (i, j) in {0..2, 0..1} {
      a[i,j,k] = 1.0;
    }
  }

then the inner loop is vectorized, but the outer loop is not. (For the miniBUDE app, we really want the outer loop to vectorize.)
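Assuming the order of the k iterations doesn't matter in the real application, marking just the outer loop with foreach should get the vectorization we want here (a sketch, following the earlier observation that foreach loops vectorize):

```chapel
// Mark the outer loop order-independent; the inner loop stays serial
foreach k in adom.dim(2) {
  for (i, j) in {0..2, 0..1} {
    a[i,j,k] = 1.0;
  }
}
```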

kwaters4 commented 1 year ago

I hope I have not veered the discussion too far from the original thread. This has been insightful. I am currently struggling with how best to format my code to take this discussion into account.

@mppf With respect to the llvm backend support for outer loop vectorization, is this a feature that is in active development and will be coming soon? Please take my ignorance of compilers into account when you read and answer the question. Should I be actively unrolling outer loops or wait a release or two and see if the compiler can handle it?

e-kayrakli commented 1 year ago

@mppf With respect to the llvm backend support for outer loop vectorization, is this a feature that is in active development and will be coming soon? Please take my ignorance of compilers into account when you read and answer the question. Should I be actively unrolling outer loops or wait a release or two and see if the compiler can handle it?

@kwaters4 -- I am hopeful that we can get some support for outer loop vectorization within a release or two. As Michael alluded to in his comment, we've had some positive experience with the Region Vectorizer in the past. In the very near term, I am hoping to pick up his prototype and see if we can integrate it in a more production-ready way.