chapel-lang / chapel

A Productive Parallel Programming Language
https://chapel-lang.org

Support user-directed loop unrolling #22191

Open e-kayrakli opened 1 year ago

e-kayrakli commented 1 year ago

Loop unrolling is a common optimization. Lately, we have been seeing how important it is for GPU performance.

Microbenchmark for GPU:

```chapel
use Time;
use GPU;

config param innerLoopSize = 8;
config const n = 10;
config const printResult = false;
config const validateResult = true;
config const useParamUnrolled = false;

const cpu = here;

on here.gpus[0] {
  var t: stopwatch;
  var A: [0..#n] int;
  var B: [0..#n] int;

  t.start();
  if useParamUnrolled {
    foreach i in A.domain by innerLoopSize {
      for param j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
      for param j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
    }
  } else {
    foreach i in A.domain by innerLoopSize {
      for j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
      for j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
    }
  }
  t.stop();

  writeln("Elapsed time: ", t.elapsed());
  if printResult then writeln(A);

  if validateResult {
    on cpu {
      var ACpu = A;
      var BCpu = B;
      assert((+ reduce ACpu) == 2*ACpu.size);
      assert((+ reduce BCpu) == 2*BCpu.size);
    }
  }
}
```

This shows some improvement with `param` unrolling:

```
> ./unrollPerf --n=10_000_000
Elapsed time: 0.57259
> ./unrollPerf --n=10_000_000 --useParamUnrolled
Elapsed time: 0.540516
```

On a more complicated code, we are seeing a 10x improvement from loop unrolling. I've also confirmed that, in that case, changing the reference CUDA version to not unroll via `#pragma unroll 1` shows a 10x degradation, demonstrating that the improvement we are seeing comes from a fundamental GPU property rather than from `param` unrolling hiding some other issue.

Sidebar: why `param` unrolling is not ideal

`param` unrolling requires the loop to iterate over `param` bounds, and it always unrolls, fully. It is geared more toward iterating over heterogeneous tuples; more broadly, the unrolling occurs before type resolution, allowing some operations to be expressed that are otherwise not possible. In other words, it is a language feature rather than a performance optimization, though we have been using it for that purpose for lack of proper unrolling-as-performance-optimization. There are also some GPU-specific issues with `param` loops in kernels. We should address those in general anyway, but considering how crucial loop unrolling is for GPUs, we should stop relying on `param` unrolling as a performance optimization sooner rather than later. See https://github.com/chapel-lang/chapel/issues/21893 and https://github.com/chapel-lang/chapel/issues/21606.
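As a small illustration of the language-feature aspect (an example of mine, not from the issue): a `param` loop can iterate over a heterogeneous tuple precisely because the unrolling happens before type resolution, so each unrolled body is resolved with a different element type:

```chapel
var tup = (1, 2.0, "three");

// Each unrolled iteration resolves with a different element type;
// a regular (non-param) loop over this tuple would not type-check.
for param i in 0..<tup.size {
  writeln(tup(i));
}
```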

How can the user request unrolling?

An obvious way is using a new attribute:

```chapel
@unroll
for i in 0..n {

}
```

We should probably support passing some arguments to that attribute. At the very least, we could allow controlling the unroll depth.
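To make that concrete (hypothetical syntax; neither the argument form nor the defaults are settled here), an unroll-depth argument might look like:

```chapel
// Hypothetical: request an unroll factor of 4
@unroll(4)
for i in 0..n {
  A[i] += 1;
}

// Hypothetical: no argument could mean "let the compiler pick the factor"
@unroll
for i in 0..n {
  B[i] += 1;
}
```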

How should we implement this?

We don't have to decide this alongside the design above, but I think we should ask LLVM to do it for us rather than unrolling the loop in the Chapel compiler ourselves. Beyond avoiding potentially complicated implementation work, there may be LLVM heuristics we can benefit from.

How about automated unrolling?

We should consider automatically unrolling loops, at least in GPU kernels. nvcc seems to do this already, as I had to use `#pragma unroll 1` to observe non-unrolled performance. The clang documentation also describes more aggressive unrolling (and inlining) for GPU code as something they do, sometimes with huge benefits: https://llvm.org/docs/CompileCudaWithLLVM.html#optimizations. Note that some automatic unrolling probably happens today by virtue of typical LLVM optimizations, but I am not aware of Chapel doing anything specific there.

bradcray commented 1 year ago

> I think we should ask LLVM to do it for us rather than us unrolling the loop in the Chapel compiler. There may be some heuristics that we can benefit from LLVM beyond potentially complicated implementation work.

If the unrolling amount were specified by the user (which is where I'd start with this feature), is there any reason to believe LLVM unrolling the loop would inherently be better/different than Chapel doing it?

Do you imagine supporting the `@unroll` attribute for loops other than loops over ranges? (e.g., domains, arrays, user-defined iterators, zippered iterations?) Parallel (`foreach` / `forall`) loops?

e-kayrakli commented 1 year ago

> If the unrolling amount were specified by the user (which is where I'd start with this feature), is there any reason to believe LLVM unrolling the loop would inherently be better/different than Chapel doing it?

Nothing too concrete. Handwaving, but LLVM doing the unrolling could integrate better with other LLVM optimizations. We have recently seen some unexplained interactions between different optimizations in the backend compiler; though, if anything, whether those "interactions" helped or hurt was typically a coin flip.

> Do you imagine supporting the `@unroll` attribute for loops other than loops over ranges? (e.g., domains, arrays, user-defined iterators, zippered iterations?) Parallel (`foreach` / `forall`) loops?

All of the above. That said, I recognize that things may not be easy if the iterator the user asks to iterate over in an unrolled fashion is not well-behaved. I am guessing that multiple yields within a `for[each]` in the follower could pose an issue, for example.
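To illustrate that concern (a hypothetical iterator of mine, not code from this issue): when an iterator has more than one yield site, the number of yields per source iteration can be data-dependent, so there is no single loop body to replicate a fixed number of times:

```chapel
// Hypothetical iterator with two yield sites: the number of values
// yielded per `i` depends on runtime data, so a fixed unroll factor
// has no single loop body to replicate.
iter lumpy(n: int) {
  for i in 0..<n {
    yield i;
    if i % 3 == 0 then yield i;  // extra yield on some iterations
  }
}
```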

bradcray commented 1 month ago

Capturing a few more thoughts here after feeling increasingly ready for this feature recently: