e-kayrakli opened this issue 1 year ago
I think we should ask LLVM to do the unrolling for us rather than unrolling the loop in the Chapel compiler ourselves. Beyond avoiding potentially complicated implementation work, there may be LLVM heuristics we can benefit from.
If the unrolling amount were specified by the user (which is where I'd start with this feature), is there any reason to believe LLVM unrolling the loop would inherently be better/different than Chapel doing it?
Do you imagine supporting the `@unroll` attribute for loops other than loops over ranges? (e.g., domains, arrays, user-defined iterators, zippered iterations?) Parallel (`foreach` / `forall`) loops?
> If the unrolling amount were specified by the user (which is where I'd start with this feature), is there any reason to believe LLVM unrolling the loop would inherently be better/different than Chapel doing it?
Nothing too concrete. Handwaving, but LLVM doing the unrolling can integrate better with other LLVM optimizations. We have recently seen some unexplained interactions between different optimizations in the backend compiler. Though, admittedly, whether the benefits from those "interactions" were good or bad was typically a coin flip.
> Do you imagine supporting the `@unroll` attribute for loops other than loops over ranges? (e.g., domains, arrays, user-defined iterators, zippered iterations?) Parallel (`foreach` / `forall`) loops?
All of the above. But while saying that, I recognize that things may not be very easy if the iterator for which the user requests unrolled iteration is not a well-behaved one. I am guessing that multiple `yield`s within a `for[each]` in the follower could pose an issue, for example.
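To illustrate the concern with a hypothetical sketch (not from the discussion above, and using a serial iterator rather than a follower for brevity): when an iterator yields more than once per iteration of its own internal loop, there is no single loop body that an unroller can simply replicate N times.

```chapel
// Hypothetical example: two yields per iteration of the internal loop,
// so a loop driven by this iterator has no obvious
// "body times unroll-factor" rewrite.
iter pairs(n: int) {
  for i in 0..#n {
    yield 2*i;      // first yield
    yield 2*i + 1;  // second yield in the same iteration
  }
}

for x in pairs(3) do
  writeln(x);
```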
Capturing a few more thoughts here after feeling increasingly ready for this feature recently:
Loop unrolling is a common optimization, and lately we have been seeing its importance for GPU performance.
Microbenchmark for GPU
```chapel
use Time;
use GPU;

config param innerLoopSize = 8;
config const n = 10;
config const printResult = false;
config const validateResult = true;
config const useParamUnrolled = false;

const cpu = here;
on here.gpus[0] {
  var t: stopwatch;
  var A: [0..#n] int;
  var B: [0..#n] int;
  t.start();
  if useParamUnrolled {
    foreach i in A.domain by innerLoopSize {
      for param j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
      for param j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
    }
  } else {
    foreach i in A.domain by innerLoopSize {
      for j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
      for j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
    }
  }
  t.stop();
  writeln("Elapsed time: ", t.elapsed());
  if printResult then writeln(A);
  if validateResult {
    on cpu {
      var ACpu = A;
      var BCpu = B;
      assert((+ reduce ACpu) == 2*ACpu.size);
      assert((+ reduce BCpu) == 2*BCpu.size);
    }
  }
}
```

This microbenchmark shows some improvement with `param` unrolling:

```
> ./unrollPerf --n=10_000_000
Elapsed time: 0.57259
> ./unrollPerf --n=10_000_000 --useParamUnrolled
Elapsed time: 0.540516
```

On a more complicated code, we are seeing a 10x improvement from loop unrolling. I've also confirmed that in that case, changing the reference CUDA version to not unroll via `#pragma unroll 1` shows a 10x degradation, proving that the improvement we are seeing is a fundamental GPU property rather than `param` unrolling hiding some other issue.

Sidebar: why `param` unrolling is not ideal

`param` unrolling requires iterating over `param` bounds, and it always unrolls, fully. It is geared more toward iterating over heterogeneous tuples; more broadly, the unrolling occurs before type resolution, allowing some operations to be expressed that are otherwise not possible. In other words, it is more of a language feature than a performance optimization, though we have been using it for that purpose for lack of a proper unrolling-as-performance-optimization. There are some GPU-specific issues with `param` loops in kernels as well. We should address those in general anyway, but considering how crucial loop unrolling is for GPUs, we should stop relying on `param` unrolling as a performance optimization sooner rather than later. See https://github.com/chapel-lang/chapel/issues/21893 and https://github.com/chapel-lang/chapel/issues/21606.

How can the user request unrolling?
An obvious way is using a new attribute. We should probably support passing some arguments to that attribute; at the very least, we can allow controlling the unroll depth.
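For concreteness, a sketch of what that could look like. The attribute name, its argument, and its placement here are a proposal for discussion, not implemented syntax:

```chapel
// Hypothetical syntax: ask the compiler to unroll this loop 4 times.
@unroll(4)
for i in 0..#n do
  A[i] += B[i];

// Possibly also a bare form that leaves the depth to the compiler:
@unroll
foreach i in A.domain do
  A[i] *= 2;
```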
How should we implement this?
We don't have to decide this alongside the design above, but I think we should ask LLVM to do the unrolling for us rather than unrolling the loop in the Chapel compiler ourselves. Beyond avoiding potentially complicated implementation work, there may be LLVM heuristics we can benefit from.
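For reference, "asking LLVM" would typically mean attaching loop metadata rather than transforming the IR ourselves: the branch that forms the loop backedge carries `!llvm.loop` metadata, and LLVM's unrolling pass honors hints such as `llvm.loop.unroll.count`. A sketch (the surrounding IR is illustrative; the metadata names are the ones LLVM documents):

```llvm
; Illustrative IR: the backedge branch carries !llvm.loop metadata
; requesting a specific unroll count from LLVM's loop unroller.
loop:
  ; ... loop body ...
  br i1 %cond, label %loop, label %exit, !llvm.loop !0

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.unroll.count", i32 4}
```

The Chapel compiler would only need to emit this metadata on loops tagged with the attribute, leaving the actual transformation (and its cost modeling) to LLVM.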
How about automated unrolling?
We should consider automatically unrolling loops at least in GPU kernels.
`nvcc` seems to be doing that, as I had to use `#pragma unroll 1` to observe non-unrolled performance. The `clang` documentation also refers to more aggressive unrolling (and inlining) for GPU code as something they do, sometimes with huge benefits: https://llvm.org/docs/CompileCudaWithLLVM.html#optimizations. Note that some automated unrolling probably happens today by virtue of typical LLVM optimizations, but I am not aware of Chapel doing anything specific there.