chapel-lang / chapel

A Productive Parallel Programming Language
https://chapel-lang.org

Support user-directed loop unrolling #22191

Open e-kayrakli opened 1 year ago

e-kayrakli commented 1 year ago

Loop unrolling is a common optimization. Lately, we have been seeing how important it is for GPU performance.

Microbenchmark for GPU:

```chapel
use Time;
use GPU;

config param innerLoopSize = 8;
config const n = 10;
config const printResult = false;
config const validateResult = true;
config const useParamUnrolled = false;

const cpu = here;

on here.gpus[0] {
  var t: stopwatch;
  var A: [0..#n] int;
  var B: [0..#n] int;

  t.start();
  if useParamUnrolled {
    foreach i in A.domain by innerLoopSize {
      for param j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
      for param j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
    }
  } else {
    foreach i in A.domain by innerLoopSize {
      for j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
      for j in 0..#innerLoopSize {
        A[i+j] += 1;
        B[i+j] += 1;
      }
    }
  }
  t.stop();

  writeln("Elapsed time: ", t.elapsed());
  if printResult then writeln(A);

  if validateResult {
    on cpu {
      var ACpu = A;
      var BCpu = B;
      assert((+ reduce ACpu) == 2*ACpu.size);
      assert((+ reduce BCpu) == 2*BCpu.size);
    }
  }
}
```

This shows some improvement with `param` unrolling:

```
> ./unrollPerf --n=10_000_000
Elapsed time: 0.57259
> ./unrollPerf --n=10_000_000 --useParamUnrolled
Elapsed time: 0.540516
```

On a more complicated code, we are seeing a 10x improvement from loop unrolling. I've also confirmed that, in that case, changing the reference CUDA version to not unroll via `#pragma unroll 1` shows a 10x degradation, demonstrating that the improvement we are seeing comes from a fundamental GPU property rather than from `param` unrolling hiding some other issue.

Sidebar: why `param` unrolling is not ideal

`param` unrolling requires the loop to iterate over `param` bounds, and it always unrolls, fully. It is geared more toward iterating over heterogeneous tuples; more broadly, the unrolling occurs before type resolution, allowing some operations to be expressed that are otherwise not possible. In other words, it is a language feature rather than a performance optimization, though we have been using it for that purpose for lack of proper unrolling-as-performance-optimization. There are also some GPU-specific issues with `param` loops in kernels. We should address those in general anyway, but considering how crucial loop unrolling is for GPUs, we should stop relying on `param` unrolling as a performance optimization sooner rather than later. See https://github.com/chapel-lang/chapel/issues/21893 and https://github.com/chapel-lang/chapel/issues/21606.
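As a small illustration of the language-feature aspect (an example of mine, not from the issue): a `param` loop can iterate over a heterogeneous tuple precisely because the unrolling happens before type resolution, so each unrolled body is resolved with a different element type:

```chapel
var tup = (1, 2.0, "three");

// Each unrolled iteration resolves with a different element type;
// a regular (non-param) loop over this tuple would not type-check.
for param i in 0..<tup.size {
  writeln(tup(i));
}
```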

How can the user request unrolling?

An obvious way is using a new attribute:

```chapel
@unroll
for i in 0..n {

}
```

We should probably support passing some arguments to that attribute. At the very least, we could allow controlling the unroll depth.
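To make that concrete (hypothetical syntax; neither the argument form nor the defaults are settled here), an unroll-depth argument might look like:

```chapel
// Hypothetical: request an unroll factor of 4
@unroll(4)
for i in 0..n {
  A[i] += 1;
}

// Hypothetical: no argument could mean "let the compiler pick the factor"
@unroll
for i in 0..n {
  B[i] += 1;
}
```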

How should we implement this?

We don't have to decide this alongside the design above, but I think we should ask LLVM to do it for us rather than unrolling the loop in the Chapel compiler ourselves. Beyond avoiding potentially complicated implementation work, there may be LLVM heuristics we can benefit from.

How about automated unrolling?

We should consider automatically unrolling loops, at least in GPU kernels. nvcc seems to do this already, as I had to use `#pragma unroll 1` to observe non-unrolled performance. The clang documentation also describes more aggressive unrolling (and inlining) for GPU code as something they do, sometimes with huge benefits: https://llvm.org/docs/CompileCudaWithLLVM.html#optimizations. Note that some automatic unrolling probably happens today by virtue of typical LLVM optimizations, but I am not aware of Chapel doing anything specific there.

bradcray commented 1 year ago

> I think we should ask LLVM to do it for us rather than us unrolling the loop in the Chapel compiler. There may be some heuristics that we can benefit from LLVM beyond potentially complicated implementation work.

If the unrolling amount were specified by the user (which is where I'd start with this feature), is there any reason to believe LLVM unrolling the loop would inherently be better/different than Chapel doing it?

Do you imagine supporting the `@unroll` attribute for loops other than loops over ranges? (e.g., domains, arrays, user-defined iterators, zippered iterations?) Parallel (`foreach` / `forall`) loops?

e-kayrakli commented 1 year ago

> If the unrolling amount were specified by the user (which is where I'd start with this feature), is there any reason to believe LLVM unrolling the loop would inherently be better/different than Chapel doing it?

Nothing too concrete. Handwaving, but LLVM doing the unrolling could integrate better with other LLVM optimizations. We have recently seen some unexplained interactions between different optimizations in the backend compiler; though, if anything, whether those "interactions" helped or hurt was typically a coin flip.

> Do you imagine supporting the `@unroll` attribute for loops other than loops over ranges? (e.g., domains, arrays, user-defined iterators, zippered iterations?) Parallel (`foreach` / `forall`) loops?

All of the above. That said, I recognize that things may not be easy if the iterator the user asks to iterate over in an unrolled fashion is not well-behaved. I am guessing that multiple yields within a `for[each]` in the follower could pose an issue, for example.
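To illustrate that concern (a hypothetical iterator of mine, not code from this issue): when an iterator has more than one yield site, the number of yields per source iteration can be data-dependent, so there is no single loop body to replicate a fixed number of times:

```chapel
// Hypothetical iterator with two yield sites: the number of values
// yielded per `i` depends on runtime data, so a fixed unroll factor
// has no single loop body to replicate.
iter lumpy(n: int) {
  for i in 0..<n {
    yield i;
    if i % 3 == 0 then yield i;  // extra yield on some iterations
  }
}
```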

bradcray commented 1 month ago

Capturing a few more thoughts here after feeling increasingly ready for this feature recently: