[Issue]: Reading AMDGCN ISA: loop unrolling

mrowan137 commented 6 months ago

Problem Description

Thanks for adding the new article on reading AMDGCN ISA! One question came up while reading it: the section on loop unrolling quotes results that run counter to my expectations.

Compiler directive pragma unroll can be very effective in optimizing a kernel performance, for example, in reducing register pressure and improving occupancy. ... With pragma unroll compiler optimization, we notice about two-fold reduction in VGPRs and improved occupancy of 8 waves/SIMD (100%) from 5 waves/SIMD.

Usually I'd expect loop unrolling (relative to a baseline of no unrolling) to increase register pressure and potentially decrease occupancy. That is the price paid for reducing control-flow overhead, giving the compiler more opportunities to optimize, and possibly improving caching. I did quick sanity checks of my own and saw the results I expected with pragma unroll, i.e. register usage increased and occupancy decreased.

I believe the unintuitive results in the discussion come from comparing the absence and presence of pragma unroll; my understanding is that without the pragma, the compiler can still choose to unroll on its own. To compare against a truly not-unrolled case, you could use pragma unroll 1 as the baseline. This should show the expected results, i.e. loop unrolling (compared to no unrolling) increases register usage (and potentially decreases occupancy).
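
To make the comparison concrete, here is a minimal sketch of the three cases I have in mind. The function names and the trip count are hypothetical, and the loops are written as plain host C++ so the snippet builds with any compiler; in practice each loop would sit inside a HIP kernel, where hipcc (clang-based) honors the pragmas.

```cpp
// Sketch only: three versions of the same loop that differ solely in the
// unroll pragma. A host compiler that does not know "#pragma unroll" will
// typically just ignore it; hipcc/clang honors it.

constexpr int kTrip = 64;  // hypothetical compile-time trip count

// Case A: no pragma. The compiler is free to pick its own unroll factor,
// so this is NOT a reliable "not unrolled" baseline.
float sum_default(const float* x) {
    float acc = 0.0f;
    for (int i = 0; i < kTrip; ++i) {
        acc += x[i] * x[i];
    }
    return acc;
}

// Case B: "#pragma unroll 1" asks the compiler not to unroll the loop.
// This is the baseline to compare against.
float sum_no_unroll(const float* x) {
    float acc = 0.0f;
#pragma unroll 1
    for (int i = 0; i < kTrip; ++i) {
        acc += x[i] * x[i];
    }
    return acc;
}

// Case C: "#pragma unroll" with no count requests full unrolling when the
// trip count is known at compile time.
float sum_full_unroll(const float* x) {
    float acc = 0.0f;
#pragma unroll
    for (int i = 0; i < kTrip; ++i) {
        acc += x[i] * x[i];
    }
    return acc;
}
```

All three compute the same value; only the generated code should differ. Assuming the same `--save-temps` workflow the article uses to dump the ISA, the `.vgpr_count` reported for the kernel equivalents of cases B and C is the comparison I'd expect to show higher register usage with unrolling, while case A may land anywhere depending on what the compiler decides.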

I think the article would be improved for the reader (for educational purposes) by sticking to a more intuitive example, or at least by explaining why the counterintuitive behavior occurs (if this is indeed not an error).

@seanofthemillers @CRobeck @asitav @suyashtn @nicholasmalaya

Operating System

N/A

CPU

N/A

GPU

N/A

ROCm Version

N/A

ROCm Component

No response

Steps to Reproduce

No response

Output of /opt/rocm/bin/rocminfo --support

N/A

asitav commented 6 months ago

@mrowan137 Thank you for pointing that out. That is indeed what happened in the unroll example. Omitting the unroll directive leaves the compiler to apply its default optimization, which here used an unroll factor of 128. This results in large register usage as well as lower occupancy. I've kept the same example but updated the narrative. The changes should appear in the blog soon.

asitav commented 6 months ago

The PR has been merged and the changes are incorporated into the blog post. Closing this.
