Closed mrowan137 closed 6 months ago
@mrowan137 Thank you for pointing that out. That is indeed what happened in the unroll example. Without an unroll directive, the compiler's default optimization applies an unroll factor of 128, which results in large register usage as well as lower occupancy. I've kept the same example but changed the narrative. The changes should appear in the blog soon.
PR has been merged and incorporated into the blog post. Closing this.
Problem Description
Thanks for adding the new article on reading AMDGCN ISA! One question came up while reading it: the section about loop unrolling quotes results counter to my expectations.
Usually I'd expect loop unrolling (relative to a baseline of no unrolling) to increase register pressure and potentially decrease occupancy; that is the cost paid to reduce control-flow logic, provide more opportunities for the compiler to optimize, and possibly improve caching. I did quick sanity checks of my own and saw the results I expected with #pragma unroll, i.e. register usage increases and occupancy decreases.

I believe the unintuitive results in the discussion come from comparing the absence and presence of #pragma unroll; my understanding is that in the former case the compiler can still choose to unroll. To compare with a not-unrolled case, you could compare against #pragma unroll 1. This will show the expected results, i.e. loop unrolling (compared to no unrolling) increases register usage (and potentially decreases occupancy).

I think it would be better for the reader (for educational purposes) to stick to a more intuitive example, or at least to provide some explanation for the counterintuitive behavior (if this is indeed not an error).
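To make the suggested comparison concrete, here is a minimal sketch of the three variants side by side. It is not the code from the blog post: the function and kernel names, the loop body, and the trip count N = 128 are all assumptions chosen for illustration. The loop helpers also compile as plain host C++ (where the pragma may simply be ignored), so the variants can be checked for identical results; the kernels are only built under hipcc.

```cpp
// Illustrative sketch only: names, loop body, and N are assumptions,
// not the example used in the blog post.
#include <cassert>
#include <cstddef>

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>
#define HD __host__ __device__
#else
#define HD  // plain C++ build: host only, unroll pragmas may be ignored
#endif

constexpr int N = 128;  // compile-time trip count, so full unrolling is possible

// Baseline: "#pragma unroll 1" asks the compiler not to unroll this loop.
HD inline float sum_rolled(const float* x) {
    float acc = 0.0f;
#pragma unroll 1
    for (int i = 0; i < N; ++i) acc += x[i] * x[i];
    return acc;
}

// "#pragma unroll" with no factor requests full unrolling.
HD inline float sum_unrolled(const float* x) {
    float acc = 0.0f;
#pragma unroll
    for (int i = 0; i < N; ++i) acc += x[i] * x[i];
    return acc;
}

// No directive: the compiler applies its own heuristics and may still unroll,
// so this is not a reliable "no unrolling" baseline.
HD inline float sum_default(const float* x) {
    float acc = 0.0f;
    for (int i = 0; i < N; ++i) acc += x[i] * x[i];
    return acc;
}

// Host-side sanity check: the pragmas change code generation, not results.
inline void check_variants_agree(const float* x) {
    assert(sum_rolled(x) == sum_unrolled(x));
    assert(sum_rolled(x) == sum_default(x));
}

#if defined(__HIPCC__)
// One kernel per variant so each gets its own register/occupancy report.
// Each thread reads its own contiguous chunk of N floats.
__global__ void kernel_rolled(const float* in, float* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = sum_rolled(in + static_cast<size_t>(t) * N);
}
__global__ void kernel_unrolled(const float* in, float* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = sum_unrolled(in + static_cast<size_t>(t) * N);
}
__global__ void kernel_default(const float* in, float* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = sum_default(in + static_cast<size_t>(t) * N);
}
#endif
```

Launch code is omitted. One way to compare the variants would be to compile with hipcc and --save-temps and look at the register counts and occupancy reported for each kernel in the generated ISA. I would expect kernel_rolled to be the one with the lowest register usage, but the actual numbers depend on the compiler version and target architecture.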
@seanofthemillers @CRobeck @asitav @suyashtn @nicholasmalaya
Operating System
N/A
CPU
N/A
GPU
N/A
ROCm Version
N/A
ROCm Component
No response
Steps to Reproduce
No response
Output of /opt/rocm/bin/rocminfo --support
N/A