ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License

XCC-based workgroup remapping for stream-k kernels #1928

Closed AlexBrownAMD closed 2 months ago

AlexBrownAMD commented 2 months ago

By default, workgroups are scheduled in a round-robin fashion. This change adds a mode that remaps workgroups based on the XCC each one is running on, for better tile locality and an improved cache hit rate.
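For illustration, here is a minimal Python sketch of the index arithmetic such a remapping could use. It is not the code from this PR: the function name `remap_workgroup` is hypothetical, and it assumes that workgroup `wg` is dispatched to XCC `wg % num_xcc`. Under that assumption, the workgroups that share an XCC are renumbered into one contiguous range, so neighbouring tiles end up on the same XCC.

```python
def remap_workgroup(wg: int, num_wgs: int, num_xcc: int) -> int:
    """Return a remapped workgroup index (hypothetical helper, not Tensile's API).

    Assumes round-robin dispatch: workgroup `wg` runs on XCC `wg % num_xcc`.
    """
    xcc = wg % num_xcc        # XCC this workgroup is assumed to run on
    slot = wg // num_xcc      # position of this workgroup within its XCC
    base = num_wgs // num_xcc # workgroups every XCC receives
    extra = num_wgs % num_xcc # the first `extra` XCCs receive one more
    # First remapped index owned by this XCC.
    offset = xcc * base + min(xcc, extra)
    return offset + slot


# 8 workgroups on 4 XCCs: XCC 0 runs workgroups 0 and 4, which become 0 and 1.
print([remap_workgroup(wg, 8, 4) for wg in range(8)])
# prints [0, 2, 4, 6, 1, 3, 5, 7]
```

The `min(xcc, extra)` term keeps the mapping one-to-one when the workgroup count is not a multiple of the XCC count. In a kernel, the same computation needs an integer divide and remainder on the workgroup ID.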

AlexBrownAMD commented 2 months ago

We will need another PR to fix the code using scalarStaticDivideAndRemainder.

I will create a ticket to keep track of this.