ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License

Two-tile algorithm with SK after DP #1918

Closed: AlexBrownAMD closed this 4 months ago

AlexBrownAMD commented 4 months ago

Alternative implementation of the two-tile Stream-K algorithm that processes the data-parallel (DP) tiles first and the Stream-K (SK) tiles afterward. This ordering is expected to give a small performance boost.
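For readers unfamiliar with the scheme, the following is a minimal Python sketch of how a two-tile split with the DP phase first might be laid out. It follows the general description of the two-tile hybrid in the published Stream-K work, not Tensile's assembly: the function `two_tile_schedule`, its parameters, and the returned structure are all hypothetical names chosen for illustration.

```python
def two_tile_schedule(total_tiles, num_workgroups, iters_per_tile):
    """Hypothetical sketch: per-workgroup work list, DP tiles first, SK after.

    Assumption: this mirrors the generic two-tile hybrid idea only; it is not
    taken from Tensile's kernels.
    """
    full_waves, remainder = divmod(total_tiles, num_workgroups)
    if remainder == 0:
        # Tiles divide evenly across workgroups: everything is data-parallel.
        dp_tiles = total_tiles
    elif full_waves == 0:
        # Fewer tiles than workgroups: everything is Stream-K.
        dp_tiles = 0
    else:
        # Hold back one full wave plus the remainder for the SK phase, so each
        # workgroup gets between one and two tiles' worth of SK iterations.
        dp_tiles = (full_waves - 1) * num_workgroups
    sk_tiles = total_tiles - dp_tiles
    sk_iters = sk_tiles * iters_per_tile

    schedule = []
    for wg in range(num_workgroups):
        # DP phase: whole tiles, assigned round-robin.
        dp = list(range(wg, dp_tiles, num_workgroups))
        # SK phase: an even share of the remaining iterations, which may
        # start or end in the middle of a tile.
        start = wg * sk_iters // num_workgroups
        end = (wg + 1) * sk_iters // num_workgroups
        sk = [(dp_tiles + it // iters_per_tile, it % iters_per_tile)
              for it in range(start, end)]
        schedule.append({"dp_tiles": dp, "sk_iters": sk})
    return schedule
```

With 10 tiles, 4 workgroups, and 8 iterations per tile, this sketch assigns 4 tiles to the DP phase (one per workgroup) and splits the remaining 6 tiles into 12 SK iterations per workgroup. Tiles shared between workgroups in the SK phase would need their partial results combined, which the sketch does not model.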

nakajee commented 4 months ago

Is there a document that explains how this works? The behavior is difficult to understand from the asm code alone. Uploading it to the corresponding ticket, or sharing it any other way, is fine.

AlexBrownAMD commented 4 months ago

A longer description has been added to the ticket.