codeplaysoftware / cutlass-fork

CUDA Templates for Linear Algebra Subroutines

Cooperative prefetch #151

Closed jiyang1011 closed 2 weeks ago

jiyang1011 commented 3 weeks ago

4K GEMM performance: 320 TFLOPS when iter = ~20-30, 280 TFLOPS when iter = 100. Algorithm in XeTLA.
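For readers comparing the numbers above, here is a minimal sketch of how a GEMM TFLOPS figure is typically derived from an averaged kernel time. It assumes "4K" means M = N = K = 4096 and uses the conventional 2·M·N·K operation count; the function names `gemm_flops` and `gemm_tflops` are hypothetical and are not taken from this repository or from XeTLA.

```cpp
#include <cassert>
#include <cstdint>

// Conventional operation count for C = A * B with A (m x k) and B (k x n):
// one multiply and one add per inner-product term, i.e. 2 * m * n * k.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Throughput in TFLOPS given the average time of one GEMM in seconds.
// avg_seconds is assumed to be total elapsed time divided by the
// iteration count ("iter" in the comment above).
double gemm_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                   double avg_seconds) {
    assert(avg_seconds > 0.0);
    return gemm_flops(m, n, k) / avg_seconds / 1e12;
}
```

Under these assumptions, 320 TFLOPS corresponds to roughly 0.43 ms per 4096×4096×4096 GEMM and 280 TFLOPS to roughly 0.49 ms, so the gap between the short and long runs is about 0.06 ms per iteration on average.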