About the permute implementation

Bruce-Lee-LY / cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

MIT License


About the permute implementation #8

Closed feiyuvl closed 9 months ago

feiyuvl commented 9 months ago

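Hello. While reading the mma_permuted.cu source, I noticed that the permute scheme you use seems to differ from the XOR scheme described in the document DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100. As I understand it, you apply a cyclic right shift at different stages to avoid bank conflicts during ldmatrix. I am not sure whether this understanding is correct, and I would appreciate it if you could take a moment to clarify. Thanks.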

Bruce-Lee-LY commented 9 months ago

Yes, that's right.

SuperCB commented 5 months ago

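> Hello. While reading the mma_permuted.cu source, I noticed that the permute scheme you use seems to differ from the XOR scheme described in the document DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100. As I understand it, you apply a cyclic right shift at different stages to avoid bank conflicts during ldmatrix. I am not sure whether this understanding is correct, and I would appreciate it if you could take a moment to clarify. Thanks.

I am also reading this part but do not quite understand it. Could you explain it?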
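
For readers comparing the two schemes mentioned in the question, the following host-side C++ sketch may help. It is not taken from mma_permuted.cu: the tile geometry (rows of 32 halves, i.e. four 16-byte chunks per row), the function names, and both permutation formulas are assumptions chosen for illustration. It models shared memory as 32 banks of 4 bytes, so one 128-byte bank line holds eight 16-byte chunks, and counts how many of the 8 row addresses in one ldmatrix phase land in the same 16-byte bank group.

```cpp
#include <cassert>

// Assumed geometry (illustrative only, not read from the repository):
constexpr int kChunksPerRow = 4;  // 32 halves per row = 64 bytes = 4 chunks of 16 bytes
constexpr int kBankGroups = 8;    // 32 banks * 4 bytes = 128 bytes = 8 chunks of 16 bytes
constexpr int kRowsPerPhase = 8;  // one 8x8 matrix: 8 threads each address one 16-byte row

// Each function maps (row, logical chunk) to the chunk slot actually used in that row.
int no_permute(int /*row*/, int chunk) { return chunk; }
// XOR swizzle: flip chunk bits with a row-dependent mask.
int xor_permute(int row, int chunk) { return chunk ^ ((row / 2) % kChunksPerRow); }
// Cyclic right shift: rotate the chunk slot by a row-dependent amount.
int shift_permute(int row, int chunk) { return (chunk + row / 2) % kChunksPerRow; }

// Worst-case number of rows in one 8-row phase that hit the same bank group.
// 1 means conflict-free; 4 means a 4-way bank conflict.
template <typename Permute>
int max_conflict_degree(Permute permute) {
    int worst = 0;
    for (int chunk = 0; chunk < kChunksPerRow; ++chunk) {
        int hits[kBankGroups] = {};
        for (int row = 0; row < kRowsPerPhase; ++row) {
            // Linear position of the chunk in shared memory, in units of 16 bytes.
            int physical = row * kChunksPerRow + permute(row, chunk);
            int count = ++hits[physical % kBankGroups];
            if (count > worst) worst = count;
        }
    }
    return worst;
}

// Checks the behaviour described in the text under the assumed geometry.
void self_check() {
    assert(max_conflict_degree(no_permute) == 4);     // rows 0, 2, 4, 6 collide
    assert(max_conflict_degree(xor_permute) == 1);    // conflict-free
    assert(max_conflict_degree(shift_permute) == 1);  // conflict-free
}
```

Under these assumptions both permutations spread the 8 rows over all 8 bank groups; they differ only in which slot each row uses. Two consecutive 64-byte rows already cover one full 128-byte bank line, which is why the row-dependent amount is `row / 2` in both variants.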