Bruce-Lee-LY / cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
MIT License
290 stars 66 forks

Question: Shared memory bank conflict. #4

Closed matrix97317 closed 10 months ago

matrix97317 commented 11 months ago

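Hi, I'd like to ask how to resolve memory bank conflicts when loading from shared memory to registers, especially with the permute method. Take a KxN matrix B with K=16, N=16. Its shared memory layout is shown below, where each data_x is 2 bytes: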

data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
...
------------------------------------------------------------ 8
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
------------------------------------------------------------ 16

I use ldmatrix.sync.aligned.x2.trans.m8n8.shared.b16 {%0, %1}, [%2];\n to load a tile of shape (16, 8), and I get 2 bank conflicts. How can these bank conflicts be resolved?

I would be very grateful for any discussion.

Bruce-Lee-LY commented 10 months ago

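The code provides a bank-conflict-free permuted loading method for ldmatrix. Note that ldmatrix performs the shared-memory-to-register (s2r) transfer in multiple passes, so you need to make sure every single pass is free of bank conflicts.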