Bruce-Lee-LY / cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
MIT License

Why does the B matrix need to be transposed? #10

Closed luliyucoordinate closed 5 months ago

luliyucoordinate commented 6 months ago

I noticed that B here uses a T (transposed) layout. Why is that? I tried an N (non-transposed) layout instead, as follows:

(screenshot of the modified code omitted)

I used a padding of 16 and a row-major B fragment. With this approach, testing wmma_async_stage3.cu on an A100 shows about a 10% performance loss. Why is that? Is there some reason behind this? 🤣
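For context, the layout choice shows up in how the `matrix_b` fragment is declared with the WMMA API. The sketch below is a minimal, hypothetical illustration (not the repo's actual kernel, and it omits shared-memory staging and the C store): with B stored transposed, both A and B tiles are read with the K dimension contiguous in memory, so global loads along k are coalesced and both operands use the same leading dimension.

```cuda
#include <mma.h>
using namespace nvcuda;

// Hypothetical sketch: A is MxK row-major, B_t is B transposed, i.e. NxK
// row-major. Reading B_t row-by-row walks along k with stride 1, the same
// access pattern as A, so both operand loads are coalesced.
__global__ void hgemm_bt_tile(const half *A, const half *B_t, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    // B_t stored row-major as (n, k) is exactly B in col_major order,
    // so the matrix_b fragment is declared col_major:
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + k, K);    // leading dimension K
        wmma::load_matrix_sync(b_frag, B_t + k, K);  // leading dimension K
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    // ... store c_frag with wmma::store_matrix_sync
}
```

With an N layout instead, consecutive k values of B are K (or N) elements apart, so the tile loads stride through memory; padding the shared-memory tile avoids bank conflicts but cannot recover the lost global-memory coalescing, which is one plausible source of the observed gap.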