Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Question about a code detail in `wmma_async_stage2.cu` #9
Closed
luliyucoordinate closed 5 months ago
Why is the for loop on this line split into two groups by `chunk_k`? https://github.com/Bruce-Lee-LY/cuda_hgemm/blob/10a8a8451f0dcd162b3790045cd7597cb48b8beb/src/wmma/wmma_async_stage2.cu#L208
There doesn't seem to be any special logic separating the two groups. Thanks 🤣