TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License

fix(core): Fix loading of column-major register tiles for tensor core GEMM. #60

haruhi55 closed this 4 months ago

haruhi55 commented 4 months ago

Resolves https://github.com/TiledTensor/TiledCUDA/issues/55

ldmatrix can iterate along either of a tile's two dimensions. In the current WMMA implementation, the k-dimension is iterated in the innermost loop of the triple loop used for GEMM.
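For reference, the triple loop with k innermost can be sketched on the host as follows. This is only an illustration of the loop structure, not TiledCUDA code: the function name tiled_gemm is hypothetical, both operands are assumed to be row-major, and each register tile is reduced to a single float so the loop nesting stays visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of TiledCUDA.
// Each element stands in for one register tile.
std::vector<float> tiled_gemm(const std::vector<float>& a,  // M x K, row-major
                              const std::vector<float>& b,  // K x N, row-major
                              std::size_t M, std::size_t N, std::size_t K) {
    assert(a.size() == M * K && b.size() == K * N);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {          // rows of the output
        for (std::size_t j = 0; j < N; ++j) {      // columns of the output
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {  // k-dimension is innermost
                acc += a[i * K + k] * b[k * N + j];
            }
            c[i * N + j] = acc;
        }
    }
    return c;
}
```

Because k is innermost, the operand tiles for a given output tile are consumed in increasing k order, so the order in which tiles are loaded has to agree with that.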

The bug arises because the current implementation of ldmatrix always iterates over a matrix's columns first and then its rows, regardless of layout. When operand B is laid out in column-major format, this order does not match the order in which WMMA expects its input data.
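The difference between a fixed traversal and a layout-aware one can be sketched on the host. Everything here is an assumption made for illustration: the names Layout, Coord, fixed_order and layout_aware_order are hypothetical and do not come from TiledCUDA, "columns first" is read as "the column index varies fastest", and choosing the fastest-varying index from the layout is one plausible remedy rather than a description of the actual patch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names for illustration only; not TiledCUDA's API.
enum class Layout { RowMajor, ColMajor };

// (row, col) index of one sub-tile inside a register tile.
using Coord = std::pair<std::size_t, std::size_t>;

// Fixed traversal: the column index always varies fastest.
std::vector<Coord> fixed_order(std::size_t rows, std::size_t cols) {
    std::vector<Coord> order;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            order.emplace_back(r, c);
    assert(order.size() == rows * cols);
    return order;
}

// Layout-aware traversal: for a column-major tile the row index varies fastest.
std::vector<Coord> layout_aware_order(std::size_t rows, std::size_t cols,
                                      Layout layout) {
    if (layout == Layout::RowMajor) return fixed_order(rows, cols);
    std::vector<Coord> order;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            order.emplace_back(r, c);
    assert(order.size() == rows * cols);
    return order;
}
```

For a 2 x 2 grid of sub-tiles, fixed_order visits (0,0), (0,1), (1,0), (1,1), while layout_aware_order with Layout::ColMajor visits (0,0), (1,0), (0,1), (1,1). A consumer that expects the second sequence but receives the first gets its sub-tiles in the wrong positions, which is the kind of mismatch described above.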