`ldmatrix` can iterate along two dimensions of a tile: its rows and its columns. In the current WMMA implementation, the iteration over the k-dimension is the innermost loop of the triple loop structure used in GEMM computations.
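For reference, here is a minimal host-side sketch of that loop nest. It is only an illustration, not TiledCUDA's actual code; the tile sizes, function name, and the assumption that M, N, K are multiples of the tile sizes are hypothetical.

```cpp
#include <vector>

constexpr int kTileM = 16, kTileN = 16, kTileK = 16;

// A is M x K (row-major), B is K x N (column-major), C is M x N (row-major).
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int M, int N, int K) {
  for (int m0 = 0; m0 < M; m0 += kTileM) {      // tile rows of C
    for (int n0 = 0; n0 < N; n0 += kTileN) {    // tile columns of C
      for (int k0 = 0; k0 < K; k0 += kTileK) {  // innermost loop: the k-dimension
        // On the GPU this is where tiles of A and B are brought into
        // registers (e.g. with ldmatrix) and handed to the tensor-core MMA.
        for (int m = m0; m < m0 + kTileM; ++m)
          for (int n = n0; n < n0 + kTileN; ++n)
            for (int k = k0; k < k0 + kTileK; ++k)
              C[m * N + n] += A[m * K + k] * B[k + n * K];  // B is column-major
      }
    }
  }
}
```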
The bug arises because the current `ldmatrix` implementation always iterates over the columns of a matrix first and then over the rows. When operand B is laid out in column-major format, this causes a mismatch with WMMA's input data requirements.
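The mismatch can be seen by comparing the offset sequences that the two iteration orders produce over a small column-major tile. This is a toy illustration with made-up sizes; which order the MMA fragment actually expects depends on the fragment layout, the point is only that the two orders visit different offset sequences, so hard-coding one order breaks the column-major case.

```cpp
#include <cstdio>

int main() {
  constexpr int kRows = 4, kCols = 4;
  constexpr int kLd = kRows;  // column-major leading dimension = number of rows

  // Columns first, then rows: walks the column-major buffer contiguously.
  std::printf("columns first: ");
  for (int c = 0; c < kCols; ++c)
    for (int r = 0; r < kRows; ++r)
      std::printf("%2d ", c * kLd + r);  // 0 1 2 3 4 5 ...
  std::printf("\n");

  // Rows first, then columns: visits the same elements in a strided order.
  std::printf("rows first:    ");
  for (int r = 0; r < kRows; ++r)
    for (int c = 0; c < kCols; ++c)
      std::printf("%2d ", c * kLd + r);  // 0 4 8 12 1 5 ...
  std::printf("\n");
  return 0;
}
```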
resolve https://github.com/TiledTensor/TiledCUDA/issues/55