The GEMM unit test on the master branch fails to compile after we refactored the global-to-shared loader/store to use a 16x16 BaseTile, a change made so that it is compatible with shared memory swizzling.
The current implementation does not support storing floating-point numbers from shared to global memory.
To fix this, this PR modifies the GEMM unit test to store the GEMM output directly from registers to global memory.