Closed taozha2 closed 1 month ago
In pvc_gemm.cpp:
using TiledMma = TiledMMA<MMA_Atom<XE_8x16x16_F32BF16BF16F32_TT>,
Layout<Shape<_8,_2,_1>>,
Tile<Underscore,Underscore,Underscore>>; // Subgroup level-tile
This type should be designed specifically to match Copy_A and Copy_B, especially PermuteMNK.
GemmComplex launched on device to calculate the reference tensor and compare it with the result tensor. But I can guarantee they are totally the same. Now the solution is a verification on host, and the threshold is set to 0.5%.
using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
using GmemTiledCopyB = XE_2D_U16x32x32_LD_V;
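For illustration, the host-side verification with a 0.5% threshold mentioned above might look roughly like the sketch below. The helper name `verify_on_host`, the use of `std::vector<float>`, and the interpretation of the threshold (per-element relative error, with the reference magnitude floored at 1.0 so values near zero do not dominate) are assumptions for this example, not the code in this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): compares a device result against a
// host-computed reference, element by element.
// Assumption: "0.5%" means |result - reference| <= 0.005 * max(|reference|, 1).
bool verify_on_host(const std::vector<float>& reference,
                    const std::vector<float>& result,
                    float rel_tol = 0.005f) {
  if (reference.size() != result.size()) {
    return false;  // shape mismatch can never verify
  }
  for (std::size_t i = 0; i < reference.size(); ++i) {
    const float diff = std::fabs(result[i] - reference[i]);
    const float scale = std::max(std::fabs(reference[i]), 1.0f);
    // Written as !(diff <= tol) so that a NaN in either tensor fails the check.
    if (!(diff <= rel_tol * scale)) {
      return false;
    }
  }
  return true;
}
```

In a test, the result tensor would be copied back to the host and passed in alongside the reference, for example `verify_on_host(ref_host, out_host)`.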
A 4K x 4K x 4K GEMM can currently reach 240 TFLOPS.
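As a rough sanity check on that figure, throughput can be derived from the problem size and the measured kernel time. The sketch below assumes that 4K means 4096 and uses hypothetical helper names (`gemm_flops`, `gemm_tflops`); it is not the benchmark code from this PR. With 2 * 4096^3 floating-point operations, 240 TFLOPS corresponds to a kernel time of about 0.57 ms.

```cpp
#include <cstdint>

// FLOP count of an M x N x K GEMM: one multiply and one add per (m, n, k).
constexpr double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
  return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
         static_cast<double>(k);
}

// Throughput in TFLOPS for a kernel that took `elapsed_ms` milliseconds.
// Hypothetical helper; the timing itself must come from the caller.
constexpr double gemm_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                             double elapsed_ms) {
  return gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e12;
}
```

For example, `gemm_tflops(4096, 4096, 4096, t_ms)` converts a measured time `t_ms` into the number quoted above.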
@AD2605 @aacostadiaz if there is no problem, please merge it for me
I believe that throughout your tests you have been using an old definition of syclcompat::launch, passing the SubgroupSize as a template parameter rather than the new way of creating the kernel parameters list; see gemm_universal_adapter.h for reference.

I also believe that you can remove the print statements at the end of every test and remove the device_kernel.h include in every test file. This stands as a general comment for all the test files.