Enable CUTE APIs (Copy, MMA etc.) for Intel GPU (PVC)

AD2605 commented 2 months ago

I believe that throughout your tests you have been using an old definition of syclcompat::launch, passing the SubgroupSize as a template parameter rather than using the new approach of creating the kernel parameters list. See gemm_universal_adapter.h here for reference.

I also believe that you can remove the print statements at the end of every test, as well as the device_kernel.h include in every test file.

This is a general comment that applies to all the test files.

jiyang1011 commented 1 month ago

In pvc_gemm.cpp:

using TiledMma = TiledMMA<MMA_Atom<XE_8x16x16_F32BF16BF16F32_TT>,
                           Layout<Shape<_8,_2,_1>>,
                           Tile<Underscore,Underscore,Underscore>>; // Subgroup level-tile

This type should be designed specifically to match Copy_A and Copy_B, especially PermuteMNK.

The original verification used GemmComplex, launched on the device, to calculate the reference tensor and compared it with the result tensor. But I can guarantee they are exactly the same. The verification is now done on the host, with the threshold set to 0.5%.
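For readers unfamiliar with this kind of check, host-side verification against a 0.5% threshold could look roughly like the sketch below. This is not the code in the PR: the helper names `host_gemm_reference` and `within_relative_tolerance`, the row-major layouts, the per-element relative comparison, and the small absolute floor `abs_tol` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: naive reference GEMM computed on the host.
// Assumes row-major A (m x k), B (k x n) and C (m x n).
std::vector<float> host_gemm_reference(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       int m, int n, int k) {
  assert(a.size() == static_cast<std::size_t>(m) * k);
  assert(b.size() == static_cast<std::size_t>(k) * n);
  std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
  for (int i = 0; i < m; ++i) {
    for (int p = 0; p < k; ++p) {
      const float a_ip = a[static_cast<std::size_t>(i) * k + p];
      for (int j = 0; j < n; ++j) {
        c[static_cast<std::size_t>(i) * n + j] +=
            a_ip * b[static_cast<std::size_t>(p) * n + j];
      }
    }
  }
  return c;
}

// Hypothetical helper: true if every element of `result` is within
// rel_tol (0.5% by default) of `reference`. The abs_tol floor is an
// assumption so that near-zero reference values do not fail spuriously.
bool within_relative_tolerance(const std::vector<float>& result,
                               const std::vector<float>& reference,
                               float rel_tol = 0.005f,
                               float abs_tol = 1e-5f) {
  assert(result.size() == reference.size());
  for (std::size_t i = 0; i < result.size(); ++i) {
    const float diff = std::fabs(result[i] - reference[i]);
    if (diff > rel_tol * std::fabs(reference[i]) + abs_tol) {
      return false;
    }
  }
  return true;
}
```

In a test, the device output would be copied back to a host vector and passed as `result`, with `reference` produced by the host GEMM on the same inputs.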

using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
using GmemTiledCopyB = XE_2D_U16x32x32_LD_V;

A 4K x 4K x 4K GEMM can currently reach 240 TFLOPS.
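As a rough guide to how a throughput figure like this is usually derived, the sketch below converts a measured kernel time into TFLOPS. It assumes the conventional count of 2 * M * N * K floating-point operations per GEMM and that 4K means 4096; the function name `gemm_tflops` is hypothetical and the timing method used in the PR is not stated here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: achieved TFLOPS for an M x N x K GEMM that took
// `seconds` to run. Assumes one multiply and one add per inner-product
// term, i.e. 2 * M * N * K floating-point operations in total.
double gemm_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                   double seconds) {
  assert(seconds > 0.0);
  const double flops = 2.0 * static_cast<double>(m) *
                       static_cast<double>(n) * static_cast<double>(k);
  return flops / seconds * 1e-12;  // 1 TFLOPS = 1e12 FLOP/s
}
```

Under these assumptions, a 4096 x 4096 x 4096 GEMM is about 0.137 TFLOP of work, so 240 TFLOPS corresponds to roughly 0.57 ms per GEMM.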

jiyang1011 commented 1 month ago

@AD2605 @aacostadiaz If there are no problems, please merge it for me.

codeplaysoftware / cutlass-fork

Enable CUTE APIs (Copy, MMA etc.) for Intel GPU (PVC) #131