codeplaysoftware / cutlass-fork

CUDA Templates for Linear Algebra Subroutines

Enable CUTE APIs (Copy, MMA etc.) for Intel GPU (PVC) #131

Closed: taozha2 closed 1 month ago

AD2605 commented 2 months ago

I believe that throughout your tests you have been using an old definition of syclcompat::launch, passing the SubgroupSize as a template parameter rather than using the new way of creating the kernel parameters list; see gemm_universal_adapter.h for reference.

I also believe you can remove the print statements at the end of every test, as well as the device_kernel.h include in every test file.

This is a general comment that applies to all the test files.

jiyang1011 commented 1 month ago

In pvc_gemm.cpp:

  1. using TiledMma = TiledMMA<MMA_Atom<XE_8x16x16_F32BF16BF16F32_TT>,
                               Layout<Shape<_8,_2,_1>>,
                               Tile<Underscore,Underscore,Underscore>>; // Subgroup level-tile

    This type should be designed specifically to match Copy_A and Copy_B, especially PermuteMNK.

  2. The original verification used GemmComplex, launched on the device, to compute the reference tensor and compare it with the result tensor. But I can guarantee they are totally the same. The verification now runs on the host, with the threshold set to 0.5%.
  3. using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
    using GmemTiledCopyB = XE_2D_U16x32x32_LD_V;

    A 4K x 4K x 4K GEMM can currently reach 240 TFLOPS.
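
For readers following the verification and throughput points above, here is a minimal host-side sketch in plain C++. It is not the code from pvc_gemm.cpp: the helper names `reference_gemm_host`, `verify_relative` and `gemm_tflops` are hypothetical, the matrices are assumed to be row-major float buffers, and the 0.5% threshold is assumed to mean a per-element relative error (with a small absolute floor so values near zero do not fail spuriously). The throughput helper uses the usual 2*M*N*K operation count for a GEMM.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major reference GEMM computed on the host,
// D = alpha * A * B + beta * C, with A (M x K), B (K x N), C and D (M x N).
inline std::vector<float> reference_gemm_host(const std::vector<float>& A,
                                              const std::vector<float>& B,
                                              const std::vector<float>& C,
                                              std::size_t M, std::size_t N,
                                              std::size_t K,
                                              float alpha, float beta) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  std::vector<float> D(M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.0f;  // accumulate in float, like the F32 accumulator
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[k * N + n];
      }
      D[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
  return D;
}

// Hypothetical helper: true when every element of `result` is within
// rel_tol (default 0.5%) of `reference`, plus a small absolute floor.
// Assumption: the threshold is applied per element, as a relative error.
inline bool verify_relative(const std::vector<float>& result,
                            const std::vector<float>& reference,
                            float rel_tol = 0.005f,
                            float abs_tol = 1e-5f) {
  if (result.size() != reference.size()) return false;
  for (std::size_t i = 0; i < result.size(); ++i) {
    const float diff = std::fabs(result[i] - reference[i]);
    const float bound = rel_tol * std::fabs(reference[i]) + abs_tol;
    if (!(diff <= bound)) return false;  // also rejects NaN
  }
  return true;
}

// Throughput in TFLOPS for an M x N x K GEMM that took `seconds`,
// counting one multiply and one add per inner-product term.
inline double gemm_tflops(double M, double N, double K, double seconds) {
  return 2.0 * M * N * K / seconds / 1e12;
}
```

In use, the device result would be copied back to a host buffer and passed to `verify_relative` together with the output of `reference_gemm_host`. Under the 2*M*N*K convention, a 4096 x 4096 x 4096 problem is about 1.37e11 floating-point operations, so 240 TFLOPS corresponds to a kernel time of roughly 0.57 ms.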

jiyang1011 commented 1 month ago

@AD2605 @aacostadiaz If there are no further issues, please merge it for me.