NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines
Other
5.44k stars 922 forks source link

[BUG] Implicitly generate unexpected LDGSTS instructions for A100 #1231

Open cctry opened 10 months ago

cctry commented 10 months ago

Describe the bug Using DefaultCopy on A100 implicitly generates the unexpected LDGSTS. Users are not aware of the need to commit and wait.

Steps/Code to reproduce bug

using GmemTiledCopy = decltype(make_tiled_copy(
    Copy_Atom<DefaultCopy, float>{},
    Layout<Shape<_16, _16>, Stride<_16, _1>>{}, 
    Layout<Shape<_1, _4>>{}));

__global__ void kernel(float *A) {
  __shared__ float smem[16 * 64];
  Tensor gA = make_tensor(make_gmem_ptr(A), Shape<_16, _64>{}, make_stride(64, _1{}));
  Tensor sA = make_tensor(make_smem_ptr(smem), Layout<Shape<_16, _64>, Stride<_64, _1>>{});
  GmemTiledCopy gmem_tiled_copy;
  auto gmem_thr_copy = gmem_tiled_copy.get_thread_slice(threadIdx.x);
  Tensor tAgA = gmem_thr_copy.partition_S(gA);
  Tensor tAsA = gmem_thr_copy.partition_D(sA);
  copy(gmem_tiled_copy, tAgA, tAsA);
}

This sample code generates the SASS when compiled with -arch=sm_80.

    code for sm_80
        Function : _Z6kernelPf
    .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM80 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM80)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                             /* 0x00000a0000017a02 */
                                                                                      /* 0x000fc40000000f00 */
        /*0010*/                   S2R R5, SR_TID.X ;                                 /* 0x0000000000057919 */
                                                                                      /* 0x000e220000002100 */
        /*0020*/                   HFMA2.MMA R3, -RZ, RZ, 0, 2.384185791015625e-07 ;  /* 0x00000004ff037435 */
                                                                                      /* 0x000fe200000001ff */
        /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                       /* 0x0000460000047ab9 */
                                                                                      /* 0x000fe20000000a00 */
        /*0040*/                   SHF.L.U32 R2, R5.reuse, 0x2, RZ ;                  /* 0x0000000205027819 */
                                                                                      /* 0x041fe400000006ff */
        /*0050*/                   SHF.L.U32 R5, R5, 0x4, RZ ;                        /* 0x0000000405057819 */
                                                                                      /* 0x000fcc00000006ff */
        /*0060*/                   IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160] ;          /* 0x0000580002027625 */
                                                                                      /* 0x000fca00078e0003 */
        /*0070*/                   LDGSTS.E.LTC128B.128 [R5], [R2.64] ;               /* 0x0000000002057fae */
                                                                                      /* 0x000fe2000b921d44 */
        /*0080*/                   EXIT ;                                             /* 0x000000000000794d */
                                                                                      /* 0x000fea0003800000 */
        /*0090*/                   BRA 0x90;                                          /* 0xfffffff000007947 */
                                                                                      /* 0x000fc0000383ffff */
        /*00a0*/                   NOP;                                               /* 0x0000000000007918 */
                                                                                      /* 0x000fc00000000000 */
        /*00b0*/                   NOP;                                               /* 0x0000000000007918 */

Expected behavior Unless the SM80_CP_ASYNC_* is explictly specified in copy_atom, it should not generate the LDGSTS instruction.

thakkarV commented 10 months ago

@ccecka I thought of this a year and half ago, but never brought it up. We really should not auto dispatch to LDGSTS on SM80 ...

github-actions[bot] commented 9 months ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 6 months ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.