NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] SmemCopyAtom and MMA_Atom for fp32? #1842

Open vickyandpiggy opened 2 weeks ago

vickyandpiggy commented 2 weeks ago

What is your question? Hello, I am developing a full-precision attention backward kernel using CUTLASS, and I am stuck on how to use the ldmatrix and mma instructions for fp32.

My GEMM calculation is based on fp32 matrices, i.e. the datatypes of D/A/B/C are all fp32. But the structs provided in mma_sm80.hpp take half-precision or mixed-precision inputs, so I am confused about how to do this correctly in full precision. Here is my current setting for the MMA, smem, and gmem. Is there a way to use SM75_U32x4_LDSM_N and one of the mma instructions in my case?

  // MMA: SIMT FMA atom tiled over 16x8x1 threads (128 threads, one FMA each)
  using TiledMma = TiledMMA<MMA_Atom<UniversalFMA<float, float, float>>,
                            Layout<Shape<Int<16>, Int<8>, _1>>>;

  // Smem: 16x32 row-major fp32 tile, swizzled to reduce bank conflicts
  using SmemLayoutAtom = decltype(
    composition(Swizzle<3,3,3>{},
                Layout<Shape < _16,_32>,
                       Stride<_32, _1>>{}));
  using SmemCopyAtom = Copy_Atom<DefaultCopy, float>;  // plain ld.shared

  // Gmem: 128-bit cp.async; 16x8 threads, each copying 4 contiguous floats
  using GmemTiledCopy = decltype(
    make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<cute::uint128_t>, float>{},
                    Layout<Shape <_16,_8>,
                           Stride< _8,_1>>{},
                    Layout<Shape < _1,_4>>{}));
vickyandpiggy commented 5 days ago

Could anyone help? Many thanks!

vickyandpiggy commented 5 days ago

It seems to me that the mma instructions do not support fp32 for multiplicands A/B, per https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-data-types. So can I use ldmatrix alone to accelerate the copy from smem to registers? Or is there a better practice for full precision?

cloudhan commented 4 days ago

NVIDIA tensor cores do not natively support fp32 A/B inputs, so that is not possible.

Alternatives are:

  1. Use the .tf32 variants and accept the reduced precision (see the sketch below).
  2. If the reduced precision is not acceptable, try https://arxiv.org/pdf/2203.03341.
  3. Try AMD cards.

The ldmatrix instruction is also tightly coupled to the tensor core fragment layout; maybe you can use the .tf32 version if the layouts match. But it may not speed up your matrix loading: you still need to take care of L2-friendly prefetching and eliminate bank conflicts manually. It is all about how to use cute correctly and performantly.
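
For option 1, a minimal sketch of what the swap could look like in CuTe, assuming the SM80 TF32 atom SM80_16x8x8_F32TF32TF32F32_TN from cute/arch/mma_sm80.hpp and a 2x2x1 atom layout to keep your 128 threads; treat this as illustrative, not a drop-in:

  // Reduced-precision path (sketch): tf32 inputs, fp32 accumulation.
  // SM80_16x8x8_F32TF32TF32F32_TN computes D(f32) = A(tf32) * B(tf32) + C(f32).
  using TiledMma = TiledMMA<MMA_Atom<SM80_16x8x8_F32TF32TF32F32_TN>,
                            Layout<Shape<_2, _2, _1>>>;  // 4 warps = 128 threads

  // With 32-bit fragments, the u32 ldmatrix atom lines up with the MMA layout.
  using SmemCopyAtom   = Copy_Atom<SM75_U32x4_LDSM_N, float>;
  using SmemTiledCopyA = decltype(make_tiled_copy_A(SmemCopyAtom{}, TiledMma{}));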

vickyandpiggy commented 4 days ago

> NVIDIA tensor cores do not natively support fp32 A/B inputs, so that is not possible. [...]

Thank you so much for the suggestions! I got it :)

thakkarV commented 4 days ago

> The ldmatrix instruction is also tightly coupled to the tensor core fragment layout

This is not necessarily true. You can in principle copy to arbitrary layouts using LDSM provided the partitioning is valid.

> NVIDIA tensor cores do not natively support fp32 A/B inputs

This is also irrelevant: @vickyandpiggy is trying to use the SIMT cores for the matmul itself. In that case, you can still totally use LDSM, provided the smem layout is legal to partition with LDSM.
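
A hedged sketch of that idea against the FMA-based TiledMma from the question: make_tiled_copy_A builds the smem-to-register copy from the MMA's A-operand partitioning, and whether LDSM can service the FMA fragments depends on the value layout the tiled MMA produces.

  // Attempt to feed the SIMT FMA fragments with ldmatrix (sketch).
  // If the tiled MMA's register layout is not reachable by LDSM,
  // this composition should fail at compile time rather than miscompute.
  using SmemCopyAtom   = Copy_Atom<SM75_U32x4_LDSM_N, float>;
  using SmemTiledCopyA = decltype(make_tiled_copy_A(SmemCopyAtom{}, TiledMma{}));

  // Always-valid fallback: plain ld.shared through the universal copy atom.
  using SmemCopyAtomFallback = Copy_Atom<DefaultCopy, float>;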

thakkarV commented 4 days ago

@vickyandpiggy Please do not be discouraged.

> Is there a way to use SM75_U32x4_LDSM_N and one of the mma instructions in my case?

What have you tried? What does the kernel look like so far? By the way, for the SIMT cores the math throughput is low enough that it should not matter whether you use ld.shared or ldmatrix; you should still be able to achieve peak throughput.
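
For reference, the mainloop copy step looks the same whichever copy atom wins. A sketch, where sA, tCrA, and smem_tiled_copy_A stand in for the question's smem tile, MMA register fragment, and tiled copy:

  // Smem -> register copy step, independent of the chosen copy atom (sketch).
  auto smem_thr_copy_A = smem_tiled_copy_A.get_thread_slice(threadIdx.x);
  auto tCsA      = smem_thr_copy_A.partition_S(sA);   // (CPY, CPY_M, CPY_K) smem view
  auto tCrA_view = smem_thr_copy_A.retile_D(tCrA);    // MMA fragment, retiled for the copy
  for (int k = 0; k < size<2>(tCsA); ++k) {
    cute::copy(smem_tiled_copy_A, tCsA(_, _, k), tCrA_view(_, _, k));
  }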