Open vickyandpiggy opened 2 months ago
What is your question?

Hello, I am developing a full-precision attention backward kernel using CUTLASS, and I am stuck on the use of ldmatrix and mma instructions for fp32. My GEMM is computed entirely in fp32, i.e. the datatypes of D/A/B/C are all fp32. But the structs provided in mma_sm80.hpp take half-precision/mixed-precision inputs, so I am pretty confused about how to do things right in full precision. Here is my current setting for MMA, smem and gmem. Is there a way to use SM75_U32x4_LDSM_N and one of the mma instructions in my case?

It seems to me that mma instructions do not support fp32 for multiplicand A/B, per https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-data-types. So can I use ldmatrix alone to accelerate the copying from smem to registers? Or is there a better practice for full precision?

Could anyone help? Many thanks!
The NVIDIA tensor cores do not natively support fp32 inputs for A/B, so a true fp32 tensor-core MMA is not possible.
Alternatives are:
- use the .tf32 version and accept the reduced precision.
- if reduced precision is not acceptable, try https://arxiv.org/pdf/2203.03341
- try AMD cards
The ldmatrix instruction is also tightly coupled with the tensor core fragment layout; maybe you can use the .tf32 version if the layouts match. But it may not speed up your matrix loading by itself: you still need to take care of L2-friendly prefetching and eliminating bank conflicts manually. It is all about how to use CuTe correctly and performantly.
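For reference, a minimal (untested) sketch of what the .tf32 route could look like in CuTe, assuming an SM80 target; the 2x2 warp arrangement below is only illustrative, and the fp32 operands are rounded to tf32 on the way into the MMA:

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

using namespace cute;

// Tensor-core MMA atom for D(f32) = A(tf32) * B(tf32) + C(f32), shape 16x8x8.
// fp32 data is rounded to tf32 (10-bit mantissa) when it enters the MMA.
using MmaAtom = MMA_Atom<SM80_16x8x8_F32TF32TF32F32_TN>;

// Spread the atom over a 2x2 arrangement of warps (128 threads per CTA).
using TiledMma = decltype(make_tiled_mma(MmaAtom{},
                                         Layout<Shape<_2, _2, _1>>{}));
```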
Thank you so much for the suggestions! I got it :)
> ldmatrix instruction is also tightly coupled with tensor core fragment layout
This is not necessarily true. You can in principle copy to arbitrary layouts using LDSM provided the partitioning is valid.
> The NVIDIA tensor core does not natively support A/B with fp32 inputs
This is also irrelevant. @vickyandpiggy is trying to use SIMT cores for the matmul itself. In this case, you can still totally use LDSM provided the smem layout is legal to partition with LDSM.
@vickyandpiggy Please do not be discouraged.
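As a rough, untested sketch of the pattern, assuming a CuTe kernel where `tiled_mma` is whatever TiledMMA you use (e.g. a SIMT FMA one), `sA` is the fp32 smem tile of operand A, and `tCrA` is the register fragment it feeds; whether this particular combination is legal depends on your smem layout and TiledMMA, and CuTe's static checks should reject partitionings that ldmatrix cannot serve:

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/copy_atom.hpp>

using namespace cute;

template <class TiledMma, class SmemTensorA, class RegTensorA>
__device__ void load_A_with_ldsm(TiledMma tiled_mma, SmemTensorA sA, RegTensorA tCrA)
{
  // Build an smem -> register TiledCopy from an ldmatrix atom, retiled so that
  // each thread receives exactly the values its MMA fragment expects.
  auto smem_tiled_copy_A = make_tiled_copy_A(
      Copy_Atom<SM75_U32x4_LDSM_N, float>{}, tiled_mma);

  auto thr_copy_A = smem_tiled_copy_A.get_thread_slice(threadIdx.x);
  auto tCsA       = thr_copy_A.partition_S(sA);   // smem source, (CPY, CPY_M, CPY_K)
  auto tCrA_view  = thr_copy_A.retile_D(tCrA);    // register destination view

  copy(smem_tiled_copy_A, tCsA, tCrA_view);       // issues ldmatrix instructions
}
```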
> Is there a way to use SM75_U32x4_LDSM_N and one of the mma instructions in my case?
What have you tried? What does the kernel look like so far? By the way, for SIMT-core math the throughput is low enough that it should not matter whether you use ld.shared or ldmatrix; you should still be able to achieve peak throughput.
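For example, a setup along these lines (untested sketch; the 16x16 thread arrangement and the 128-bit copy atom are just illustrative choices) keeps everything in fp32 and skips ldmatrix entirely:

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>
#include <cute/atom/copy_atom.hpp>

using namespace cute;

// SIMT (CUDA-core) fp32 MMA: one FFMA per thread per atom, full fp32 precision.
using SimtMmaAtom = MMA_Atom<UniversalFMA<float, float, float, float>>;

// 16x16 threads (256 per CTA) cooperating on the C tile.
using TiledMma = decltype(make_tiled_mma(SimtMmaAtom{},
                                         Layout<Shape<_16, _16, _1>>{}));

// Plain vectorized loads (128-bit ld.shared) for the smem -> register copy.
// With FFMA-bound math this is generally enough to stay at peak throughput.
using SmemCopyAtom = Copy_Atom<UniversalCopy<uint128_t>, float>;
```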