ghostplant opened 1 day ago
TMA knows nothing about whether the tensor is quantized or not. You would load your int8 tensor with TMA just like any other int8 tensor and then perform the dequant manually in core. See our mixed-input GEMM example: https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm
What is your question? The data in global memory are stored in int8 format. I want to use TMA to load it directly from gmem, then cast the int8 data to fp16 before storing the fp16 data to smem. Are there documents or examples showing how to achieve that with the current CuTe interface?