NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] How to use TMA to load Quantized tensor in CUTE? #1892

Open ghostplant opened 1 day ago

ghostplant commented 1 day ago

What is your question? The data in global memory are stored in int8 format.

I want to use TMA to load it directly from gmem, then cast the int8 data to fp16 before storing the fp16 data to smem. Are there documents or examples showing how to achieve this with the current CUTE interface?

thakkarV commented 1 day ago

TMA knows nothing about whether the tensor is quantized. You would load your int8 tensor with TMA just like any other int8 tensor, and then perform the dequantization manually in-core. See our mixed-input GEMM example: https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm