NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Inquiry Regarding the Use of FP8 Type in GEMM Computations #1940

Closed: unbelievable3513 closed this issue 1 month ago

unbelievable3513 commented 1 month ago

I hope this message finds you well. I have a question regarding the use of FP8 type in GEMM computations, particularly in the context of the gemm_plugin and cublasLtMatmul functions.

In the gemm_plugin scenario, when using cublasLtMatmul for FP8 GEMM, I noticed that the compute_type is configured to FP32. This is consistent with the NVIDIA cuBLAS documentation (https://docs.nvidia.com/cuda/cublas/#cublasltmatmul), which states that the compute_type for FP8 must be FP32. My understanding is that compute_type refers to the operations outside of the matrix multiplication op(A)*op(B) in the equation D = scale_d * (alpha * scale_a * scale_b * op(A)*op(B) + beta * scale_c * C). Does this mean that the matrix multiplication op(A)*op(B) itself remains FP8 on Hopper? This interpretation is further supported by the use of FP8 for the matrix multiplication with cublasGemmEx and in the gemm_swiglu_plugin's DeviceGemmGatedSm90, where scaling and accumulation are performed in FP32. Is my understanding correct?
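For context, a minimal sketch of how an FP8 cublasLtMatmul call is typically set up: E4M3 inputs, FP32 compute type, FP16 output, and per-tensor scales attached to the matmul descriptor. The function name, shapes, and the omission of error checking, workspace, and heuristic algorithm selection are my own simplifications, not the actual gemm_plugin code:

```cuda
#include <cublasLt.h>
#include <cuda_fp8.h>

// Illustrative FP8 GEMM setup via cublasLtMatmul (not the gemm_plugin code).
// A and B are FP8 (E4M3); the compute type is FP32, but the MMA itself still
// consumes FP8 operands. Error checks and workspace handling are omitted.
void fp8_matmul_sketch(cublasLtHandle_t lt,
                       const __nv_fp8_e4m3* A, const __nv_fp8_e4m3* B,
                       __half* D, int m, int n, int k,
                       const float* d_scale_a, const float* d_scale_b,
                       cudaStream_t stream)
{
    cublasLtMatmulDesc_t op = nullptr;
    // compute_type must be FP32 for FP8 GEMMs.
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Per-tensor dequantization scales (scale_a, scale_b in the cuBLAS formula).
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_A_SCALE_POINTER,
                                   &d_scale_a, sizeof(d_scale_a));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_B_SCALE_POINTER,
                                   &d_scale_b, sizeof(d_scale_b));

    // FP8 GEMMs require the TN layout: A transposed, B non-transposed.
    cublasOperation_t transA = CUBLAS_OP_T, transB = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &transA, sizeof(transA));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &transB, sizeof(transB));

    cublasLtMatrixLayout_t layoutA, layoutB, layoutD;
    cublasLtMatrixLayoutCreate(&layoutA, CUDA_R_8F_E4M3, k, m, k);  // op(A) is k x m
    cublasLtMatrixLayoutCreate(&layoutB, CUDA_R_8F_E4M3, k, n, k);
    cublasLtMatrixLayoutCreate(&layoutD, CUDA_R_16F, m, n, m);

    float alpha = 1.f, beta = 0.f;
    // algo = nullptr lets cuBLASLt pick internally; real code would query
    // cublasLtMatmulAlgoGetHeuristic and pass a workspace.
    cublasLtMatmul(lt, op, &alpha, A, layoutA, B, layoutB,
                   &beta, D, layoutD, D, layoutD,
                   /*algo=*/nullptr, /*workspace=*/nullptr, /*workspaceSize=*/0, stream);

    cublasLtMatrixLayoutDestroy(layoutD);
    cublasLtMatrixLayoutDestroy(layoutB);
    cublasLtMatrixLayoutDestroy(layoutA);
    cublasLtMatmulDescDestroy(op);
}
```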

I also observed that for small batch sizes (bs <= 4), fp8Gemm converts the FP8 inputs to FP32 using Converter::convert(fp8, fp32), performs the accumulation with FMA, and then dequantizes the result with x_scale * w_scale. Could you explain the rationale for not using FP8 directly for the computation in this case?
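For illustration, here is a minimal sketch (not the actual fp8Gemm kernel) of the convert-to-FP32, FMA, then dequantize pattern described above; the kernel name, data layouts, and the one-thread-per-output mapping are assumptions chosen for readability:

```cuda
#include <cuda_fp8.h>

// Simplified small-batch FP8 GEMM: one thread per output element of
// C[m x n] = A[m x k] * B[n x k]^T, with m very small (batch <= 4).
// FP8 operands are widened to FP32, accumulated with fmaf on CUDA cores,
// and the result is dequantized once with the per-tensor scales.
__global__ void fp8_gemm_small_batch(const __nv_fp8_e4m3* A,  // [m x k] activations
                                     const __nv_fp8_e4m3* B,  // [n x k] weights
                                     float* C,                // [m x n] output
                                     int m, int n, int k,
                                     float x_scale, float w_scale)
{
    int row = blockIdx.y;                                // batch row, m is tiny
    int col = blockIdx.x * blockDim.x + threadIdx.x;     // output column
    if (row >= m || col >= n) return;

    float acc = 0.f;
    for (int i = 0; i < k; ++i) {
        // Analogue of Converter::convert(fp8, fp32): widen each operand.
        float a = float(A[row * k + i]);
        float b = float(B[col * k + i]);
        acc = fmaf(a, b, acc);                           // FMA on CUDA cores
    }
    // Dequantize at the end with the product of the input scales.
    C[row * n + col] = acc * x_scale * w_scale;
}
```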

Thank you for your attention to these questions. I look forward to your insights.

byshiue commented 1 month ago

In the gemm_plugin scenario, when using cublasLtMatmul for FP8 GEMM, I noticed that the compute_type is configured to FP32. This is consistent with the NVIDIA cuBLAS documentation (https://docs.nvidia.com/cuda/cublas/#cublasltmatmul), which states that the compute_type for FP8 must be FP32. My understanding is that compute_type refers to the operations outside of the matrix multiplication op(A)*op(B) in the equation D = scale_d * (alpha * scale_a * scale_b * op(A)*op(B) + beta * scale_c * C). Does this mean that the matrix multiplication op(A)*op(B) itself remains FP8 on Hopper? This interpretation is further supported by the use of FP8 for the matrix multiplication with cublasGemmEx and in the gemm_swiglu_plugin's DeviceGemmGatedSm90, where scaling and accumulation are performed in FP32. Is my understanding correct?

That's correct.

I also observed that for small batch sizes (bs <= 4), fp8Gemm converts the FP8 inputs to FP32 using Converter::convert(fp8, fp32), performs the accumulation with FMA, and then dequantizes the result with x_scale * w_scale. Could you explain the rationale for not using FP8 directly for the computation in this case?

It should be because we don't have FP8 operations on CUDA cores at the moment.

unbelievable3513 commented 1 month ago

I also observed that for small batch sizes (bs <= 4), fp8Gemm converts the FP8 inputs to FP32 using Converter::convert(fp8, fp32), performs the accumulation with FMA, and then dequantizes the result with x_scale * w_scale. Could you explain the rationale for not using FP8 directly for the computation in this case?

It should be because we don't have FP8 operations on CUDA cores at the moment.

Thanks a lot, byshiue. So the consideration in selecting this kernel is that FMA on CUDA cores is more efficient than FP8 on Tensor Cores for small batches (batch <= 4), and it does not imply that FP8 GEMM cannot handle small batches. Is that correct? @byshiue

byshiue commented 1 month ago

Correct.
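For readers following the thread, the kernel selection being confirmed here could be sketched as follows. This is illustrative pseudocode only, not the actual TensorRT-LLM dispatch logic; the threshold constant and function names refer to the hypothetical kernels sketched earlier in this thread:

```cuda
// Illustrative dispatch: small batches take the CUDA-core FMA path,
// larger batches take the Tensor Core FP8 path via cuBLASLt.
constexpr int kSmallBatchThreshold = 4;

void runFp8Gemm(int batch_size /*, ... problem arguments ... */)
{
    if (batch_size <= kSmallBatchThreshold) {
        // Convert-to-FP32 + FMA kernel on CUDA cores, e.g.:
        // fp8_gemm_small_batch<<<grid, block>>>(...);
    } else {
        // FP8 Tensor Core GEMM via cublasLtMatmul, e.g.:
        // fp8_matmul_sketch(...);
    }
}
```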