NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
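For context on the "easy-to-use Python API" mentioned above, a minimal sketch of the high-level LLM API is shown below. This is illustrative only and not part of the issue: the model identifier is a placeholder and argument names may differ across TensorRT-LLM releases.

```python
# Minimal sketch of the high-level TensorRT-LLM Python (LLM) API.
# The model identifier is a placeholder; parameter names may vary by version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # builds or loads a TensorRT engine
sampling = SamplingParams(max_tokens=32)

for output in llm.generate(["What does FP8 change on Hopper GPUs?"], sampling):
    print(output.outputs[0].text)
```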

Consultation on underperformance of fp8 vs. fp16 kernels on H100 #2013

Closed · unbelievable3513 closed 3 months ago

unbelievable3513 commented 3 months ago

When testing llama-3-chinese-8b in fp8 vs. fp16 on a single H100 (input tokens = 1024, output tokens = 200, batch_size = 16, tp = 1), the expected performance was achieved in the context phase. In the decode phase, however, fp8 (32.739 ms) turned out to be slower than fp16 (24.488 ms). A kernel-level breakdown shows that the main contributor is the fp8 GEMM kernel sm90_xmma_gemm_e4m3bf16_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize2x1x1_algo2_execute_segment_k_off_kernel__5x_cublas (166.991 μs average per call), which is slower than the corresponding fp16 kernel sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas (114.508 μs average per call).

I would like to confirm whether this is due to a compilation/configuration issue, to fp8 kernel optimization gaps on the H100 platform, or to a dequantization step that actually computes in bf16 inside sm90_xmma_gemm_e4m3bf16_e4m3f32_f32. If it is the latter, are there any optimization plans for the future? Looking forward to your reply, thank you.
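As a sanity check (not part of the original report), the GEMM at the decode shape can also be benchmarked in isolation outside TensorRT-LLM to see whether the gap comes from the kernel itself. The sketch below is an assumption: it uses PyTorch's private torch._scaled_mm (which also dispatches to cuBLASLt on H100), a placeholder hidden size of 4096 for Llama-3-8B, and unit scales.

```python
# Hypothetical standalone microbenchmark (not from the issue): compare a bf16 matmul
# with an fp8 (e4m3) matmul at the decode-phase shape M = batch_size = 16, K = N = 4096.
# torch._scaled_mm is a private API; keyword names and return type vary across PyTorch versions.
import torch

def bench(fn, iters=200, warmup=20):
    """Average GPU time per call in microseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # ms -> us per call

m, k, n = 16, 4096, 4096  # decode-phase GEMM: one token per sequence, batch_size rows

a_bf16 = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
b_bf16 = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)

a_fp8 = a_bf16.to(torch.float8_e4m3fn)                       # row-major activations
b_fp8 = b_bf16.t().contiguous().to(torch.float8_e4m3fn).t()  # column-major weights, as _scaled_mm expects
scale = torch.tensor(1.0, device="cuda")                      # unit scales, timing only

t_bf16 = bench(lambda: a_bf16 @ b_bf16)
t_fp8 = bench(lambda: torch._scaled_mm(a_fp8, b_fp8, scale_a=scale, scale_b=scale,
                                       out_dtype=torch.bfloat16))
print(f"bf16 GEMM: {t_bf16:.1f} us/call   fp8 GEMM: {t_fp8:.1f} us/call")
```

If the isolated fp8 GEMM is also slower at M = 16, the gap is more likely a property of the small-M decode shape (note the fp8 kernel's 128x128x128 tile versus 64x64x64 for fp16) than of the build configuration.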

The specific environment, compilation commands, and run commands are listed below.

fp8 decode phase (total 32.739 ms)

Time   Total      Avg         Name
65.1%  21.375 ms  166.991 μs  sm90_xmma_gemm_e4m3bf16_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize2x1x1_algo2_execute_segment_k_off_kernel__5x_cublas
23.5%  7.731 ms   241.579 μs  void tensorrt_llm::kernels::mmha::masked_multihead_attention_kernel<__nv_bfloat16, nv_fp8_e4m3, nv_fp8_e4m3, tensorrt_llm::kernels::KVLinearBuffer, tensorrt_llm::kernels::KVLinearBuffer, (unsigned int)128, (unsigned int)256, (bool)0, (bool)0, (bool)0, (bool)0, (bool)0, (bool)0, (bool)0, (unsigned int)16, (unsigned int)16, (unsigned int)4, (unsigned int)8, (unsigned int)0, (unsigned int)0>(tensorrt_llm::kernels::Multihead_attention_params<T1, T8>, T4, T5)
bf16 decode phase (total 24.488 ms)

Time   Total      Avg         Name
60.3%  14.772 ms  114.508 μs  sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
30.6%  7.491 ms   234.083 μs  void tensorrt_llm::kernels::mmha::masked_multihead_attention_kernel<unsigned short, unsigned short, unsigned short, tensorrt_llm::kernels::KVLinearBuffer, tensorrt_llm::kernels::KVLinearBuffer, (unsigned int)128, (unsigned int)256, (bool)0, (bool)0, (bool)0, (bool)0, (bool)0, (bool)0, (bool)0, (unsigned int)16, (unsigned int)16, (unsigned int)4, (unsigned int)8, (unsigned int)0, (unsigned int)0>(tensorrt_llm::kernels::Multihead_attention_params<T1, T8>, T4, T5)
unbelievable3513 commented 3 months ago

Sorry, please ignore this issue; it turned out to be a problem with the local H100 machine. After a reboot, performance is back to normal. Thanks for your time.