When testing the performance of `llama-3-chinese-8b` in `fp8` and `fp16` on a single-GPU `H100` machine (input_token=1024, output_token=200, batch_size=16, tp=1), the expected performance was achieved during the context phase. However, in the decode phase, `fp8` (32.739 ms) was slower than `fp16` (24.488 ms). After a kernel-level breakdown, the main cause appears to be that the `fp8` GEMM kernel `sm90_xmma_gemm_e4m3bf16_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize2x1x1_algo2_execute_segment_k_off_kernel__5x_cublas` (166.991 μs) is slower than the `fp16` kernel `sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas` (114.508 μs).
I would like to confirm whether this is caused by a compilation/configuration issue, by an `fp8` kernel optimization issue on the H100 platform, or whether there is a dequantization step that actually computes in bf16 inside `sm90_xmma_gemm_e4m3bf16_e4m3f32_f32`. If it is the latter, are there any optimization plans for the future? Looking forward to your reply, thank you.
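To help narrow this down, below is a minimal, standalone microbenchmark sketch (plain PyTorch, outside TensorRT-LLM) that isolates a decode-phase-sized GEMM and compares the fp16 and fp8 paths. The shapes (M = batch_size = 16, K = N = 4096 for the llama-3-8b hidden size) and the use of the private `torch._scaled_mm` API are my assumptions for illustration, not taken from the actual run; the `_scaled_mm` signature has varied across PyTorch releases.

```python
# Hypothetical microbenchmark: time a decode-phase-sized GEMM in fp16 vs fp8
# outside TensorRT-LLM. Assumed shapes: M=batch_size=16, K=N=hidden_size=4096.
import torch

M, K, N = 16, 4096, 4096
iters = 200
dev = "cuda"

def bench_us(fn):
    # Warm up, then time on the GPU with CUDA events; returns microseconds per call.
    for _ in range(20):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters * 1e3

# fp16 baseline (cuBLAS path)
a16 = torch.randn(M, K, device=dev, dtype=torch.float16)
b16 = torch.randn(K, N, device=dev, dtype=torch.float16)
print(f"fp16 GEMM: {bench_us(lambda: a16 @ b16):.1f} us")

# fp8 (e4m3) path via the private torch._scaled_mm API (signature varies by
# PyTorch version; treat this as illustrative only).
a8 = a16.to(torch.float8_e4m3fn)
b8 = b16.t().contiguous().to(torch.float8_e4m3fn).t()  # mat2 must be column-major
one = torch.ones((), device=dev, dtype=torch.float32)  # tensor-wise scale = 1.0
print(f"fp8 GEMM: {bench_us(lambda: torch._scaled_mm(a8, b8, scale_a=one, scale_b=one, out_dtype=torch.float16)):.1f} us")
```

This is only meant to check whether the fp8/fp16 gap at the small decode-phase M dimension reproduces outside the engine, so the same GEMM shapes can be profiled in isolation.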
The specific environment, compilation commands, and run commands are listed below.