When testing the performance of `llama-3-chinese-8b` in `fp8` and `fp16` on a single-GPU `H100` machine (input_token=1024, output_token=200, batch_size=16, tp=1), the expected performance was achieved during the context phase. However, in the decode phase, `fp8` (32.739 ms) was slower than `fp16` (24.488 ms). After a kernel-level breakdown, the main cause appears to be that the `fp8` GEMM kernel `sm90_xmma_gemm_e4m3bf16_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize2x1x1_algo2_execute_segment_k_off_kernel__5x_cublas` (166.991 μs) is slower than the `fp16` kernel `sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas` (114.508 μs).
I would like to confirm whether this is caused by a compilation/configuration issue, by an `fp8` kernel optimization issue on the H100 platform, or whether there is a dequantization step that actually computes in bf16 inside `sm90_xmma_gemm_e4m3bf16_e4m3f32_f32`. If it is the latter, are there any optimization plans for the future? Looking forward to your reply, thank you.
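To help narrow this down, below is a minimal, standalone microbenchmark sketch (plain PyTorch, outside TensorRT-LLM) that isolates a decode-phase-sized GEMM and compares the fp16 and fp8 paths. The shapes (M = batch_size = 16, K = N = 4096 for the llama-3-8b hidden size) and the use of the private `torch._scaled_mm` API are my assumptions for illustration, not taken from the actual run; the `_scaled_mm` signature has varied across PyTorch releases.

```python
# Hypothetical microbenchmark: time a decode-phase-sized GEMM in fp16 vs fp8
# outside TensorRT-LLM. Assumed shapes: M=batch_size=16, K=N=hidden_size=4096.
import torch

M, K, N = 16, 4096, 4096
iters = 200
dev = "cuda"

def bench_us(fn):
    # Warm up, then time on the GPU with CUDA events; returns microseconds per call.
    for _ in range(20):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters * 1e3

# fp16 baseline (cuBLAS path)
a16 = torch.randn(M, K, device=dev, dtype=torch.float16)
b16 = torch.randn(K, N, device=dev, dtype=torch.float16)
print(f"fp16 GEMM: {bench_us(lambda: a16 @ b16):.1f} us")

# fp8 (e4m3) path via the private torch._scaled_mm API (signature varies by
# PyTorch version; treat this as illustrative only).
a8 = a16.to(torch.float8_e4m3fn)
b8 = b16.t().contiguous().to(torch.float8_e4m3fn).t()  # mat2 must be column-major
one = torch.ones((), device=dev, dtype=torch.float32)  # tensor-wise scale = 1.0
print(f"fp8 GEMM: {bench_us(lambda: torch._scaled_mm(a8, b8, scale_a=one, scale_b=one, out_dtype=torch.float16)):.1f} us")
```

This is only meant to check whether the fp8/fp16 gap at the small decode-phase M dimension reproduces outside the engine, so the same GEMM shapes can be profiled in isolation.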
The specific environment, compilation commands, and run commands are listed below.