NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[FP8 Post-Training Quantization] "use_fp8_context_fmha" Not Supported As Description #1463

Open taozhang9527 opened 5 months ago

taozhang9527 commented 5 months ago

System Info

CPU: x86
GPU: H100
Server: XE9640
Code: TensorRT-LLM 0.8.0 release

Who can help?

@Tracin @juney-nvidia

Regarding FP8 post-training quantization, the documentation note says: "Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable"

However, --use_fp8_context_fmha enable is not a recognized trtllm-build option. All the options I can see are as follows:

usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG] [--build_config BUILD_CONFIG]
                    [--model_cls_file MODEL_CLS_FILE] [--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE]
                    [--log_level LOG_LEVEL] [--profiling_verbosity {layer_names_only,detailed,none}] [--enable_debug_output]
                    [--output_dir OUTPUT_DIR] [--workers WORKERS] [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
                    [--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH] [--max_num_tokens MAX_NUM_TOKENS]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE] [--use_fused_mlp]
                    [--gather_all_token_logits] [--gather_context_logits] [--gather_generation_logits] [--strongly_typed]
                    [--builder_opt BUILDER_OPT] [--logits_dtype {float16,float32}] [--weight_only_precision {int8,int4}]
                    [--bert_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gemm_plugin {float16,float32,bfloat16,disable}] [--lookup_plugin {float16,float32,bfloat16,disable}]
                    [--lora_plugin {float16,float32,bfloat16,disable}] [--context_fmha {enable,disable}]
                    [--context_fmha_fp32_acc {enable,disable}] [--paged_kv_cache {enable,disable}]
                    [--remove_input_padding {enable,disable}] [--use_custom_all_reduce {enable,disable}]
                    [--multi_block_mode {enable,disable}] [--enable_xqa {enable,disable}]
                    [--attention_qk_half_accumulation {enable,disable}] [--tokens_per_block TOKENS_PER_BLOCK]
                    [--use_paged_context_fmha {enable,disable}] [--use_context_fmha_for_generation {enable,disable}]
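
For reference, a quick way to confirm which TensorRT-LLM release is installed and whether the installed trtllm-build exposes the flag (a minimal check, assuming a pip-installed tensorrt_llm that exposes __version__):

# print the installed TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# search the build tool's help text for the FP8 context FMHA flag
trtllm-build --help | grep fp8_context_fmha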

Information

Tasks

Reproduction

Quantize HF LLaMA 70B into FP8 and export trtllm checkpoint

python ../quantization/quantize.py --model_dir ./tmp/llama/70B \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_2gpu_fp8 \
                                   --calib_size 512 \
                                   --tp_size 2

Build trtllm engines from the trtllm checkpoint

Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --strongly_typed \
             --workers 2 \
             --use_fp8_context_fmha enable

Expected behavior

Expect the engine build to kick off.

Actual behavior

trtllm-build: error: unrecognized arguments: --use_fp8_context_fmha enable

Additional notes

There are many options for the trtllm-build command. Detailed documentation on those options would help users set the correct ones.

byshiue commented 5 months ago

The flag use_fp8_context_fmha is not supported in v0.8.0; it was added in v0.9.0. Please try v0.9.0.
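
A minimal upgrade-and-verify sketch, assuming a pip-based install from NVIDIA's package index (the exact steps differ if you use the NGC Docker container or build from source):

# install the v0.9.0 wheel (pinned version is an assumption; adjust to your environment)
pip3 install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com
# the flag should now appear in the option list
trtllm-build --help | grep use_fp8_context_fmha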

taozhang9527 commented 5 months ago

Yes, I tried 0.9.0 and it is supported now.

What is the relationship between --use_fp8_context_fmha and --context_fmha enable? If I use --use_fp8_context_fmha, do I still need --context_fmha enable?

In general, is there any documentation for those different build options?

byshiue commented 5 months ago

Yes, you need to enable both. It is explained in https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#fp8-context-fmha.
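
Putting the two flags together with the build command from the reproduction above, a sketch of the v0.9.0 build would look like this (both --context_fmha enable and --use_fp8_context_fmha enable set; other flags unchanged):

# build the FP8 engine with context FMHA enabled and FP8 context FMHA on top of it
trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --strongly_typed \
             --workers 2 \
             --context_fmha enable \
             --use_fp8_context_fmha enable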