taozhang9527 opened 5 months ago
The flag `use_fp8_context_fmha` is not supported in v0.8.0; it was added in v0.9.0. Please try v0.9.0.

Yes, tried in 0.9.0, and it is supported now.
What is the relationship between `--use_fp8_context_fmha` and `--context_fmha enable`? If I use `--use_fp8_context_fmha`, do I still need `--context_fmha enable`? In general, is there any documentation for these different build options?
Yes, you need to enable both. It is explained in https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#fp8-context-fmha.
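For reference, a build invocation with both options enabled might look like the following. This is only a sketch: the checkpoint and output directories are placeholder paths, and per the linked doc the checkpoint must already be FP8-quantized for FP8 context FMHA to apply.

```shell
# Sketch only: ./tllm_checkpoint_fp8 and ./engine_fp8 are placeholder paths.
# Per the answer above, --use_fp8_context_fmha requires --context_fmha enable as well.
trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8 \
    --output_dir ./engine_fp8 \
    --context_fmha enable \
    --use_fp8_context_fmha enable
```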
System Info

CPU: x86
GPU: H100
Server: XE9640
Code: TensorRT-LLM 0.8.0 release
Who can help?
@Tracin @juney-nvidia
Regarding FP8 post-training quantization, the docs mention in a note: "Enable fp8 context fmha to get further acceleration by setting `--use_fp8_context_fmha enable`". However, `--use_fp8_context_fmha` is not a `trtllm-build` option. All the options I can see are as follows:

```
usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG]
                    [--build_config BUILD_CONFIG] [--model_cls_file MODEL_CLS_FILE]
                    [--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE]
                    [--log_level LOG_LEVEL]
                    [--profiling_verbosity {layer_names_only,detailed,none}]
                    [--enable_debug_output] [--output_dir OUTPUT_DIR] [--workers WORKERS]
                    [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
                    [--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH]
                    [--max_num_tokens MAX_NUM_TOKENS]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE]
                    [--use_fused_mlp] [--gather_all_token_logits] [--gather_context_logits]
                    [--gather_generation_logits] [--strongly_typed] [--builder_opt BUILDER_OPT]
                    [--logits_dtype {float16,float32}] [--weight_only_precision {int8,int4}]
                    [--bert_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gemm_plugin {float16,float32,bfloat16,disable}]
                    [--lookup_plugin {float16,float32,bfloat16,disable}]
                    [--lora_plugin {float16,float32,bfloat16,disable}]
                    [--context_fmha {enable,disable}]
                    [--context_fmha_fp32_acc {enable,disable}]
                    [--paged_kv_cache {enable,disable}]
                    [--remove_input_padding {enable,disable}]
                    [--use_custom_all_reduce {enable,disable}]
                    [--multi_block_mode {enable,disable}] [--enable_xqa {enable,disable}]
                    [--attention_qk_half_accumulation {enable,disable}]
                    [--tokens_per_block TOKENS_PER_BLOCK]
                    [--use_paged_context_fmha {enable,disable}]
                    [--use_context_fmha_for_generation {enable,disable}]
```
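The `{enable,disable}` choices in the usage text are plain string options, which is why passing an unknown flag fails outright rather than being ignored. A minimal argparse sketch (not the actual trtllm-build source; option names copied from the usage above, defaults assumed) of how such flags are typically declared and mapped to booleans:

```python
import argparse

# Hypothetical sketch mirroring the {enable,disable} option style seen in
# the trtllm-build usage text; defaults here are assumptions.
parser = argparse.ArgumentParser(prog="trtllm-build-sketch")
parser.add_argument("--context_fmha", choices=["enable", "disable"], default="disable")
parser.add_argument("--use_fp8_context_fmha", choices=["enable", "disable"], default="disable")

# Parse a sample command line and convert the string choices to booleans.
args = parser.parse_args(["--context_fmha", "enable", "--use_fp8_context_fmha", "enable"])
use_fp8_fmha = args.use_fp8_context_fmha == "enable"
print(use_fp8_fmha)  # True
```

Because the choices are validated strings, an option that simply is not registered (as in v0.8.0) triggers argparse's "unrecognized arguments" error, matching the failure reported below.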
Information

Tasks

Reproduction
Quantize HF LLaMA 70B into FP8 and export a TensorRT-LLM checkpoint:

```
python ../quantization/quantize.py --model_dir ./tmp/llama/70B \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./tllm_checkpoint_2gpu_fp8 \
    --calib_size 512 \
    --tp_size 2
```
Build TensorRT-LLM engines from the checkpoint, enabling fp8 context fmha for further acceleration with `--use_fp8_context_fmha enable`:

```
trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --workers 2 \
    --use_fp8_context_fmha enable
```
Expected behavior

The engine build is expected to kick off.

Actual behavior

```
trtllm-build: error: unrecognized arguments: --use_fp8_context_fmha enable
```
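One way to confirm whether the installed release exposes the flag before kicking off a build (a sketch; assumes `trtllm-build` and the `tensorrt_llm` Python package are installed and on the path):

```shell
# Show any fp8-context-fmha related options in the installed trtllm-build;
# on a release without the flag (e.g. v0.8.0) the grep finds nothing.
trtllm-build --help | grep -i fp8_context_fmha || echo "flag not available in this release"

# Report the installed TensorRT-LLM version.
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```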
Additional notes

There are many options for the trtllm-build command. Detailed documentation on those options would help users set the correct ones.