NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[FP8 Post-Training Quantization] "use_fp8_context_fmha" Not Supported As Description #1463

Open taozhang9527 opened 5 months ago

taozhang9527 commented 5 months ago

System Info

CPU: x86
GPU: H100
Server: XE9640
Code: TensorRT-LLM 0.8.0 release

Who can help?

@Tracin @juney-nvidia

Regarding FP8 post-training quantization, the documentation note says: "Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable"

However, --use_fp8_context_fmha enable is not a recognized trtllm-build option. All the options I can see are as follows:

usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG] [--build_config BUILD_CONFIG]
                    [--model_cls_file MODEL_CLS_FILE] [--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE]
                    [--log_level LOG_LEVEL] [--profiling_verbosity {layer_names_only,detailed,none}] [--enable_debug_output]
                    [--output_dir OUTPUT_DIR] [--workers WORKERS] [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
                    [--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH] [--max_num_tokens MAX_NUM_TOKENS]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE] [--use_fused_mlp]
                    [--gather_all_token_logits] [--gather_context_logits] [--gather_generation_logits] [--strongly_typed]
                    [--builder_opt BUILDER_OPT] [--logits_dtype {float16,float32}] [--weight_only_precision {int8,int4}]
                    [--bert_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gemm_plugin {float16,float32,bfloat16,disable}] [--lookup_plugin {float16,float32,bfloat16,disable}]
                    [--lora_plugin {float16,float32,bfloat16,disable}] [--context_fmha {enable,disable}]
                    [--context_fmha_fp32_acc {enable,disable}] [--paged_kv_cache {enable,disable}]
                    [--remove_input_padding {enable,disable}] [--use_custom_all_reduce {enable,disable}]
                    [--multi_block_mode {enable,disable}] [--enable_xqa {enable,disable}]
                    [--attention_qk_half_accumulation {enable,disable}] [--tokens_per_block TOKENS_PER_BLOCK]
                    [--use_paged_context_fmha {enable,disable}] [--use_context_fmha_for_generation {enable,disable}]
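
For reference, a quick way to confirm which TensorRT-LLM release is installed and whether the installed trtllm-build exposes the flag (a minimal check, assuming a pip-installed tensorrt_llm that exposes __version__):

# print the installed TensorRT-LLM version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# search the build tool's help text for the FP8 context FMHA flag
trtllm-build --help | grep fp8_context_fmha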

Information

Tasks

Reproduction

Quantize HF LLaMA 70B into FP8 and export trtllm checkpoint

python ../quantization/quantize.py --model_dir ./tmp/llama/70B \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_2gpu_fp8 \
                                   --calib_size 512 \
                                   --tp_size 2

Build trtllm engines from the trtllm checkpoint

Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --strongly_typed \
             --workers 2 \
             --use_fp8_context_fmha enable

Expected behavior

Expect the engine build to kick off.

Actual behavior

trtllm-build: error: unrecognized arguments: --use_fp8_context_fmha enable

Additional notes

There are many options for the trtllm-build command. Detailed documentation on those options would help users set the correct ones.

byshiue commented 5 months ago

The flag use_fp8_context_fmha is not supported in v0.8.0; it was added in v0.9.0. Please try v0.9.0.
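
A minimal upgrade-and-verify sketch, assuming a pip-based install from NVIDIA's package index (the exact steps differ if you use the NGC Docker container or build from source):

# install the v0.9.0 wheel (pinned version is an assumption; adjust to your environment)
pip3 install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com
# the flag should now appear in the option list
trtllm-build --help | grep use_fp8_context_fmha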

taozhang9527 commented 5 months ago

Yes, I tried 0.9.0 and it is supported now.

What is the relationship between --use_fp8_context_fmha and --context_fmha enable? If I use --use_fp8_context_fmha, do I still need --context_fmha enable?

In general, is there any documentation for those different build options?

byshiue commented 5 months ago

Yes, you need to enable both. It is explained in https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#fp8-context-fmha.
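
Putting the two flags together with the build command from the reproduction above, a sketch of the v0.9.0 build would look like this (both --context_fmha enable and --use_fp8_context_fmha enable set; other flags unchanged):

# build the FP8 engine with context FMHA enabled and FP8 context FMHA on top of it
trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --strongly_typed \
             --workers 2 \
             --context_fmha enable \
             --use_fp8_context_fmha enable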