NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

What are the suggested arguments to build an efficient engine? #1417

Closed · sleepwalker2017 closed this issue 6 months ago

sleepwalker2017 commented 7 months ago

I'm reading the manual here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md. The scripts there are very simple; do they ensure the best performance?

python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
                              --output_dir ./tllm_checkpoint_1gpu_fp16 \
                              --dtype float16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
            --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
            --gemm_plugin float16

I can't find a full configuration for building a LLaMA engine. Is there one?

Also, trtllm-build --help lists a lot of options, but I can't find the meaning or default values for many of them. How should we choose among these options?

options:
  -h, --help            show this help message and exit
  --checkpoint_dir CHECKPOINT_DIR
  --model_config MODEL_CONFIG
  --build_config BUILD_CONFIG
  --model_cls_file MODEL_CLS_FILE
  --model_cls_name MODEL_CLS_NAME
  --input_timing_cache INPUT_TIMING_CACHE
                        The path to read timing cache, will be ignored if the file does not exist
  --output_timing_cache OUTPUT_TIMING_CACHE
                        The path to write timing cache
  --log_level LOG_LEVEL
  --profiling_verbosity {layer_names_only,detailed,none}
                        The profiling verbosity for the generated TRT engine. Set to detailed can inspect tactic choices and kernel parameters.
  --enable_debug_output
  --output_dir OUTPUT_DIR
                        The path to save the serialized engine files and model configs
  --workers WORKERS     The number of workers for building in parallel
  --max_batch_size MAX_BATCH_SIZE
  --max_input_len MAX_INPUT_LEN
  --max_output_len MAX_OUTPUT_LEN
  --max_beam_width MAX_BEAM_WIDTH
  --max_num_tokens MAX_NUM_TOKENS
  --opt_num_tokens OPT_NUM_TOKENS
                        It equals to max_batch_size*max_beam_width by default, set this value as close as possible to the actual number of tokens on your workload. Note that this argument might be removed in
                        the future.
  --tp_size TP_SIZE
  --pp_size PP_SIZE
  --max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE, --max_multimodal_len MAX_PROMPT_EMBEDDING_TABLE_SIZE
                        Setting to a value > 0 enables support for prompt tuning or multimodal input.
  --use_fused_mlp       Enable horizontal fusion in GatedMLP, reduces layer input traffic and potentially improves performance. For FP8 PTQ, the downside is slight reduction of accuracy because one of the
                        quantization scaling factors are discarded. (An example for reference only: 0.45734 vs 0.45755 for LLaMA-v2 7B using `ammo/examples/hf/instruct_eval/mmlu.py`).
  --gather_all_token_logits
                        Enable both gather_context_logits and gather_generation_logits
  --gather_context_logits
                        Gather context logits
  --gather_generation_logits
                        Gather generation logits
  --strongly_typed      This option is introduced with TensorRT 9.1.0.1+ and will reduce the engine building time. It's not expected to see performance or accuracy regression after enable this flag. Note that,
                        we may remove this flag in the future, and enable the feature by default.
  --builder_opt BUILDER_OPT
plugin_config:
  --bert_attention_plugin {float16,float32,bfloat16,disable}
  --gpt_attention_plugin {float16,float32,bfloat16,disable}
  --gemm_plugin {float16,float32,bfloat16,disable}
  --lookup_plugin {float16,float32,bfloat16,disable}
  --lora_plugin {float16,float32,bfloat16,disable}
  --moe_plugin {float16,float32,bfloat16,disable}
  --mamba_conv1d_plugin {float16,float32,bfloat16,disable}
  --context_fmha {enable,disable}
  --context_fmha_fp32_acc {enable,disable}
  --paged_kv_cache {enable,disable}
  --remove_input_padding {enable,disable}
  --use_custom_all_reduce {enable,disable}
  --multi_block_mode {enable,disable}
  --enable_xqa {enable,disable}
  --attention_qk_half_accumulation {enable,disable}
  --tokens_per_block TOKENS_PER_BLOCK
  --use_paged_context_fmha {enable,disable}
  --use_fp8_context_fmha {enable,disable}
  --use_context_fmha_for_generation {enable,disable}
  --multiple_profiles {enable,disable}
  --paged_state {enable,disable}
  --streamingllm {enable,disable}
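
For reference, here is one way the flags above could be combined into a fuller build command. This is only a sketch: the flag names come from the help output above, but the specific values (batch size, sequence lengths, token counts, paths) are illustrative assumptions that need to be tuned to your model and workload, not recommendations from the maintainers.

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
             --output_dir ./tmp/llama/7B/trt_engines/fp16/1-gpu \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --max_batch_size 64 \
             --max_input_len 2048 \
             --max_output_len 512 \
             --max_num_tokens 8192 \
             --workers 1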
lemousehunter commented 7 months ago

Hey, I believe the devs have some best practices (or suggested optimizations) listed here.
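
As a quick way to see which settings a given build actually ended up with, you can inspect the model config that trtllm-build writes alongside the engine (assuming, as in the LLaMA example above, that a config.json is produced in --output_dir):

python -c "import json, pprint; pprint.pprint(json.load(open('./tmp/llama/7B/trt_engines/fp16/1-gpu/config.json')))"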

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 15 days with no activity.