NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen1.5 Model 'tensorrt_llm' loading failed with error: key 'use_context_fmha_for_generation' not found #1812


HowardChenRV commented 1 week ago

System Info

Who can help?

@kaiyux @byshiue

Information

Tasks

Reproduction

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
cd examples/qwen
pip install -r requirements.txt

python convert_checkpoint.py --model_dir /share/datasets/public_models/Qwen_Qwen1.5-72B-Chat/ \
    --output_dir /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/checkpoint/qwen1.5_72b_chat_tllm_checkpoint_4gpu_tp4 \
    --dtype float16 \
    --tp_size 4 \
    --workers 4

trtllm-build --checkpoint_dir /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/checkpoint/qwen1.5_72b_chat_tllm_checkpoint_4gpu_tp4 \
    --output_dir /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/Qwen_Qwen1.5-72B-Chat_TP4/ \
    --gemm_plugin float16 \
    --workers 4
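After the build finishes, I just list the output directory to confirm it contains config.json plus one engine per TP rank. This is a quick sanity-check sketch, not part of the repro; the path is the --output_dir from the command above:

```python
# Sketch: confirm the trtllm-build output directory contains config.json
# plus the per-rank engine files (path taken from my --output_dir above).
from pathlib import Path

engine_dir = Path("/share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/Qwen_Qwen1.5-72B-Chat_TP4")
for p in sorted(engine_dir.iterdir()):
    size = f"{p.stat().st_size / 1e9:.2f} GB" if p.is_file() else "(dir)"
    print(p.name, size)
```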

docker run --rm -it --gpus all --net host --shm-size=2g \
    --ulimit stack=67108864 \
    -v /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/Qwen_Qwen1.5-72B-Chat_TP4/tensorrtllm_backend:/tensorrtllm_backend \
    -v /share/datasets/public_models/Qwen_Qwen1.5-72B-Chat:/share/datasets/public_models/Qwen_Qwen1.5-72B-Chat \
    nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash

cd /tensorrtllm_backend

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo
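To confirm the server actually comes up, I probe Triton's HTTP readiness endpoints. This is just a sketch (not part of the original repro) and assumes the default HTTP port 8000 on the same host:

```python
# Sketch: check Triton server and model readiness via the standard v2 HTTP API.
# Assumes the default HTTP port 8000 on localhost.
import requests

# Returns HTTP 200 once the server is up and all models loaded successfully.
resp = requests.get("http://localhost:8000/v2/health/ready", timeout=5)
print("server ready" if resp.status_code == 200 else f"server not ready (HTTP {resp.status_code})")

# Per-model readiness for the model that fails to load in this issue.
resp = requests.get("http://localhost:8000/v2/models/tensorrt_llm/ready", timeout=5)
print("tensorrt_llm ready" if resp.status_code == 200 else f"tensorrt_llm not ready (HTTP {resp.status_code})")
```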

Expected behavior

I would expect the TensorRT engine to work with the Triton Inference Server.

Actual behavior

(screenshot attachment not rendered; the error log is reproduced in a follow-up comment below)

Additional notes

Triton Inference Server used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Model used: Qwen_Qwen1.5-72B-Chat

HowardChenRV commented 1 week ago

Update: used tensorrt_llm version 0.10.0 to convert the checkpoints and compile the model.
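For reference, this is how I confirm which wheels were actually picked up for conversion and build (just a sanity-check sketch):

```python
# Sketch: print the versions of the installed TensorRT-LLM and TensorRT wheels
# in the environment used for convert_checkpoint.py and trtllm-build.
import tensorrt
import tensorrt_llm

print("tensorrt_llm:", tensorrt_llm.__version__)
print("tensorrt:    ", tensorrt.__version__)
```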

hijkzzz commented 6 days ago

could you try pip install tensorrt_llm==0.11.0.dev2024061800?

HowardChenRV commented 4 days ago

> could you try pip install tensorrt_llm==0.11.0.dev2024061800?

I've got the same issue with tensorrt_llm==0.11.0.dev2024061800

[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024061800 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_output_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_output_len' not found
E0624 09:58:34.138728 386 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
E0624 09:58:34.138787 386 model_lifecycle.cc:641] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
I0624 09:58:34.138805 386 model_lifecycle.cc:776] "failed to load 'tensorrt_llm'"
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
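To narrow this down, here is a small sketch that checks whether the keys the backend complains about appear anywhere in the built engine's config.json. The engine_dir is the one mounted into my Triton model repo; the key names are taken from the log above:

```python
# Sketch: look for the keys the Triton backend failed to read inside the
# engine's config.json (engine_dir path is from my setup; adjust as needed).
import json
from pathlib import Path

engine_dir = Path("/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1")
config = json.loads((engine_dir / "config.json").read_text())

def find_key(obj, key):
    """Recursively check whether `key` appears anywhere in the nested config."""
    if isinstance(obj, dict):
        return key in obj or any(find_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(find_key(v, key) for v in obj)
    return False

# Key names come from the error log above.
for key in ("use_context_fmha_for_generation", "max_output_len"):
    print(key, "->", "present" if find_key(config, key) else "missing")
```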

HowardChenRV commented 4 days ago

Update: here is my Triton Inference Server configuration, in case it helps.

python3 tools/fill_template.py --in_place \
    triton_model_repo/tensorrt_llm/config.pbtxt \
    engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,\
triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:guaranteed_no_evict

python tools/fill_template.py --in_place \
    triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:/share/datasets/public_models/Qwen_Qwen1.5-72B-Chat,\
tokenizer_type:sp,triton_max_batch_size:64,preprocessing_instance_count:1,add_special_tokens:True

python tools/fill_template.py --in_place \
    triton_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:/share/datasets/public_models/Qwen_Qwen1.5-72B-Chat,\
tokenizer_type:sp,triton_max_batch_size:64,postprocessing_instance_count:1

python tools/fill_template.py --in_place \
    triton_model_repo/ensemble/config.pbtxt \
    triton_max_batch_size:64

python tools/fill_template.py --in_place \
    triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
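When the models do load, I smoke-test the ensemble with a request like the one below. This is only a sketch: it assumes the default HTTP port 8000 and the text_input/max_tokens/text_output fields used by the tensorrtllm_backend ensemble; adjust if your setup differs:

```python
# Sketch: send one request to the ensemble model's generate endpoint.
# Assumes default HTTP port 8000 and the standard tensorrtllm_backend fields.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json().get("text_output"))
```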