Open HowardChenRV opened 1 week ago
Update: I used tensorrt_llm version 0.10.0 to convert the checkpoints and compile the model.
Could you try `pip install tensorrt_llm==0.11.0.dev2024061800`?
I've got the same issue with tensorrt_llm==0.11.0.dev2024061800
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024061800 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_output_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_output_len' not found
E0624 09:58:34.138728 386 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
E0624 09:58:34.138787 386 model_lifecycle.cc:641] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
I0624 09:58:34.138805 386 model_lifecycle.cc:776] "failed to load 'tensorrt_llm'"
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
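The missing-key errors suggest the backend is reading an engine config.json written with a different schema than it expects. A minimal sketch for checking which of the disputed keys your engine config actually contains (the sample `config` dict and the commented-out path are assumptions for illustration; load your real config.json in practice):

```python
import json

# In the container you would load the real file, e.g.:
# with open("/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/config.json") as f:
#     config = json.load(f)
config = json.loads("""
{
    "version": "0.11.0.dev2024061800",
    "build_config": {"max_input_len": 1024, "max_batch_size": 64}
}
""")

# The keys the backend complained about in the log above.
expected_keys = ["max_output_len", "use_context_fmha_for_generation"]

def find_key(obj, key):
    """Recursively search a nested JSON structure for a key."""
    if isinstance(obj, dict):
        if key in obj:
            return True
        return any(find_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(find_key(v, key) for v in obj)
    return False

missing = [k for k in expected_keys if not find_key(config, k)]
print("missing keys:", missing)
```

If the keys the backend wants are missing, the engine and the backend were almost certainly built against different tensorrt_llm versions.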
Update: here is my Triton Inference Server configuration, in case it helps.
python3 tools/fill_template.py --in_place \
    triton_model_repo/tensorrt_llm/config.pbtxt \
    engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,\
triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:guaranteed_no_evict

python tools/fill_template.py --in_place \
    triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:/share/datasets/public_models/Qwen_Qwen1.5-72B-Chat,\
tokenizer_type:sp,triton_max_batch_size:64,preprocessing_instance_count:1,add_special_tokens:True

python tools/fill_template.py --in_place \
    triton_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:/share/datasets/public_models/Qwen_Qwen1.5-72B-Chat,\
tokenizer_type:sp,triton_max_batch_size:64,postprocessing_instance_count:1

python tools/fill_template.py --in_place \
    triton_model_repo/ensemble/config.pbtxt \
    triton_max_batch_size:64

python tools/fill_template.py --in_place \
    triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
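For context, fill_template.py substitutes `${name}` placeholders in the config.pbtxt files with the comma-separated `key:value` pairs given on the command line. A rough re-implementation of that substitution (my own sketch, not the script's actual code):

```python
import re

def fill_template(text, params):
    """Replace ${name} placeholders using a comma-separated key:value spec."""
    values = dict(pair.split(":", 1) for pair in params.split(","))
    # Unknown placeholders are left untouched; an unfilled placeholder is
    # what produces warnings like the "${skip_special_tokens}" ones above.
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: values.get(m.group(1), m.group(0)),
        text,
    )

pbtxt = 'max_batch_size: ${triton_max_batch_size}\n' \
        'parameters { value: "${skip_special_tokens}" }'
out = fill_template(pbtxt, "triton_max_batch_size:64")
print(out)
```

This is why every placeholder in each config.pbtxt needs a matching `key:value` pair: anything left as a literal `${...}` falls back to the backend's defaults at load time.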
System Info
Who can help?
@kaiyux @byshiue
Reproduction
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
cd examples/qwen
pip install -r requirements.txt
python convert_checkpoint.py --model_dir /share/datasets/public_models/Qwen_Qwen1.5-72B-Chat/ \
    --output_dir /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/checkpoint/qwen1.5_72b_chat_tllm_checkpoint_4gpu_tp4 \
    --dtype float16 \
    --tp_size 4 \
    --workers 4
trtllm-build --checkpoint_dir /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/checkpoint/qwen1.5_72b_chat_tllm_checkpoint_4gpu_tp4 \
    --output_dir /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/Qwen_Qwen1.5-72B-Chat_TP4/ \
    --gemm_plugin float16 \
    --workers 4
docker run --rm -it --gpus all --net host --shm-size=2g \
    --ulimit stack=67108864 \
    -v /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines/qwen/Qwen_Qwen1.5-72B-Chat_TP4/tensorrtllm_backend:/tensorrtllm_backend \
    -v /share/datasets/public_models/Qwen_Qwen1.5-72B-Chat:/share/datasets/public_models/Qwen_Qwen1.5-72B-Chat \
    nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo
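Since the log reports the engine version it found in the config file, a quick pre-launch sanity check is to compare that recorded version against the tensorrt_llm version inside the container before starting the server. A hedged sketch (the config path is an assumption based on the model repo layout above, and the sample `config` dict stands in for the real file):

```python
import json

def engine_version(config):
    """Return the builder version recorded in an engine's config.json."""
    return config.get("version", "unknown")

# In the container you would load the real file, e.g.:
# with open("/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/config.json") as f:
#     config = json.load(f)
config = {"version": "0.10.0"}

# e.g. from `import tensorrt_llm; tensorrt_llm.__version__` in the container
runtime_version = "0.11.0.dev2024061800"

if engine_version(config) != runtime_version:
    print(f"engine built with {engine_version(config)}, "
          f"runtime is {runtime_version}: rebuild the engine with the matching version")
```

If the two versions differ, rebuilding the engine with the same tensorrt_llm release that ships in the Triton image is usually the fix for the missing-key errors above.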
Expected behavior
I would expect the TensorRT engine to work with the Triton Inference Server.
Actual behavior
Additional notes
Triton Inference Server used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Model used: Qwen_Qwen1.5-72B-Chat