Closed. maozixi1 closed this issue 7 months ago.
Switch to TensorRT-LLM 0.8.0 for deployment.
> Switch to TensorRT-LLM 0.8.0 for deployment.

Thanks for the reply. After redeploying with 0.8.0, question answering fails. Request:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151643], "pad_id": [151643], "stream": true}'
```

Response (`text_output` stays empty):

```
data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":0.0,"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}
```

What could be causing this?
Are the deployment config files the ones from this project's triton_model_repo directory?
> Are the deployment config files the ones from this project's triton_model_repo directory?

Yes. I only edited the environment variables by hand because setting them as documented gave me an error; everything else is unchanged.

```bash
export HF_QWEN_MODEL="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_14b_chat_int4"
export ENGINE_DIR="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
export MAX_BATCH_SIZE=2
export TOKENIZE_TYPE=auto
export INSTANCE_COUNT=4
export GPU_DEVICE_IDS=0,1

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:${MAX_BATCH_SIZE},\
decoupled_mode:True,max_beam_width:1,\
engine_dir:${ENGINE_DIR},exclude_input_in_output:True,\
enable_kv_cache_reuse:False,batching_strategy:inflight_batching,\
max_queue_delay_microseconds:600
```
Do the other endpoints behave normally? What about non-stream mode? You could also try building a single-GPU engine and see whether that works.
File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py", line 320, in execute raise pb_utils.TritonModelException( c_python_backend_utils.TritonModelException: Model tensorrt_llm - Error when running inference: Model tensorrt_llm is using the decoupled. The current BLS request call doesn't support models using the decoupled transaction policy. Please use 'decoupled=True' argument to the 'exec' or 'async_exec' calls for decoupled models.' 外部调用时的报错,不知是否相关
Non-stream mode is the same:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151643], "pad_id": [151643]}'
```

```
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}
```
By the way, does run.py work for you? First make sure the engine can actually run.
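A typical smoke test with the example script looks roughly like this (the paths are placeholders and the flag names can differ between versions, so check `python3 run.py --help` first):

```bash
# Assumed paths; point these at your own engine and tokenizer directories.
python3 /root/examples/qwen2/run.py \
    --engine_dir /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1 \
    --tokenizer_dir ./Qwen1.5-14B-Chat-GPTQ-Int4 \
    --input_text "你好,你叫什么?" \
    --max_output_len 100
```

Note that a tp_size=2 engine generally has to be launched through MPI, e.g. `mpirun -n 2 --allow-run-as-root python3 run.py ...`; a plain single-process launch will not load a two-rank engine.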
> By the way, does run.py work for you? First make sure the engine can actually run.

It looks like it really can't run. Let me rebuild and try again.
> By the way, does run.py work for you? First make sure the engine can actually run.

Hi, I rebuilt it, and it still can't run (the single-GPU build works; the multi-GPU one doesn't):

```
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Traceback (most recent call last):
  File "/root/examples/qwen2/run.py", line 511, in <module>
```
@Tlntin I've tried it myself: with qwen1 this problem shows up in streaming mode, while non-stream mode works fine.
> @Tlntin I've tried it myself: with qwen1 this problem shows up in streaming mode, while non-stream mode works fine.

Did you test qwen1 with 0.7.0?
@Tlntin Yes. I followed your doc for the whole pipeline, calling via gRPC from the Python side; in stream mode the output contains garbled characters (�).
> @Tlntin Yes. I followed your doc for the whole pipeline, calling via gRPC from the Python side; in stream mode the output contains garbled characters (�).

Can you share a test case?
@Tlntin I printed the decoded content in both stream and non-stream mode and found that the token_ids at corresponding positions are different.
```python
import time
from functools import partial

import streamlit as st
from tritonclient.utils import InferenceServerException

# UserData, callback, triton_client, tokenizer, inputs and STREAMING
# are defined earlier in the script.
user_data = UserData()
triton_client.start_stream(callback=partial(callback, user_data))
triton_client.async_stream_infer('tensorrt_llm', inputs,
                                 request_id=st.session_state.request_id)
triton_client.stop_stream()

with st.chat_message("assistant"):
    output_text = ""
    placeholder = st.empty()
    while True:
        try:
            result = user_data._completed_requests.get(block=False)
        except Exception:
            break
        if isinstance(result, InferenceServerException):
            print("Received an error from server:")
            print(result)
        else:
            output = result.as_numpy('output_ids')
            if STREAMING:
                # Decode each streamed token chunk and append it.
                output_text += tokenizer.decode(output[0][0],
                                                skip_special_tokens=True)
                placeholder.markdown(output_text)
                time.sleep(0.01)
            else:
                output_text = tokenizer.decode(output[0][0],
                                               skip_special_tokens=True)
                st.markdown(output_text)
```
Oh, so you're hooking tritonserver up to langchain yourself. You need to handle the per-token decoding logic there: if a single token decodes to mojibake, you have to wait for the next token before it can be decoded. See the reference code.
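A minimal sketch of that buffering logic, assuming a Hugging Face tokenizer (an illustration, not the linked reference code): keep appending token ids to a buffer, decode the whole buffer, and only emit text once it no longer ends in the UTF-8 replacement character.

```python
class IncrementalDecoder:
    """Buffer token ids until they decode to complete UTF-8 text."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.token_ids = []
        self.emitted = 0  # number of characters already emitted

    def push(self, token_id):
        self.token_ids.append(token_id)
        text = self.tokenizer.decode(self.token_ids,
                                     skip_special_tokens=True)
        # A trailing "\ufffd" means a multi-byte character is still
        # incomplete; wait for the next token before emitting anything.
        if text.endswith("\ufffd"):
            return ""
        new_text = text[self.emitted:]
        self.emitted = len(text)
        return new_text
```

In the streaming loop above, `output_text += tokenizer.decode(output[0][0], ...)` would then become something like `output_text += decoder.push(token_id)` for each new id, so partial characters are held back instead of being rendered as mojibake.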
@Tlntin I can't thank you enough. Hats off to you once again!
Model: Qwen-1.5-14B-Chat-GPTQ-Int4
Environment: tensorrt-llm 0.7.0, torch==2.1.0, triton==2.1.0, transformers==4.39.1

Build command (run from the qwen2 directory):

```bash
python build.py --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --hf_model_dir Qwen1.5-14B-Chat-GPTQ-Int4 \
    --quant_ckpt_path Qwen1.5-14B-Chat-GPTQ-Int4 \
    --world_size=2 \
    --tp_size=2 \
    --use_inflight_batching \
    --paged_kv_cache
```

Deploy command:

```bash
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo
```
Request:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate \
```

Output:

```
{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"可见�� Jan时 lordDI现象 E蹀全体 manage眼看Smarteniable这张 pont可顺 Community在职另一当时can统一 。 back自制-形成的一声 Jan时候 Die两项产生了醌Can这张-这张 管理就有受到第三辛苦一笔 no社区.\总一分角落-D (;;)Ending心血提交油烟痞报告 Bronze白色的鬈管理特别Can仅受到了特别这张 commun挟存在一定时代另一agoon以上的公认的管理 reports两地 reportsthat manage的一个 no欲望物 jan士intent幻想 DP呱eniable INCIDENT管理十五 的"}
```

Could you tell me what might be causing this?