Tlntin / Qwen-TensorRT-LLM


Triton deployment produces garbled output #101

Closed maozixi1 closed 7 months ago

maozixi1 commented 7 months ago

Model used: Qwen-1.5-14B-Chat-GPTQ-Int4
Environment: tensorrt-llm 0.7.0, torch==2.1.0, triton==2.1.0, transformers==4.39.1

Build command (run from the qwen2 directory):

```bash
python build.py --use_weight_only \
  --weight_only_precision int4_gptq \
  --per_group \
  --hf_model_dir Qwen1.5-14B-Chat-GPTQ-Int4 \
  --quant_ckpt_path Qwen1.5-14B-Chat-GPTQ-Int4 \
  --world_size=2 \
  --tp_size=2 \
  --use_inflight_batching \
  --paged_kv_cache
```

Deployment command:

```bash
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo
```

Request:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151645], "pad_id": [151645]}'
```

Output:

```json
{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"可见�� Jan时 lordDI现象 E蹀全体 manage眼看Smarteniable这张 pont可顺 Community在职另一当时can统一 。 back自制-形成的一声 Jan时候 Die两项产生了醌Can这张-这张 管理就有受到第三辛苦一笔 no社区.\总一分角落-D (;;)Ending心血提交油烟痞报告 Bronze白色的鬈管理特别Can仅受到了特别这张 commun挟存在一定时代另一agoon以上的公认的管理 reports两地 reportsthat manage的一个 no欲望物 jan士intent幻想 DP呱eniable INCIDENT管理十五 的"}
```

Could you tell me what might be causing this?

Tlntin commented 7 months ago

Switch to TensorRT-LLM 0.8.0 for deployment.

maozixi1 commented 7 months ago

> Switch to TensorRT-LLM 0.8.0 for deployment.

Thanks for the reply. After redeploying with 0.8.0, the chat responses come back wrong:

Request:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
  -d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151643], "pad_id": [151643], "stream": true}'
```

Response:

```
data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":0.0,"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}

data: {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}
```

What could be causing this?

Tlntin commented 7 months ago

Are the deployment config files taken from this project's triton_model_repo directory?

maozixi1 commented 7 months ago

> Are the deployment config files taken from this project's triton_model_repo directory?

Yes. The only change is that setting the environment variables raised an error, so I adjusted them by hand; everything else is the same.

```bash
export HF_QWEN_MODEL="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_14b_chat_int4"
export ENGINE_DIR="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
export MAX_BATCH_SIZE=2
export TOKENIZE_TYPE=auto
export INSTANCE_COUNT=4
export GPU_DEVICE_IDS=0,1
```

```bash
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
triton_max_batch_size:${MAX_BATCH_SIZE},\
decoupled_mode:True,max_beam_width:1,\
engine_dir:${ENGINE_DIR},exclude_input_in_output:True,\
enable_kv_cache_reuse:False,batching_strategy:inflight_batching,\
max_queue_delay_microseconds:600
```

Tlntin commented 7 months ago

Do the other endpoints work? What about non-stream mode? You could also try a single-GPU build and see whether that works.

maozixi1 commented 7 months ago

File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py", line 320, in execute raise pb_utils.TritonModelException( c_python_backend_utils.TritonModelException: Model tensorrt_llm - Error when running inference: Model tensorrt_llm is using the decoupled. The current BLS request call doesn't support models using the decoupled transaction policy. Please use 'decoupled=True' argument to the 'exec' or 'async_exec' calls for decoupled models.' 外部调用时的报错,不知是否相关

maozixi1 commented 7 months ago

Non-stream mode gives the same result:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151643], "pad_id": [151643]}'
```

```json
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}
```

Tlntin commented 7 months ago

By the way, does run.py work for you? First make sure the engine itself can run.

maozixi1 commented 7 months ago

> By the way, does run.py work for you? First make sure the engine itself can run.

It seems it really can't run. I'm recompiling now.

maozixi1 commented 7 months ago

> By the way, does run.py work for you? First make sure the engine itself can run.

Hi, I rebuilt the engine and it still won't run (the single-GPU build works, the multi-GPU one doesn't):

```
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Traceback (most recent call last):
  File "/root/examples/qwen2/run.py", line 511, in <module>
    main(args)
  File "/root/examples/qwen2/run.py", line 386, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 139, in from_dir
    world_config = WorldConfig.mpi(tensor_parallelism=tp_size,
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/home/jenkins/agent/workspace/LLM/release-0.8/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:90)
```

It looks like a tp problem, but run.py doesn't appear to take a tp argument. Running `mpirun -n 2 --allow-run-as-root python3 run.py` doesn't work either.

liyunhan commented 7 months ago

@Tlntin I've tried it myself: qwen1 shows this problem in streaming mode, while non-stream mode works fine.

Tlntin commented 7 months ago

> @Tlntin I've tried it myself: qwen1 shows this problem in streaming mode, while non-stream mode works fine.

Did you test qwen1 with 0.7.0?

liyunhan commented 7 months ago

@Tlntin Yes. The whole workflow follows your doc; I call it over gRPC from Python, and in stream mode the output contains ��.

Tlntin commented 7 months ago

> @Tlntin Yes. The whole workflow follows your doc; I call it over gRPC from Python, and in stream mode the output contains ��.

Can you share the test case?

liyunhan commented 7 months ago

@Tlntin I printed the decoded output in both stream and non-stream modes and found that the token_ids at the corresponding positions are different.

liyunhan commented 7 months ago

(Screenshots: Test 1 and Test 2 — images not preserved)

```python
user_data = UserData()
triton_client.start_stream(callback=partial(callback, user_data))
triton_client.async_stream_infer('tensorrt_llm', inputs, request_id=st.session_state.request_id)
triton_client.stop_stream()

with st.chat_message("assistant"):
    output_text = ""
    placeholder = st.empty()
    while True:
        try:
            result = user_data._completed_requests.get(block=False)
        except Exception:
            break
        if type(result) == InferenceServerException:
            print("Received an error from server:")
            print(result)
        else:
            output = result.as_numpy('output_ids')
            if STREAMING:
                output_text += tokenizer.decode(output[0][0], skip_special_tokens=True)
                placeholder.markdown(output_text)
                time.sleep(0.01)
            else:
                output_text = tokenizer.decode(output[0][0], skip_special_tokens=True)
                st.markdown(output_text)
```

Tlntin commented 7 months ago

Oh, you're wiring tritonserver into langchain. You need to handle the single-token case here: if a single token decodes to garbled characters, you have to wait for the next token before decoding. See the reference code.
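To illustrate the idea (this is only a sketch of the buffering trick, not the repo's reference code, and the tokenizer path is a placeholder): hold streamed token ids in a buffer and only flush text once it no longer ends with the U+FFFD replacement character, so Chinese characters split across multiple tokens decode correctly.

```python
from transformers import AutoTokenizer

# Placeholder path: point this at the same HF model dir used for the deployment.
tokenizer = AutoTokenizer.from_pretrained("Qwen1.5-14B-Chat-GPTQ-Int4")

pending_ids = []   # streamed token ids not yet emitted as text
output_text = ""

def on_token(token_id: int) -> str:
    """Feed one streamed token id; return any newly completed text."""
    global output_text
    pending_ids.append(token_id)
    chunk = tokenizer.decode(pending_ids, skip_special_tokens=True)
    if chunk.endswith("\ufffd"):
        # The buffer ends midway through a multi-byte character,
        # so wait for the next token before emitting anything.
        return ""
    pending_ids.clear()
    output_text += chunk
    return chunk
```

In the Streamlit loop above, this would replace the direct per-token `tokenizer.decode(output[0][0], ...)` call, feeding it each streamed token id instead.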

liyunhan commented 7 months ago

@Tlntin I won't say more — thank you so much!