@liyu-tan Inflight batching should already support stop_words. Can you share concrete steps to reproduce your issue?
June
I also hit this problem with CodeLlama in the C++ runtime when I use inflight batching with Triton Server: https://github.com/NVIDIA/TensorRT-LLM/issues/90. Is the problem in the inflight batching module? I could not see the code.
Hi @yoyopdc,
We need a proper way to reproduce the issue using the main branch. Can you share that with us, please? Without a proper reproducer, we cannot investigate the issue.
Thanks, Julien
Yes, I'm willing to reproduce the issue, but the issue happens in tensorrtllm_backend + TensorRT-LLM. I downloaded the main branch, but I ran into some problems compiling tensorrtllm_backend against TensorRT-LLM main; after I fix those, I will reproduce it. PS: I can't bear the long build time. If you know a way to build the project faster, could you share it? I have 256 threads, but it is still very slow.
This issue is related: https://github.com/triton-inference-server/tensorrtllm_backend/issues/57
Regarding "Inflight batching should already support stop_words": the STOP_WORD_IDS are computed from stop_words in the preprocessor, but they are neither fetched from there nor passed to tensorrt_llm. I don't know whether they are left out by mistake or whether the backend itself does not support them as an input yet.
To my knowledge, the only workaround currently is to send end_id explicitly, but this way you can only specify one stop token.
PS: Given that the inputs of tensorrt_llm don't include anything resembling stop tokens, they are most probably not supported by the backend.
Yes, I also use end_id now!
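For reference, here is a minimal sketch of what the end_id workaround looks like from a Triton gRPC client, with the stop_words input included for when the backend forwards it to tensorrt_llm. The model name "ensemble", the input/output tensor names, and the dtypes are assumptions taken from the default tensorrtllm_backend example configs, so check them against your own config.pbtxt.

```python
# Minimal sketch, assuming the default tensorrtllm_backend "ensemble" model and
# its usual tensor names (text_input, max_tokens, end_id, stop_words, text_output).
# Dtypes follow the newer example configs; some older releases use UINT32 instead
# of INT32 for the scalar inputs - verify against your config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

def make_input(name, data, dtype):
    tensor = grpcclient.InferInput(name, list(data.shape), dtype)
    tensor.set_data_from_numpy(data)
    return tensor

inputs = [
    make_input("text_input", np.array([["Write one sentence."]], dtype=object), "BYTES"),
    make_input("max_tokens", np.array([[100]], dtype=np.int32), "INT32"),
    # The workaround discussed above: force a single stop token via end_id
    # (2 is the </s> id for Llama-style tokenizers; use your model's eos id).
    make_input("end_id", np.array([[2]], dtype=np.int32), "INT32"),
    # The input we would like to rely on once the backend forwards it:
    make_input("stop_words", np.array([["\n", "."]], dtype=object), "BYTES"),
]

result = client.infer("ensemble", inputs)
print(result.as_numpy("text_output"))
```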
Do you still encounter the issue on the latest main branch?
Here is a new document demonstrating how to use stop words in Triton.
Closing this bug because the issue is inactive. Feel free to ask here if you still have a question or issue, and we will reopen it.
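As background for the document mentioned above: the stop words ultimately reach the runtime as a small integer tensor rather than as strings. Below is a hedged sketch of that layout, a [2, N] array whose first row holds the concatenated stop-word token ids and whose second row holds cumulative offsets padded with -1. The helper name and exact padding convention are assumptions based on the to_word_list_format utility in the backend, so verify against your release.

```python
# Hedged sketch of the [2, N] "word list" layout TensorRT-LLM uses for stop words
# (mirrors the backend's to_word_list_format helper as I understand it).
import numpy as np

def to_word_list(word_token_ids):
    """word_token_ids: one list of token ids per stop word."""
    flat, offsets, total = [], [], 0
    for ids in word_token_ids:
        flat.extend(ids)
        total += len(ids)
        offsets.append(total)                     # cumulative end offset of each word
    offsets += [-1] * (len(flat) - len(offsets))  # pad the offsets row to match the ids row
    return np.array([[flat, offsets]], dtype=np.int32)  # shape [1, 2, N]

# Hypothetical example: "." -> [29889] and "\n" -> [13] for a Llama tokenizer
print(to_word_list([[29889], [13]]))
# [[[29889    13]
#   [    1     2]]]
```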
Hello, where is the document? We are testing vanilla Llama 2 70B with version 0.6.1. It ignores stop words like "." and "\n".
We think it might be related to this https://github.com/triton-inference-server/tensorrtllm_backend/issues/47
I'm not sure if this is a bug, but this is how I solved it: you can change the model.py file and set legacy to true (by default it is false).
self.tokenizer = LlamaTokenizer.from_pretrained(tokenizer_dir, legacy=True, padding_side='left')
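For context, a minimal sketch of where that change lives, assuming the stock preprocessing model from tensorrtllm_backend; the path and surrounding code are assumptions, and the only actual change is legacy=True:

```python
# Assumed location: all_models/inflight_batcher_llm/preprocessing/1/model.py,
# inside initialize(). The legacy flag changes how the SentencePiece tokenizer
# handles pieces such as "\n" and ".", which in turn changes the stop-word ids
# the preprocessor produces; the stock file is reported above to pass legacy=False.
from transformers import LlamaTokenizer

tokenizer_dir = "/path/to/llama/tokenizer"  # point this at your model's tokenizer files

tokenizer = LlamaTokenizer.from_pretrained(
    tokenizer_dir,
    legacy=True,           # the change suggested above
    padding_side="left",
)
```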
Hi, I also encountered the same problem. I build with tensorrt-llm release/0.5.0 and tensorrtllm_backend release/0.5.0, and serve with the 23.10-trtllm-python-py3 docker image. My deployment of chatglm3-6b is complete, and I send a request like this:
(glm) aiadmin@zl-gpu03:~/LLM/tensorrtllm_backend-release-0.5.0$ python tools/inflight_batcher_llm/end_to_end_streaming_client.py -u "0.0.0.0:8001" -p "员工已成功提交了一个“请假”申请单,请用温馨的话语对用户进行回复" -S -o 100
FLAGS: Namespace(verbose=False, url='0.0.0.0:8001', prompt='员工已成功提交了一个“请假”申请单,请用温馨的话语对用户进行回复', streaming=True, protocol='grpc', output_len=100)
:
尊敬的用户,您已成功提交了请假申请,非常感谢您的配合!我们期待您的假期能够愉快、轻松,让您充分休息,放松身心。请您放心,我们会尽快处理您的请假申请,确保您的权益得到保障。再次感谢您的支持与理解!
您好!我是您的人工智能助手。请问有什么我可以帮您解答的问题吗?
您好!我是您的人工智能助手。请问有什么我可以帮您
It cannot stop until the response reaches max_tokens (the prompt asks for a warm reply to an employee's leave-request submission; the model answers it and then keeps generating unrelated assistant greetings). I want to know how to set the stop words or other parameters. Could you please give me a detailed explanation? I've been troubled by this for a long time.
Please try the latest main branch; it is fixed there. You also need to make sure you set up the stop_words correctly: you can print the stop-word ids in the preprocessor and compare them with the output ids.
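A hedged sketch of the check suggested above: tokenize the stop words the same way the preprocessor would and look for those ids in the generated output. The tokenizer path and the example output_ids are placeholders; mirror whatever settings (legacy flag, add_special_tokens) your preprocessing model.py actually uses, or the ids will not match.

```python
# Debugging sketch: compare the stop-word ids with the ids the model actually emits.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")  # placeholder path

stop_words = ["\n", "."]
stop_ids = [tokenizer.encode(w, add_special_tokens=False) for w in stop_words]
print("stop word ids:", stop_ids)
# Note: SentencePiece tokenizers often prepend a "▁" piece to standalone text, so the
# id for "." encoded on its own can differ from the id of "." inside a sentence;
# that mismatch is exactly what the legacy-flag workaround earlier in the thread targets.

# output_ids as returned for one request (placeholder values)
output_ids = [395, 29889, 13, 3492]
for word, ids in zip(stop_words, stop_ids):
    hit = any(output_ids[i:i + len(ids)] == ids
              for i in range(len(output_ids) - len(ids) + 1))
    print(f"stop word {word!r} -> ids {ids} found in output: {hit}")
```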
Still getting the errors in TensorRT-LLM 0.10.0 @byshiue
When I call inflight batching, I want to add additional stop words, but it seems it does not work. Is this a bug, or is the feature not ready yet?