NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Stop words are not working #228

Closed: liyu-tan closed this issue 11 months ago

liyu-tan commented 1 year ago

When I use in-flight batching and add additional stop words, they don't seem to take effect. Is this a bug, or is the feature not ready yet?

juney-nvidia commented 1 year ago

@liyu-tan Inflight batching should already support stop_words. Can you share the concrete steps to reproduce the issue?

June
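
For reference, here is a minimal sketch of how a client can pass stop words to the tensorrtllm_backend ensemble over gRPC. The input names (text_input, max_tokens, stop_words) follow the default ensemble config and may differ in your config.pbtxt, so treat this as an illustration rather than a drop-in script.

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

# Assumed local endpoint; input names follow the default tensorrtllm_backend
# ensemble config and may need to be adapted to your config.pbtxt.
client = grpcclient.InferenceServerClient("localhost:8001")

def make_input(name, data):
    # Wrap a numpy array (batch of one request) into a Triton InferInput.
    infer_input = grpcclient.InferInput(name, list(data.shape),
                                        np_to_triton_dtype(data.dtype))
    infer_input.set_data_from_numpy(data)
    return infer_input

inputs = [
    make_input("text_input", np.array([["Write one sentence about GPUs."]], dtype=object)),
    make_input("max_tokens", np.array([[64]], dtype=np.int32)),
    # One row of stop-word strings per request in the batch.
    make_input("stop_words", np.array([[".", "\n"]], dtype=object)),
]

result = client.infer("ensemble", inputs)
print(result.as_numpy("text_output"))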

yoyopdc commented 1 year ago

I also hit this problem with CodeLlama in the C++ runtime when using in-flight batching with Triton Server (https://github.com/NVIDIA/TensorRT-LLM/issues/90). Is the problem in the in-flight batching module? I cannot see that code.

jdemouth-nvidia commented 1 year ago

Hi @yoyopdc ,

We need a proper way to reproduce the issue using the main branch. Can you share that with us, please? Without a proper reproducer, we cannot investigate the issue.

Thanks, Julien

yoyopdc commented 1 year ago

Yes, I'm willing to reproduce the issue, but it happens in tensorrtllm_backend + TensorRT-LLM. I downloaded the main branch but ran into problems compiling tensorrtllm_backend against TensorRT-LLM main; once I fix that, I will reproduce it. PS: I can't bear the long build time. If you know a way to build the project faster, please share it. I have 256 threads but it is still very slow.

konodyuk commented 1 year ago

This issue is related: https://github.com/triton-inference-server/tensorrtllm_backend/issues/57

Inflight batching should already support stop_words.

The STOP_WORD_IDS are computed from stop_words in the preprocessor, but they are neither fetched from there nor passed to tensorrt_llm. I don't know whether they are omitted by mistake or whether the backend itself simply doesn't support them as an input yet.

To my knowledge, the only current workaround is to send end_id explicitly, but that way you can only specify one stop token.

PS: Given that the inputs of tensorrt_llm don't include anything resembling stop tokens, they are most probably not supported by the backend.
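
For context, the word-list tensor that the runtime consumes is a flattened [2, N] int32 layout: row 0 concatenates the token IDs of all stop words and row 1 stores the cumulative end offset of each word, with -1 padding. Below is a rough sketch of that conversion for a single request, assuming a Hugging Face tokenizer (the helper name is only illustrative).

import numpy as np

def to_word_list_format(stop_words, tokenizer):
    # Row 0: concatenated token IDs of every stop word.
    # Row 1: cumulative end offset of each word, padded with -1.
    ids, offsets = [], []
    for word in stop_words:
        token_ids = tokenizer.encode(word, add_special_tokens=False)
        ids.extend(token_ids)
        offsets.append(len(ids))
    width = max(len(ids), len(offsets))
    word_list = np.full((2, width), -1, dtype=np.int32)
    word_list[0, :len(ids)] = ids
    word_list[1, :len(offsets)] = offsets
    return word_list

# Example usage with a Llama tokenizer (path is a placeholder):
# from transformers import LlamaTokenizer
# tok = LlamaTokenizer.from_pretrained("/path/to/llama/tokenizer", legacy=True)
# print(to_word_list_format([".", "\n"], tok))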

yoyopdc commented 1 year ago

(quoting konodyuk's reply above about the end_id workaround)

Yes, I also use end_id now!

byshiue commented 1 year ago

Do you still encounter the issue on the latest main branch?

byshiue commented 11 months ago

Here is a new document demonstrating how to use stop words in Triton.

Closing this bug because the issue has gone inactive. Feel free to ask here if you still have a question or issue and we will reopen it.
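
If your Triton build exposes the generate endpoint, a quick way to check whether stop words take effect is a plain JSON request. This is a sketch assuming a local deployment with an ensemble model and the field names used in the tensorrtllm_backend examples; adjust the URL, model name, and fields to your setup.

import requests

payload = {
    "text_input": "List three GPU vendors:",
    "max_tokens": 64,
    "stop_words": ["\n\n", "."],
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
# The generation should halt at the first stop word instead of running to max_tokens.
print(resp.json().get("text_output"))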

yessenzhar commented 11 months ago

Hello, where is the document? We are testing vanilla Llama 2 70B with version 0.6.1. It ignores stop words like "." and "\n".

yessenzhar commented 11 months ago

We think it might be related to this https://github.com/triton-inference-server/tensorrtllm_backend/issues/47

Linzecong commented 11 months ago

(quoting yessenzhar's question above about ignored stop words)

I'm not sure if this is a bug, but this is how I solved it.

You can change the model.py file and set legacy to True; by default it is False.

https://github.com/triton-inference-server/tensorrtllm_backend/blob/3a61c37afcdc3d5d04796a89555605e713494031/all_models/gpt/preprocessing/1/model.py#L47

self.tokenizer = LlamaTokenizer.from_pretrained(tokenizer_dir, legacy=True, padding_side='left')
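
The legacy flag can matter because the non-legacy SentencePiece handling may tokenize short stop words such as "." or "\n" into different IDs than the ones that actually appear in the generated output, so the precomputed stop-word IDs never match. A quick way to inspect the difference (the tokenizer path is a placeholder):

from transformers import LlamaTokenizer

tokenizer_dir = "/path/to/llama/tokenizer"  # placeholder

for legacy in (True, False):
    tok = LlamaTokenizer.from_pretrained(tokenizer_dir, legacy=legacy)
    for word in (".", "\n"):
        ids = tok.encode(word, add_special_tokens=False)
        # Compare these IDs with the stop-word IDs the preprocessor produces.
        print(f"legacy={legacy} word={word!r} -> {ids}")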

Rockie-Liu commented 8 months ago

Hi, I also encountered the same problem. I build with tensorrt-llm release/0.5.0 and tensorrtllm_backend release/0.5.0, and serve with the 23.10-trtllm-python-py3 Docker image. After deploying chatglm3-6b, I send a request as below (the prompt asks the model to reply warmly to an employee who has submitted a leave request):

(glm) aiadmin@zl-gpu03:~/LLM/tensorrtllm_backend-release-0.5.0$ python tools/inflight_batcher_llm/end_to_end_streaming_client.py -u "0.0.0.0:8001" -p "员工已成功提交了一个“请假”申请单,请用温馨的话语对用户进行回复" -S -o 100
FLAGS: Namespace(verbose=False, url='0.0.0.0:8001', prompt='员工已成功提交了一个“请假”申请单,请用温馨的话语对用户进行回复', streaming=True, protocol='grpc', output_len=100)

尊敬的用户,您已成功提交了请假申请,非常感谢您的配合!我们期待您的假期能够愉快、轻松,让您充分休息,放松身心。请您放心,我们会尽快处理您的请假申请,确保您的权益得到保障。再次感谢您的支持与理解!

您好!我是您的人工智能助手。请问有什么我可以帮您解答的问题吗?
您好!我是您的人工智能助手。请问有什么我可以帮您

The response first answers the leave-request prompt and then keeps repeating a generic assistant greeting ("Hello! I am your AI assistant. Is there anything I can help you with?"); it cannot stop until it reaches max_tokens. I want to know how to set the stop words or other parameters. Could you please give me a detailed explanation? This has been troubling me for a long time.

byshiue commented 8 months ago

Please try the latest main branch; this is fixed there. You also need to make sure you set up stop_words correctly: you can print the stop word IDs in the preprocessor and compare them with the output IDs.
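
One way to do that comparison offline is to tokenize both the stop words and a sample of the generated text with the same tokenizer and check whether the stop-word IDs appear as a contiguous run in the output IDs. A rough sketch (the tokenizer path is a placeholder; paste a real completion into sample_output):

from transformers import AutoTokenizer

tokenizer_dir = "/path/to/your/tokenizer"  # placeholder
tok = AutoTokenizer.from_pretrained(tokenizer_dir)

stop_words = [".", "\n"]
sample_output = "Hello! I am your AI assistant."  # replace with real model output

output_ids = tok.encode(sample_output, add_special_tokens=False)
for word in stop_words:
    word_ids = tok.encode(word, add_special_tokens=False)
    found = any(output_ids[i:i + len(word_ids)] == word_ids
                for i in range(len(output_ids) - len(word_ids) + 1))
    print(f"stop word {word!r} -> ids {word_ids}, found in output: {found}")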

zengrh3 commented 3 months ago

Still getting the error with TensorRT-LLM 0.10.0 @byshiue