thendwk opened this issue 1 year ago
There's a bug: we don't pass the proper end token. That will be fixed when we push an update to the dev (main) branch at the end of this week. Sorry for the inconvenience.
OK, thanks for the quick response.
Have you tested the same prompt with transformers or other LLM inference frameworks? Endless output seems to be a common issue for CodeLlama-based models; I have tested codellama-7b/13b/33b and my own SFT CodeLlama models, and all of them exhibit the endless-output problem.
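For such a comparison, here is a minimal sketch using transformers (the checkpoint path and prompt are placeholders for illustration, not taken from this thread):

```python
# Hedged sketch: run the same prompt through Hugging Face transformers to
# check whether generation stops at the EOS token there.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "codellama/CodeLlama-7b-hf"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer("[INST] Write a bubble sort function. [/INST]", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```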
@jdemouth-nvidia Just to confirm: I know TRT-LLM pre-allocates a `self.output_ids` buffer and pre-fills it with EOS_TOKEN, and the output of `run.py` shows that it won't cut off the extra tokens even when generation is stopped by the EOS_TOKEN.
```
decode time cost: 0.9818031787872314
Output ids: tensor([[[ 518, 25580, 29962, 29871, 30919, 31076, 518, 29914, 25580, 29962,
            29871, 15043, 29991, 334, 3844, 5475, 29930, 739, 29915, 29879,
            7575, 304, 5870, 366, 29889, 1317, 727, 1554, 306, 508,
            1371, 366, 411, 470, 723, 366, 763, 304, 13563, 29973,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
            2, 2, 2, 2, 2, 2, 2, 2]]],
          device='cuda:0', dtype=torch.int32)
```
Input: "[INST] 你好 [/INST]"
Output: " Hello! *smiles* It's nice to meet you. Is there something I can help you with or would you like to chat?</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>"
But even if I set `max_output_len` to different values from 128 to 1024, the time cost of `decoder.decode()` doesn't increase much (the difference is only on the order of tens of milliseconds), since the outputs are the same except for the EOS_TOKEN padding. So I think the number of generated tokens is fixed? What do you mean by "don't pass the proper end token"?
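Until that is fixed, here is a minimal post-processing sketch (my own workaround under assumptions, not TRT-LLM's API): `output_ids` is the tensor printed by `run.py`, and `input_len` and `eos_token_id=2` are taken from the log above.

```python
# Rough workaround sketch (not the official fix): truncate the returned
# output_ids at the first EOS token before detokenizing, so the padded
# EOS tokens are not decoded into trailing </s> strings.
import torch

def trim_at_eos(output_ids: torch.Tensor, input_len: int, eos_token_id: int = 2) -> torch.Tensor:
    """Return the first beam's generated tokens, cut at the first EOS."""
    generated = output_ids[0, 0, input_len:]  # drop the prompt tokens
    eos_positions = (generated == eos_token_id).nonzero(as_tuple=True)[0]
    if eos_positions.numel() > 0:
        first_eos = int(eos_positions[0])
        generated = generated[:first_eos]     # keep only tokens before the first EOS
    return generated
```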
@gesanqiu, that's correct: the buffer is pre-filled, and `run.py` shows the entire sequence without stopping at the first EOS token. We plan to fix that in the future.
> There's a bug: we don't pass the proper end token. That will be fixed when we push an update to the dev (main) branch at the end of this week. Sorry for the inconvenience.
Hi @jdemouth, I'm using the latest code on the main branch, and this problem still seems to exist. The generation doesn't stop correctly.
I was facing this issue as well. I reproduced the behavior of the output being padded with `</s>` up to the max token count when using `run.py`.
I also used the same engine and the trt-llm backend to set up a Triton server. When I use `curl -X POST http://localhost/v2/models/ensemble/generate -d '{"text_input": "..", "max_tokens": 128, "top_k": 1}'` to generate text, it outputs unrelated text after where the generation should end, up to the specified max token count. I'm not sure if this is also due to this TRT-LLM bug. I'm currently able to avoid the issue by passing an "end_id" explicitly in the curl request, though.
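For reference, here is a hedged sketch of the same request in Python: the field names mirror the curl command above, `end_id=2` matches the EOS id seen in the earlier output tensor, and the `text_output` response key is an assumption about the ensemble's output name.

```python
# Sketch: call the Triton ensemble's generate endpoint with an explicit
# end_id so the server stops generation at the EOS token.
import requests

payload = {
    "text_input": "Write a bubble sort function in Python.",  # placeholder prompt
    "max_tokens": 128,
    "top_k": 1,
    "end_id": 2,  # explicit EOS id (Llama </s>)
}
resp = requests.post("http://localhost/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json().get("text_output"))  # response field name assumed
```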
Any update here? The same problem has persisted for a few weeks.
Try the latest main?
This isn't fixed yet; generation still doesn't stop when passing the eos_token.
Mark
Please try the latest main branch. Here is a document that demonstrates how to use stop words in Triton: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md. Hope it is helpful.
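As a rough illustration of the stop-word approach (a sketch only; the `stop_words` field name is an assumption based on the linked document, not a verified API):

```python
# Hedged sketch: pass a stop-word list to the Triton ensemble so the server
# cuts generation when "</s>" is produced.
import requests

payload = {
    "text_input": "Write a bubble sort function in Python.",  # placeholder prompt
    "max_tokens": 128,
    "top_k": 1,
    "stop_words": ["</s>"],  # strings at which generation should stop
}
resp = requests.post("http://localhost/v2/models/ensemble/generate", json=payload)
print(resp.json().get("text_output"))  # response field name assumed
```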
Hi @thendwk, do you still have any further issues or questions? If not, we'll close this soon.
model: codellama/CodeLlama-7b-Python-hf

build code:

```
python build.py --model_dir /docker_storage/CodeLlama-7b-Python-hf/ --dtype float16 \
    --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 \
    --enable_context_fmha --output_dir codellama_7b --rotary_base 1000000 --vocab_size 32016
```

run code:

```
python run.py --max_output_len=512 --tokenizer_dir /docker_storage/CodeLlama-7b-Python-hf/ --engine_dir codellama_7b --input_text "# language: Python\n# write a bubble sort function\n"
```
output:

```
Output: "\ndef bubble_sort(array):\n n = len(array)\n for i in range(n):\n swapped = False\n for j in range(0, n-i-1):\n if array[j] > array[j + 1]:\n array[j], array[j + 1] = array[j + 1], array[j]\n swapped = True\n if swapped == False:\n return array\n\n\n# test the function\narray = [6, 20, 8, 19, 56, 23, 87, 49, 41, 54]\nprint(bubble_sort(array))"
```

expected result:

```
"\ndef bubble_sort(array):\n n = len(array)\n for i in range(n):\n swapped = False\n for j in range(0, n-i-1):\n if array[j] > array[j + 1]:\n array[j], array[j + 1] = array[j + 1], array[j]\n swapped = True\n if swapped == False:\n return array\n\n\n# test the function\narray = [6, 20, 8, 19, 56, 23, 87, 49, 41, 54]\nprint(bubble_sort(array))"
```

So, the generation does not seem to stop correctly.
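One sanity check worth running here (a hedged sketch, not part of the commands above) is to confirm which EOS id the CodeLlama tokenizer actually uses, so the same id can be supplied as the end token when running the engine:

```python
# Print the tokenizer's EOS token and id; the path comes from the
# build/run commands above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/docker_storage/CodeLlama-7b-Python-hf/")
print("eos_token:", tokenizer.eos_token, "eos_token_id:", tokenizer.eos_token_id)
```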