NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

generation does not stop correctly #121

Open thendwk opened 11 months ago

thendwk commented 11 months ago

model: codellama/CodeLlama-7b-Python-hf

build code:

    python build.py --model_dir /docker_storage/CodeLlama-7b-Python-hf/ --dtype float16 \
        --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 \
        --enable_context_fmha --output_dir codellama_7b --rotary_base 1000000 --vocab_size 32016

run code:

    python run.py --max_output_len=512 --tokenizer_dir /docker_storage/CodeLlama-7b-Python-hf/ \
        --engine_dir codellama_7b --input_text "# language: Python\n# write a bubble sort function\n"

output:

    Output: "\ndef bubble_sort(array):\n n = len(array)\n for i in range(n):\n swapped = False\n for j in range(0, n-i-1):\n if array[j] > array[j + 1]:\n array[j], array[j + 1] = array[j + 1], array[j]\n swapped = True\n if swapped == False:\n return array\n\n\n# test the function\narray = [6, 20, 8, 19, 56, 23, 87, 49, 41, 54]\nprint(bubble_sort(array))

    def test_get_code_snippet_from_file(self):
        code_snippet = get_code_snippet_from_file(self.file_path)
        self.assertEqual(code_snippet, self.code_snippet)

    def test_get_code_snippet_from_file_with_invalid_file_path(self):
        with self.assertRaises(FileNotFoundError):
            get_code_snippet_from_file("invalid_file_path")

    def test_get_code_snippet_from_file_with_invalid_file_extension(self):
        with self.assertRaises(ValueError):
            get_code_snippet_from_file("invalid_file_extension.txt")

    def test_get_code_snippet_from_file_with_invalid_file_content(self):
        with self.assertRaises(ValueError):
            get_code_snippet_from_file(self.invalid_file_path)

    def test_get_code_snippet_from_file_with_invalid_file_content_with_no_code(self):
        with self.assertRaises(ValueError):
            get_code_snippet_from_file(self.invalid_file_path_with_no_code)

    def"

expected result: "\ndef bubble_sort(array):\n n = len(array)\n for i in range(n):\n swapped = False\n for j in range(0, n-i-1):\n if array[j] > array[j + 1]:\n array[j], array[j + 1] = array[j + 1], array[j]\n swapped = True\n if swapped == False:\n return array\n\n\n# test the function\narray = [6, 20, 8, 19, 56, 23, 87, 49, 41, 54]\nprint(bubble_sort(array))"

So, the generation does not seem to stop correctly.

jdemouth-nvidia commented 11 months ago

There’s a bug. We don’t pass the proper end token, that is going to be fixed when we push an update to the dev (main) branch at the end of this week. Sorry for the inconvenience.
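For context, "passing the proper end token" means handing the runtime the tokenizer's EOS id so decoding can stop on it. A minimal sketch of what that wiring looks like, assuming the SamplingConfig API from tensorrt_llm.runtime (as used in the example run.py scripts of this era); the tokenizer path is the one from the build command above:

    # Sketch only: wire the tokenizer's EOS id into the sampling config.
    # Assumes tensorrt_llm.runtime.SamplingConfig as used by the example run.py scripts.
    from transformers import AutoTokenizer
    from tensorrt_llm.runtime import SamplingConfig

    tokenizer = AutoTokenizer.from_pretrained("/docker_storage/CodeLlama-7b-Python-hf/")

    sampling_config = SamplingConfig(
        end_id=tokenizer.eos_token_id,  # </s>, id 2 for Llama/CodeLlama tokenizers
        pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )
    # sampling_config is then passed to the generation session's decode() call, as in run.py.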

thendwk commented 11 months ago

> There’s a bug. We don’t pass the proper end token, that is going to be fixed when we push an update to the dev (main) branch at the end of this week. Sorry for the inconvenience.

OK, thanks for the quick response.

gesanqiu commented 11 months ago

Have you tested the same prompt with transformers or other LLM inference frameworks? Endless output seems to be a common issue for CodeLlama-based models; I have tested codellama-7b/13b/33b and my own SFT CodeLlama models, and all of them have the endless-output problem.

@jdemouth-nvidia Just to confirm: I know TRT-LLM pre-allocates a self.output_ids buffer and prefills it with EOS_TOKEN, and the output of run.py shows that it won't cut off the extra tokens even when the generation is stopped by the EOS_TOKEN.

decode time cost: 0.9818031787872314
Output ids: tensor([[[  518, 25580, 29962, 29871, 30919, 31076,   518, 29914, 25580, 29962,
          29871, 15043, 29991,   334,  3844,  5475, 29930,   739, 29915, 29879,
           7575,   304,  5870,   366, 29889,  1317,   727,  1554,   306,   508,
           1371,   366,   411,   470,   723,   366,   763,   304, 13563, 29973,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
              2,     2,     2,     2,     2,     2,     2,     2]]],
       device='cuda:0', dtype=torch.int32)
Input: "[INST] 你好 [/INST]"
Output: "  Hello! *smiles* It's nice to meet you. Is there something I can help you with or would you like to chat?</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>"

But even when I set max_output_len to different values from 128 to 1024, the time cost of decoder.decode() doesn't increase much (the difference is only on the order of ten milliseconds), since the outputs are the same except for the padded EOS_TOKEN. So I think the number of generated tokens is fixed? And what do you mean by "don’t pass the proper end token"?

jdemouth-nvidia commented 11 months ago

@gesanqiu, that's correct: the buffer is pre-filled and run.py shows the entire sequence without stopping at the 1st EOS token. We plan to fix that in the future.
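Until that fix lands, one possible client-side workaround is to truncate the returned ids at the first EOS before detokenizing. A minimal sketch (not TRT-LLM code; it assumes the Llama/CodeLlama EOS id of 2 and a 1-D tensor of ids like the one printed above):

    import torch

    def trim_at_first_eos(output_ids: torch.Tensor, input_length: int, eos_token_id: int = 2) -> torch.Tensor:
        """Return only the generated tokens, cut at the first EOS (exclusive)."""
        generated = output_ids[input_length:]  # drop the prompt tokens
        eos_positions = (generated == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.numel() > 0:
            generated = generated[: eos_positions[0]]  # keep everything before the first EOS
        return generated

    # e.g. for the [1, 1, seq_len] tensor shown above:
    # clean_ids = trim_at_first_eos(output_ids[0, 0], input_length=input_ids.shape[-1])
    # text = tokenizer.decode(clean_ids)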

thendwk commented 11 months ago

> There’s a bug. We don’t pass the proper end token, that is going to be fixed when we push an update to the dev (main) branch at the end of this week. Sorry for the inconvenience.

Hi jdemouth, I am using the latest code on the main branch, and this problem still seems to exist. The generation doesn't stop correctly.

[screenshot attached]

sc-gr commented 11 months ago

I was facing this issue as well. When using run.py, I reproduced the behavior of appending end-of-sequence (</s>) tokens until the max token count is reached.

I also used the same engine and the trt-llm backend to set up a Triton server. When I use curl -X POST http://localhost/v2/models/ensemble/generate -d '{"text_input": "..", "max_tokens": 128, "top_k": 1}' to generate text, it outputs unrelated text after the point where the generation should end, up to the specified max token count. I'm not sure whether this is also due to this TRT-LLM bug. I'm currently able to avoid the issue by passing an "end_id" explicitly in the curl request, though.
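For reference, a sketch of that workaround as a Python request (the field names are the ones from the curl command above plus the "end_id" mentioned; the port, prompt, and end_id value of 2 are assumptions for a CodeLlama-style tokenizer):

    import requests

    # Assumed Triton HTTP port 8000; adjust to your deployment.
    payload = {
        "text_input": "# write a bubble sort function\n",
        "max_tokens": 128,
        "top_k": 1,
        "end_id": 2,  # explicit end token id so the server can stop at </s>
    }
    resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
    print(resp.json())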

Linzecong commented 10 months ago

Any update here? The same problem has persisted for a few weeks.

spongezz commented 10 months ago

> Any update here? The same problem has persisted for a few weeks.

Try the latest main?

abacaj commented 10 months ago

This isn't fixed yet; it still can't stop generating when passing eos_token.

VeryVery commented 10 months ago

Mark

byshiue commented 10 months ago

Please try the latest main branch. Here is a document demonstrating how to use stop words in Triton: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md. Hope it is helpful.
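For completeness, a rough sketch of the stop-word route described in the linked doc (the exact parameter name and format should be checked against that doc; the "stop_words" field and the </s> literal here are assumptions):

    import requests

    # Same assumed endpoint and port as in the end_id example above.
    payload = {
        "text_input": "# write a bubble sort function\n",
        "max_tokens": 128,
        "stop_words": ["</s>"],  # assumption: strings at which generation should stop
    }
    resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
    print(resp.json())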