jxchenus opened 19 hours ago
I'm only able to reproduce this using the Triton Server Python backend with ModelRunnerCpp.
But the fix is pretty straightforward: just move the few problematic lines (https://github.com/NVIDIA/TensorRT-LLM/blob/v0.13.0/tensorrt_llm/runtime/model_runner_cpp.py#L795-L800) so they run after vocab_size is assigned.
Here's a patch that fixes the issue:
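The patch itself isn't reproduced above; the snippet below is only a simplified sketch of the reordering described, with hypothetical names (`build_gen_shape`, `logits`) rather than the actual `_fill_output` code:

```python
import torch

def build_gen_shape(logits: torch.Tensor, num_beams: int, max_new_tokens: int):
    # Hypothetical, simplified stand-in for the problematic section; names and
    # structure are illustrative only, not the real TensorRT-LLM code.
    #
    # Buggy order (what the linked lines effectively do): gen_shape is built
    # from vocab_size before vocab_size has been assigned:
    #   gen_shape = (num_beams, max_new_tokens, vocab_size)
    #   vocab_size = logits.shape[-1]
    #
    # Fixed order: take vocab_size from the logits first, then build the shape.
    vocab_size = logits.shape[-1]
    gen_shape = (num_beams, max_new_tokens, vocab_size)
    return gen_shape

# e.g. build_gen_shape(torch.zeros(1, 8, 32000), num_beams=1, max_new_tokens=8)
# returns (1, 8, 32000)
```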
Could you please provide a simple reproducer for this issue?
We would of course be happy to include your fixes and credit you appropriately, but we need to be able to reproduce the issue.
Thank you!
System Info
TensorRT-LLM v0.13.0
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The error is thrown from: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.13.0/tensorrt_llm/runtime/model_runner_cpp.py#L795-L800
Expected behavior
The code should first take the vocab size from the logits, as is done here.
actual behavior
The error is thrown from: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.13.0/tensorrt_llm/runtime/model_runner_cpp.py#L795-L800
Here's the stack:

Traceback (most recent call last):
  File "/opt/amazon/alexa_triton_inference_engine/lib/python3.10/site-packages/nemort_triton_trtllm_inference_server/models/agm/model.py", line 248, in execute
    outputs = self.runner.generate(
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/runtime/model_runner_cpp.py", line 606, in generate
    return self._initialize_and_fill_output(
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/runtime/model_runner_cpp.py", line 678, in _initialize_and_fill_output
    return self._fill_output(responses, output_ids, end_id, return_dict,
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/runtime/model_runner_cpp.py", line 800, in _fill_output
    gen_shape = (num_beams, max_new_tokens, vocab_size)
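For context, the failure the traceback points at is consistent with Python's local-variable scoping: because `vocab_size` is assigned later in the same function, referencing it at line 800 fails. A minimal standalone illustration (not TensorRT-LLM code, names assumed):

```python
def fill_output_sketch(logits_shape=(1, 8, 32000)):
    # vocab_size is assigned further down in this function, so Python treats it
    # as a local name; reading it before that assignment fails.
    gen_shape = (1, 8, vocab_size)   # fails here
    vocab_size = logits_shape[-1]    # the later assignment that makes it "local"
    return gen_shape

# Under Python 3.10, calling fill_output_sketch() raises:
#   UnboundLocalError: local variable 'vocab_size' referenced before assignment
```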
additional notes
N/A.