NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
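For reference, a minimal sketch of that workflow using the high-level Python API (the model name and sampling values below are placeholders, not part of this issue):

```python
# Minimal sketch of the TensorRT-LLM high-level Python API.
# The model name and sampling parameters are placeholders.
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM fetches the checkpoint and compiles a TensorRT engine.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```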

ModelRunnerCpp throws UnboundLocalError: local variable 'vocab_size' referenced before assignment #2284

Open jxchenus opened 19 hours ago

jxchenus commented 19 hours ago

System Info

TensorRT-LLM v0.13.0

Who can help?

No response

Reproduction

The error is thrown from: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.13.0/tensorrt_llm/runtime/model_runner_cpp.py#L795-L800

Expected behavior

The code should first take the vocab size from the logits, as is done elsewhere in the runtime.
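i.e., something along these lines (illustrative only; the exact tensor variable differs in the source):

```python
# Illustrative only: bind vocab_size from the logits tensor's last
# dimension before any code that reads it.
vocab_size = logits.shape[-1]
```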

Actual behavior

As noted under Reproduction, the error is thrown from: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.13.0/tensorrt_llm/runtime/model_runner_cpp.py#L795-L800

Here's the stack:

```
Traceback (most recent call last):
  File "/opt/amazon/alexa_triton_inference_engine/lib/python3.10/site-packages/nemort_triton_trtllm_inference_server/models/agm/model.py", line 248, in execute
    outputs = self.runner.generate(
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/runtime/model_runner_cpp.py", line 606, in generate
    return self._initialize_and_fill_output(
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/runtime/model_runner_cpp.py", line 678, in _initialize_and_fill_output
    return self._fill_output(responses, output_ids, end_id, return_dict,
  File "/opt/amazon/alexa_triton_inference_engine/NeMoRT-TensorRT-LLM/tensorrt_llm/runtime/model_runner_cpp.py", line 800, in _fill_output
    gen_shape = (num_beams, max_new_tokens, vocab_size)
UnboundLocalError: local variable 'vocab_size' referenced before assignment
```
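Paraphrased, the failing pattern looks like this (a simplified sketch, not the verbatim v0.13 source):

```python
# Simplified sketch of _fill_output in model_runner_cpp.py (v0.13); not verbatim.
# gen_shape reads vocab_size here ...
gen_shape = (num_beams, max_new_tokens, vocab_size)  # <- UnboundLocalError

# ... but vocab_size is only bound further down the function, when the
# logits are actually processed:
vocab_size = logits.shape[-1]
```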

Additional notes

N/A.

jxchenus commented 14 hours ago

I'm only able to reproduce this using the Triton Server Python backend with ModelRunnerCpp.

The fix is straightforward, though: move the few problematic lines (https://github.com/NVIDIA/TensorRT-LLM/blob/v0.13.0/tensorrt_llm/runtime/model_runner_cpp.py#L795-L800) to after the point where vocab_size is assigned.
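In sketch form, the reordering looks like this (hypothetical and abbreviated; not an actual diff of the source):

```python
# Hypothetical sketch of the proposed reordering, not the real patch.
# The existing assignment now runs first ...
vocab_size = logits.shape[-1]

# ... and the lines previously at L795-L800 move below it, so vocab_size
# is always bound before gen_shape is built.
gen_shape = (num_beams, max_new_tokens, vocab_size)
```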

jxchenus commented 13 hours ago

Here's a patch that fixes the issue:

fb165e6..1e3fb80.diff.txt

DanBlanaru commented 1 hour ago

Could you please provide a simple reproducer for this issue?

We would of course be happy to include your fixes and credit you appropriately, but we need to be able to reproduce the issue.

Thank you!