System Info
It seems like the Executor API ignores the prompt_vocab_size argument and passes max_prompt_embedding_table_size to the TRT engine instead.
I observe this behaviour with the 0.10.0 Python API (ModelRunnerCpp, to be precise) and with 0.9.0 (and 0.10.0) Triton requests, but not with the 0.9.0 Python API, so I assume the issue is in the Executor API.
Who can help?
No response
Information
[ ] The official example scripts
[x] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[x] My own task or dataset (give details below)
Reproduction
1) Define a fake model that exposes the provided prompt_vocab_size via its logits, plus a config for this model.
fake_model.py:
```python
from typing import Optional

import numpy as np

from tensorrt_llm.functional import (
    Tensor,
    cast,
    constant,
    unsqueeze,
)
from tensorrt_llm.models.modeling_utils import PretrainedConfig, PretrainedModel


class FakeTransformer(object):
    def __init__(self):
        self.vocab_embedding = None


class FakeModel(PretrainedModel):
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        self.transformer = FakeTransformer()

    def forward(
        self,
        input_ids: Tensor,
        position_ids=None,
        use_cache=False,
        last_token_ids=None,
        attention_mask=None,
        kv_cache_params=None,
        attention_params=None,
        hidden_states=None,
        prompt_embedding_table: Optional[Tensor] = None,
        prompt_tasks: Optional[Tensor] = None,
        prompt_vocab_size: Optional[Tensor] = None,
        lora_params=None,
        medusa_position_offsets=None,
        medusa_packed_mask=None,
    ):
        assert prompt_embedding_table is not None
        zero = constant(np.zeros((1, 1), dtype=self.config.dtype))
        # [1, vocab_size]
        zeros = constant(np.zeros((1, self.config.vocab_size), dtype=self.config.dtype))
        # repeat_interleave only supports int repeats, so we use addition + broadcasting instead
        # [len(input_ids), vocab_size]
        zeros_repeated = zeros + cast(unsqueeze(input_ids, 1), self.config.dtype) * zero
        # fake_logits used to expose prompt_vocab_size value
        fake_logits = zeros_repeated + cast(unsqueeze(unsqueeze(prompt_vocab_size, 0), 0), self.config.dtype)
        fake_logits.mark_output("logits")
        return fake_logits
```
config.json:
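The actual config.json from the report isn't preserved here. A minimal sketch of what a PretrainedConfig-style config for FakeModel could look like (every field name and value below is an illustrative assumption, not the original file):

```json
{
  "architecture": "FakeModel",
  "dtype": "float16",
  "vocab_size": 32000,
  "hidden_size": 2048,
  "intermediate_size": 8192,
  "num_hidden_layers": 1,
  "num_attention_heads": 16,
  "num_key_value_heads": 16,
  "max_position_embeddings": 2048,
  "hidden_act": "gelu",
  "mapping": {
    "world_size": 1,
    "tp_size": 1,
    "pp_size": 1
  }
}
```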
2) Build the engine.
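The exact build command isn't preserved either. Since FakeModel is a custom architecture, one plausible route is the Python builder API rather than trtllm-build; the snippet below is only a sketch, and the engine directory, sizes, max_prompt_embedding_table_size=20 and gather_context_logits=True are assumptions chosen to match the numbers reported below:

```python
# build_fake_engine.py - illustrative sketch, not the original build script
import json

from tensorrt_llm.builder import BuildConfig, build
from tensorrt_llm.models.modeling_utils import PretrainedConfig

from fake_model import FakeModel

# Load the config defined in the previous step (path assumed).
with open("config.json") as f:
    config = PretrainedConfig.from_dict(json.load(f))

model = FakeModel(config)

# max_prompt_embedding_table_size=20 matches the 20.0 observed below;
# gather_context_logits is needed so the context logits can be inspected.
build_config = BuildConfig(
    max_batch_size=1,
    max_input_len=64,
    max_prompt_embedding_table_size=20,
    gather_context_logits=True,
)

engine = build(model, build_config)
engine.save("fake_engine")
```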
3) Run the engine via the Python API:
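The run script is likewise not preserved. A rough sketch of driving the engine through ModelRunnerCpp with a 7-row prompt table (matching the expected 7.0 below); the parameter names follow the p-tuning examples and are assumptions rather than the reporter's exact invocation:

```python
# run_fake_engine.py - illustrative sketch, not the original run script
import torch

from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(engine_dir="fake_engine")

# 7 virtual tokens followed by a few regular tokens; virtual-token ids start
# at vocab_size (32000 in the sketch config above).
vocab_size = 32000
input_ids = list(range(vocab_size, vocab_size + 7)) + [1, 2, 3]
batch_input_ids = [torch.tensor(input_ids, dtype=torch.int32)]

# 7-row prompt table, so the engine should receive prompt_vocab_size = 7
# (hidden_size assumed to be 2048, matching the sketch config).
prompt_table = torch.zeros((7, 2048), dtype=torch.float16)

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=1,
    end_id=2,
    pad_id=2,
    prompt_table=prompt_table,
    prompt_tasks="0",
    output_sequence_lengths=True,
    return_dict=True,
)

# Assumes context logits are returned when the engine was built with
# gather_context_logits; per the report they are all 20.0 instead of 7.0.
print(outputs["context_logits"])
```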
Expected behavior
context logits are all 7.0 (the length of prompt_embedding_table).
actual behavior
context logits are all 20.0 (max_prompt_embedding_table_size).
additional notes
Debugging confirms that the request passed to the executor via the enqueue_requests call contains the correct .prompt_tuning_config - an embedding_table of shape […].
If we replace prompt_vocab_size with shape(prompt_embedding_table, 1) in fake_model.py, the result is the same.
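For context, the enqueue path referred to above looks roughly like this when using the executor bindings directly (a hedged sketch; the engine path, token ids, shapes and argument values are assumptions, not taken from the actual debugging session):

```python
# Sketch of the enqueue path; token ids, shapes and values are assumptions.
import torch

from tensorrt_llm.bindings import executor as trtllm

executor = trtllm.Executor(
    "fake_engine",
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(),
)

# The embedding table has 7 rows, so the engine's prompt_vocab_size input
# would be expected to be 7 rather than max_prompt_embedding_table_size.
ptuning_config = trtllm.PromptTuningConfig(
    embedding_table=torch.zeros((7, 2048), dtype=torch.float16)
)

request = trtllm.Request(
    input_token_ids=list(range(32000, 32007)) + [1, 2, 3],
    max_new_tokens=1,
    prompt_tuning_config=ptuning_config,
)

request_ids = executor.enqueue_requests([request])
```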