Marks101 opened this issue 5 months ago
@byshiue @ncomly-nvidia we figured that this feature could be implemented on our side based on a LogitsProcessor. But logits processors are currently not supported by ModelRunnerCpp / tensorrt_llm.bindings.GptSession:
https://github.com/NVIDIA/TensorRT-LLM/blob/71d8d4d3dc655671f32535d6d2b60cab87f36e87/tensorrt_llm/runtime/model_runner_cpp.py#L310-L312
Is there any plan to extend the support?
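For context, the kind of hook we mean is a HuggingFace-style logits processor: a callable that receives the tokens generated so far and the next-token logits and returns modified logits. A minimal sketch follows; the class and call signature mirror the transformers convention, not a confirmed TensorRT-LLM interface:

```python
import torch

class BanTokensLogitsProcessor:
    """Masks a fixed set of token ids by setting their logits to -inf.

    Illustrative only: the call signature follows the HuggingFace
    transformers LogitsProcessor convention (input_ids, scores) -> scores;
    the exact interface expected by the TensorRT-LLM runtime may differ.
    """

    def __init__(self, banned_token_ids):
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch_size, vocab_size) next-token logits for the current step
        scores[:, self.banned_token_ids] = float("-inf")
        return scores
```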
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
@Marks101, the logits processor is supported on ModelRunnerCppExecutor:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L48
Could you try that please?
Hi @MartinMarciniszyn, thank you for the update. We will take a look at this 😃
Hello, it looks like the logits processor is disabled here....
Thanks for the feedback @shangshng. It should be supported in the Python bindings of the Executor API. @dcampora, could you please add support to ModelRunnerCpp?
@Marks101, you can use the Executor API directly instead of going through ModelRunnerCpp for now.
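For anyone else landing here, a minimal sketch of that route, based on the executor examples shipped with the library. The engine path is a placeholder, and the exact parameter names (e.g. max_new_tokens vs. max_tokens, the ExecutorConfig arguments) vary between releases, so treat them as assumptions and check tensorrt_llm.bindings.executor for your version:

```python
import tensorrt_llm.bindings.executor as trtllm

# Engine path and parameter names below are assumptions for illustration;
# see the executor examples in the repo for the exact API of your release.
executor = trtllm.Executor(
    "/path/to/engine_dir",
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(1),  # max beam width
)

request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=16)
request_id = executor.enqueue_request(request)

for response in executor.await_responses(request_id):
    if response.has_error():
        raise RuntimeError(response.error_msg)
    # output_token_ids holds one token list per beam
    print(response.result.output_token_ids)
```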
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hello team,
We typically use gather_all_token_logits to collect the logit tensors for post-processing. Especially for large vocabulary sizes (128,000) this can require a lot of GPU memory. For example, when running inference loads with input and output lengths of 1024 and a batch size of 32, the collected logit tensor requires roughly 32 GB of memory (fp32). In vLLM it is possible to collect only the top-k logprobs (see here). This is much more memory efficient and would be sufficient for our purposes. Is there currently a way to do this in TensorRT-LLM as well? If not, we would really appreciate this feature in both ModelRunner and ModelRunnerCpp. This issue is related to https://github.com/NVIDIA/TensorRT-LLM/issues/1040, since we could solve that one on our side if arbitrary model outputs could be collected.
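To make the numbers concrete, here is the arithmetic behind the ~32 GB figure and a small PyTorch sketch of the top-k reduction we have in mind (the tensor shapes and k are just illustrative; in practice the logits would come from the runtime):

```python
import torch

# Back-of-the-envelope memory estimate for gather_all_token_logits (fp32):
batch_size = 32
seq_len = 1024 + 1024          # input + output positions per sequence
vocab_size = 128_000
bytes_fp32 = 4
logits_bytes = batch_size * seq_len * vocab_size * bytes_fp32
print(f"{logits_bytes / 1024**3:.1f} GiB")   # ~31.3 GiB, i.e. roughly the 32 GB above

# The reduction we would like: keep only the top-k logprobs per position.
k = 5
logits = torch.randn(batch_size, 16, vocab_size)   # short seq_len for the demo
logprobs = torch.log_softmax(logits.float(), dim=-1)
topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
# The top-k tensors are (batch, seq, k), orders of magnitude smaller
# than the full (batch, seq, vocab) logits tensor.
```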
Thank you