Marks101 opened this issue 5 months ago
@byshiue @ncomly-nvidia we figured that this feature could be implemented on our side based on a LogitsProcessor. But logits processors are currently not supported by ModelRunnerCpp / tensorrt_llm.bindings.GptSession:
https://github.com/NVIDIA/TensorRT-LLM/blob/71d8d4d3dc655671f32535d6d2b60cab87f36e87/tensorrt_llm/runtime/model_runner_cpp.py#L310-L312
Is there any plan to extend the support?
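For context, the kind of hook we mean is a HuggingFace-style logits processor: a callable that receives the tokens generated so far and the next-token logits and returns modified logits. A minimal sketch follows; the class and call signature mirror the transformers convention, not a confirmed TensorRT-LLM interface:

```python
import torch

class BanTokensLogitsProcessor:
    """Masks a fixed set of token ids by setting their logits to -inf.

    Illustrative only: the call signature follows the HuggingFace
    transformers LogitsProcessor convention (input_ids, scores) -> scores;
    the exact interface expected by the TensorRT-LLM runtime may differ.
    """

    def __init__(self, banned_token_ids):
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch_size, vocab_size) next-token logits for the current step
        scores[:, self.banned_token_ids] = float("-inf")
        return scores
```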
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
@Marks101, the logits processor is supported on ModelRunnerCppExecutor:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/model_runner_cpp.py#L48
Could you try that please?
Hi @MartinMarciniszyn, thank you for the update. We will take a look at this 😃
Hello, it looks like the logits processor is disabled here....
Thanks for the feedback @shangshng. It should be supported in the Python bindings of the Executor API. @dcampora, could you please add support to ModelRunnerCpp?
@Marks101, you can use the Executor API directly instead of going through ModelRunnerCpp for now.
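For anyone else landing here, a minimal sketch of that route, based on the executor examples shipped with the library. The engine path is a placeholder, and the exact parameter names (e.g. max_new_tokens vs. max_tokens, the ExecutorConfig arguments) vary between releases, so treat them as assumptions and check tensorrt_llm.bindings.executor for your version:

```python
import tensorrt_llm.bindings.executor as trtllm

# Engine path and parameter names below are assumptions for illustration;
# see the executor examples in the repo for the exact API of your release.
executor = trtllm.Executor(
    "/path/to/engine_dir",
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(1),  # max beam width
)

request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=16)
request_id = executor.enqueue_request(request)

for response in executor.await_responses(request_id):
    if response.has_error():
        raise RuntimeError(response.error_msg)
    # output_token_ids holds one token list per beam
    print(response.result.output_token_ids)
```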
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hello team,
We typically use gather_all_token_logits to collect the logit tensors for post-processing. Especially for large vocabulary sizes (128,000) this can require a lot of GPU memory. For example, when running inference loads with input and output lengths of 1024 and a batch size of 32, the collected logit tensor requires roughly 32 GB of memory (fp32). In vLLM it is possible to collect only the top-k logprobs (see here). This is much more memory efficient and would be sufficient for our purposes. Is there currently a way to do this in TensorRT-LLM as well? If not, we would really appreciate this feature in both ModelRunner and ModelRunnerCpp. This issue is related to https://github.com/NVIDIA/TensorRT-LLM/issues/1040, since we could solve that one on our side if arbitrary model outputs could be collected.
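To make the numbers concrete, here is the arithmetic behind the ~32 GB figure and a small PyTorch sketch of the top-k reduction we have in mind (the tensor shapes and k are just illustrative; in practice the logits would come from the runtime):

```python
import torch

# Back-of-the-envelope memory estimate for gather_all_token_logits (fp32):
batch_size = 32
seq_len = 1024 + 1024          # input + output positions per sequence
vocab_size = 128_000
bytes_fp32 = 4
logits_bytes = batch_size * seq_len * vocab_size * bytes_fp32
print(f"{logits_bytes / 1024**3:.1f} GiB")   # ~31.3 GiB, i.e. roughly the 32 GB above

# The reduction we would like: keep only the top-k logprobs per position.
k = 5
logits = torch.randn(batch_size, 16, vocab_size)   # short seq_len for the demo
logprobs = torch.log_softmax(logits.float(), dim=-1)
topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
# The top-k tensors are (batch, seq, k), orders of magnitude smaller
# than the full (batch, seq, vocab) logits tensor.
```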
Thank you