NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

mApplyLogitsPostProcessorBatched parameter for batch_manager::GenericLlmRequest #1801

Closed · akhoroshev closed this issue 1 week ago

akhoroshev commented 1 week ago

batch_manager::GenericLlmRequest has a logitsPostProcessor member of type std::function<void(RequestIdType, TensorPtr&, BeamTokens const&, TStream const&)> and an mApplyLogitsPostProcessorBatched option.

How can a callback with this per-request signature handle a batch of requests?

hijkzzz commented 1 week ago

@MartinMarciniszyn @Shixiaowei02 Could you please help to comment on this issue? Thanks.

MartinMarciniszyn commented 1 week ago

The batched logits postprocessor has this signature:

using LogitsPostProcessorBatched = std::function<void(std::vector<IdType> const&, std::vector<Tensor>&,
    std::vector<std::reference_wrapper<BeamTokens const>> const&, StreamPtr const&)>;

See types.h for details.
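For illustration, a batched post-processor matching this signature could be built as in the sketch below. The alias names (LogitsPostProcessorBatched, IdType, Tensor, BeamTokens, StreamPtr) come from the executor's types.h; the helper name makeBatchedProcessor and the loop body are hypothetical, and the actual logits manipulation is left as a comment since it depends on the executor Tensor interface.

#include <cstddef>
#include <functional>
#include <vector>

#include "tensorrt_llm/executor/types.h"

namespace tle = tensorrt_llm::executor;

// Sketch: a batched logits post-processor. All vectors are parallel, with
// one entry per request in the current batch; the single CUDA stream is
// shared by the whole batch.
tle::LogitsPostProcessorBatched makeBatchedProcessor()
{
    return [](std::vector<tle::IdType> const& reqIds,
              std::vector<tle::Tensor>& logits,
              std::vector<std::reference_wrapper<tle::BeamTokens const>> const& beamTokens,
              tle::StreamPtr const& stream)
    {
        for (std::size_t i = 0; i < reqIds.size(); ++i)
        {
            // logits[i] holds the logits tensor for request reqIds[i];
            // beamTokens[i].get() holds the tokens generated so far per beam.
            // Modify logits[i] in place here (e.g. mask banned token ids),
            // enqueueing any GPU work on `stream`.
        }
    };
}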

akhoroshev commented 1 week ago

@MartinMarciniszyn batch_manager does not accept such a signature. Your link points to the Executor API, not the batch_manager API.

MartinMarciniszyn commented 1 week ago

Setting the batched logits processor is not exposed on GptManager. Please use the Executor API for this functionality.
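For completeness, here is a hedged sketch of wiring a batched post-processor through the Executor API, reusing the makeBatchedProcessor helper from the sketch above. The setter setLogitsPostProcessorBatched, the per-request opt-in via setLogitsPostProcessorName, and the reserved name Request::kBatchedPostProcessorName are assumptions about the API at this version and may differ across releases; check executor.h before relying on them.

#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

int main()
{
    tle::ExecutorConfig config;
    // Assumed setter name; newer releases may group this into a
    // LogitsPostProcessorConfig instead.
    config.setLogitsPostProcessorBatched(makeBatchedProcessor());

    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, config);

    tle::Request request(/*inputTokenIds=*/{1, 2, 3}, /*maxNewTokens=*/16);
    // Assumed opt-in: the request selects the batched processor by a
    // reserved name rather than a per-request callback.
    request.setLogitsPostProcessorName(tle::Request::kBatchedPostProcessorName);

    auto const requestId = executor.enqueueRequest(request);
    auto const responses = executor.awaitResponses(requestId);
    return 0;
}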