NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

InferenceRequest::serialize does not handle logits post processor, log an error #1771

Open DreamGenX opened 3 months ago

DreamGenX commented 3 months ago

System Info

When using the C++ GptManager, I was using InferenceRequest::serialize to synchronize requests between ranks, not realizing that it does not handle logits post processors.

It would be great to log an error in InferenceRequest::serialize when the logits post processor is not nullopt, or at the very least to document this limitation in the code and the README.
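
To make this concrete, here is a minimal sketch of the kind of guard I have in mind. Everything below is a toy stand-in (the real batch_manager implementation is closed-source, so the struct, member names, and serialization format are made up):

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <optional>
#include <vector>

// Toy stand-in for InferenceRequest; the point is only the guard, not the
// real (closed-source) serialization format or member names.
struct ToyInferenceRequest
{
    std::optional<std::function<void(std::vector<float>&)>> logitsPostProcessor;

    std::vector<uint8_t> serialize() const
    {
        if (logitsPostProcessor.has_value())
        {
            // Callbacks cannot cross rank boundaries; make the silent drop loud
            // instead of letting the receiving rank quietly skip the processor.
            std::fprintf(stderr,
                "serialize(): logits post processor is set but will NOT be serialized; "
                "the receiving rank will not apply it.\n");
        }
        return {}; // the real implementation would pack tensors and parameters here
    }
};
```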

Here are some issues this resulted in:

Who can help?

@ju

Information

Tasks

Reproduction

N/A

Expected behavior

N/A

Actual behavior

N/A

Additional notes

N/A

nv-guomingz commented 3 months ago

Hi @MartinMarciniszyn, would you please take a look at this ticket?

MartinMarciniszyn commented 3 months ago

Hi @DreamGenX, have you tried using the new Executor API instead? Requests of this API will serialize the name of the associated logits post-processor correctly.
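
Roughly, the flow is to register callbacks by name once when the executor is constructed, and to have each request carry only that name. A sketch is below; the setter and type names are written from memory and may differ between releases, so treat it as an outline rather than the definitive API:

```cpp
#include <string>

#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

// Sketch of the named-post-processor flow. The registration of the
// {name -> callback} map happens once on the ExecutorConfig and is only
// described in the comments below.
tle::IdType enqueueWithNamedProcessor(tle::Executor& executor, tle::Request request)
{
    // 1) At executor construction time, a map from name to callback is passed
    //    in through ExecutorConfig, so every rank holds the same callbacks.
    // 2) Per request, only the *name* is set; that string is what the Executor
    //    serializes and broadcasts to the other ranks.
    request.setLogitsPostProcessorName("ban_bad_words");
    return executor.enqueueRequest(request);
}
```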

DreamGenX commented 3 months ago

Hi @MartinMarciniszyn, thanks for your time. I'll comment on that below, but I would say that adding documentation and a runtime warning for the batch manager's InferenceRequest::serialize still makes sense, regardless of the viability of the Executor API.

Regarding the Executor API:

The Executor API is unfortunately limited as far as I can tell -- I don't see a way to have "dynamic" post-processors, where the post-processor differs per request and all possible values aren't known a priori (so they can't be registered in the map).

With the Executor API, since all post-processors need to be pre-registered, there's no way to achieve this as far as I can tell. We can't even store the state in some external storage keyed by requestId, since the id isn't known before calling enqueueRequests.

Possible solution: let LogitPostProcessor access an arbitrary std::vector<uint8_t> state that we provide when building the Request / SamplingParams; there we can encode additional per-request state for the logit post processor. A more flexible version would be a LogitPostProcessorFactory that returns a LogitPostProcessor based on the state, as it could avoid expensive pre-processing on each token in some cases.
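
A rough sketch of the first variant (every type and member name below is hypothetical; none of this exists in TensorRT-LLM today):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical extension, NOT an existing TensorRT-LLM API: the request
// carries an opaque per-request blob, and the registered post-processor
// receives that blob alongside the logits on every step.
using LogitsTensor = std::vector<float>; // stand-in for the real logits tensor type

using StatefulLogitsPostProcessor =
    std::function<void(LogitsTensor& logits, std::vector<uint8_t> const& state)>;

struct HypotheticalRequest
{
    // ... the usual input ids, sampling params, etc. ...
    std::vector<uint8_t> logitsPostProcessorState; // e.g. a serialized grammar or token mask
};

// The runtime would then invoke the processor as
//   processor(logits, request.logitsPostProcessorState);
// on each generation step, so the state travels with the request across ranks.
```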

This is the main reason I have essentially re-implemented the Executor C++ API myself (the code is not open-source, so I could not just modify it either). So if the Executor API can be made more flexible, I would love it.

See also here: https://github.com/NVIDIA/TensorRT-LLM/issues/1680#issuecomment-2137826264

akhoroshev commented 3 months ago

> This is the main reason I have essentially re-implemented the Executor C++ API myself (the code is not open-source, so I could not just modify it either). So if the Executor API can be made more flexible, I would love it.

@DreamGenX

Hi, do you have any plans to open-source your own "Executor"?

DreamGenX commented 3 months ago

Hi @akhoroshev -- I would hope it soon becomes obsolete once the Executor API becomes more mature, but if not I don't mind releasing it.

akhoroshev commented 3 months ago

> Hi @akhoroshev -- I would hope it soon becomes obsolete once the Executor API becomes more mature, but if not I don't mind releasing it.

@DreamGenX

The trtllm team's decision not to open the source code (Executor/BatchManager) makes it very difficult to customize functionality.

There is a list of features that cannot be implemented because the sources are closed.

Therefore, I would be very grateful if you could share the sources of your executor.

DreamGenX commented 3 months ago

@MartinMarciniszyn can you let us know if you plan to expose a more flexible API for logits post processors in the Executor?

MartinMarciniszyn commented 3 months ago

Thank you for your feedback, @DreamGenX.

I agree that there should be an error message for InferenceRequest::serialize. We will look into it.

Serializing arbitrary function callbacks between the different ranks poses technical difficulties. The team will investigate how we can add more flexibility to the API given these restrictions.

DreamGenX commented 3 months ago

@MartinMarciniszyn Yeah, of course, serialization of arbitrary callbacks is not feasible; that's why I propose that we be able to attach arbitrary binary data as an extension to the Request / SamplingParams and then access this data in the LogitPostProcessor. It would be even better if we could have a LogitPostProcessorFactory, which gets called at the start of the request and can do all the pre-processing work (e.g. deserialization of the data) only once, instead of on every token.
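
To illustrate the factory variant (again, all names below are hypothetical and not an existing API): the expensive decoding of the per-request state would happen once per request, and only the returned closure runs on every token:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical shape of the proposal: the factory runs once per request
// (doing the expensive deserialization there), and the returned closure is
// what gets invoked on every generated token.
using LogitsTensor = std::vector<float>; // stand-in for the real tensor type
using LogitsPostProcessor = std::function<void(LogitsTensor& logits)>;

using LogitsPostProcessorFactory =
    std::function<LogitsPostProcessor(uint64_t requestId, std::vector<uint8_t> const& state)>;

// Example factory: decode the per-request state once, capture the result.
LogitsPostProcessorFactory makeFactory()
{
    return [](uint64_t /*requestId*/, std::vector<uint8_t> const& state) -> LogitsPostProcessor
    {
        // Pretend "state" encodes a list of banned token ids; decode it once here.
        std::vector<int32_t> banned(state.begin(), state.end());
        return [banned = std::move(banned)](LogitsTensor& logits)
        {
            for (int32_t tok : banned)
            {
                if (tok >= 0 && tok < static_cast<int32_t>(logits.size()))
                {
                    logits[tok] = -1e9f; // mask out banned tokens on every step
                }
            }
        };
    };
}
```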

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.