NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to pass hidden_states to the LLM directly when using inflight batching? #1488

Open JoursBleu opened 6 months ago

JoursBleu commented 6 months ago

Is there any method to pass hidden_states to the LLM directly when using inflight batching?

For example:

In the multimodal case, the image feature embedding is produced by the vision_tower and projector.

Generally, we can pass these hidden_states via the "prompt_table" param.

But it seems that "GenerationRequest" does not have a "prompt_table" attribute...

How can I pass these image feature hidden_states to the LLM?
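For context, the usual prompt-table flow looks roughly like the sketch below. This is a minimal illustration, not actual TensorRT-LLM API code: vision_tower, projector, and the vocab/hidden sizes are placeholders, and the fake-token-id layout follows the pattern used by the multimodal examples.

```python
import torch

VOCAB_SIZE = 32000   # text vocab size of the LLM (placeholder)
HIDDEN_SIZE = 4096   # LLM hidden size; the projector must map into it

def build_prompt_table(image, vision_tower, projector):
    """Encode the image and project its features into the LLM embedding
    space; the resulting tensor is the "prompt table"."""
    features = vision_tower(image)        # [num_patches, vision_dim]
    table = projector(features)           # [num_patches, HIDDEN_SIZE]
    return table.unsqueeze(0)             # [1, num_patches, HIDDEN_SIZE]

def build_input_ids(text_ids, num_patches):
    """Image positions are encoded as "fake" token ids >= VOCAB_SIZE; the
    runtime looks those ids up in the prompt table instead of the
    regular embedding matrix."""
    fake_ids = list(range(VOCAB_SIZE, VOCAB_SIZE + num_patches))
    return torch.tensor([fake_ids + text_ids], dtype=torch.int32)
```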

MartinMarciniszyn commented 6 months ago

We do not have support in the runtime for that at the moment. Is this something that could be handled inside the engine, @QiJune?

baby-care commented 5 months ago

> We do not have support in the runtime for that at the moment. Is this something that could be handled inside the engine, @QiJune?

@MartinMarciniszyn Does model_runner.py (the Python runner) support passing hidden_states?
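For what it's worth, the multimodal examples feed the table through the Python ModelRunner roughly like this. This is an untested sketch: it assumes generate() accepts the prompt_table_path argument used in those examples, and it reuses prompt_table / input_ids from the sketch in the first comment.

```python
import numpy as np
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="llm_engine_dir")  # example path

# Persist the projector output so the runner can load it;
# shape [1, num_patches, HIDDEN_SIZE].
np.save("prompt_table.npy", prompt_table.cpu().numpy())

outputs = runner.generate(
    batch_input_ids=[input_ids.squeeze(0)],  # list of 1-D int32 tensors
    prompt_table_path="prompt_table.npy",    # assumption: mirrors examples/multimodal
    max_new_tokens=64,
    end_id=2,   # model-specific token ids (placeholders)
    pad_id=2,
)
print(outputs)
```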

littletomatodonkey commented 5 months ago

It seems that prompt_table_path exists in InferenceRequest; maybe you can have a look. I'll give it a try soon.

https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/batch_manager/inferenceRequest.cpp#L141
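Building on that pointer, here is an (untested) sketch of what the batch-manager path might look like through the Python bindings, assuming the InferenceRequest setters mirror the input names bound in the linked file (prompt_embedding_table, prompt_vocab_size) and reusing prompt_table / input_ids from the earlier sketch. The constructor signature and tensor shapes are assumptions, not verified API.

```python
import torch
from tensorrt_llm.bindings import InferenceRequest

req = InferenceRequest(1)  # request id (assumption: positional ctor)
req.input_ids = input_ids.squeeze(0)        # [seq_len], int32
req.max_new_tokens = torch.tensor([[64]], dtype=torch.int32)
# Attach the projected vision features directly as the prompt embedding table.
req.prompt_embedding_table = prompt_table   # [1, num_patches, HIDDEN_SIZE]
req.prompt_vocab_size = torch.tensor([[prompt_table.shape[1]]], dtype=torch.int32)
```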