NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to use prompt_table? #2048

Open Popsicle0-0 opened 3 months ago

Popsicle0-0 commented 3 months ago

What conditions need to be met when using a prompt_table? I am trying to convert minicpm_llama3_v2.5. I have a custom method to merge input_ids with the ViT output, but where should this logic be applied? I found that GenerationSession seems to only accept input_ids as input, and when I used input_embeds as the only input, various issues arose. I want to try using prompt_table, but I don't know where to combine input_ids with the ViT output. Do you have any suggestions?

Hukongtao commented 3 months ago

You can look at how Qwen-VL is accelerated, and then you may understand.

Popsicle0-0 commented 3 months ago

> You can look at how Qwen-VL is accelerated, and then you may understand.

Thank you for your response. Yes, I've looked into how Qwen-VL implements the prompt_table, but I'm not sure whether this approach is suitable for all multimodal models. Also, why do different models generate the prompt_table in different ways? Where can I find reference information on this?

```python
def ptuning_setup(self, prompt_table, dtype, hidden_size, tasks, input_ids):
    if prompt_table is not None:
        # Number of virtual tokens per task (second dim of the table).
        task_vocab_size = torch.tensor([prompt_table.shape[1]],
                                       dtype=torch.int32,
                                       device="cuda")
        # Flatten [num_tasks, task_vocab_size, hidden_size] ->
        # [num_tasks * task_vocab_size, hidden_size].
        prompt_table = prompt_table.view(
            (prompt_table.shape[0] * prompt_table.shape[1],
             prompt_table.shape[2]))
        prompt_table = prompt_table.cuda().to(
            dtype=tensorrt_llm._utils.str_dtype_to_torch(dtype))
    else:
        prompt_table = torch.empty([1, hidden_size]).cuda()
        task_vocab_size = torch.zeros([1]).cuda()

    if tasks is not None:
        tasks = torch.tensor([int(t) for t in tasks.split(',')],
                             dtype=torch.int32,
                             device="cuda")
        assert tasks.shape[0] == input_ids.shape[
            0], "Number of supplied tasks must match input batch size"
    else:
        # Default: every sequence in the batch uses task 0.
        tasks = torch.zeros([input_ids.size(0)], dtype=torch.int32).cuda()

    return [prompt_table, tasks, task_vocab_size]
```
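For context, the three return values are what the example scripts hand to the runtime. A minimal sketch of that call, assuming the prompt-tuning keyword arguments of `GenerationSession.decode` (names may differ across TensorRT-LLM versions):

```python
# Sketch: feeding the ptuning_setup outputs into the runtime.
prompt_table, tasks, task_vocab_size = self.ptuning_setup(
    prompt_table, dtype, hidden_size, tasks, input_ids)

output_ids = decoder.decode(
    input_ids,
    input_lengths,
    sampling_config,
    prompt_embedding_table=prompt_table,  # flattened prompt/vision embeddings
    tasks=tasks,                          # per-sequence task index into the table
    prompt_vocab_size=task_vocab_size)    # number of virtual tokens per task
```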

Hukongtao commented 3 months ago

I have the same feeling as you. https://github.com/NVIDIA/TensorRT-LLM/issues/2104

Popsicle0-0 commented 3 months ago

> I have the same feeling as you. #2104

Could we have a short discussion? My email address is 1270660449@qq.com. Thank you!

amukkara commented 2 months ago

@Popsicle0-0

The prompt_table definition depends on the position of the special <Image> tokens in the prompt, which is model-specific. The idea is to split input_ids into [pre_text_ids, prompt_table_ids, post_text_ids]. Some models skip either the pre_text or post_text component. A sketch of that split is shown below.
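A minimal sketch of the splitting idea for a single sequence. The image token id, vocab size, and the helper name here are assumptions for illustration, not the exact TensorRT-LLM example code; the common pattern is that the image span in input_ids is replaced with "fake" ids starting at vocab_size, so the runtime looks those positions up in the prompt table instead of the word embedding table:

```python
import torch

IMAGE_TOKEN_ID = 32000   # assumed id of the special <Image> placeholder (model-specific)
VOCAB_SIZE = 32064       # assumed text vocab size of the model

def build_ptuning_inputs(input_ids: torch.Tensor, vit_embeds: torch.Tensor):
    """Split input_ids around the image placeholder run and build the
    (merged_ids, prompt_table) pair used for prompt tuning.

    input_ids : [seq_len] token ids containing a contiguous run of IMAGE_TOKEN_ID
    vit_embeds: [num_image_tokens, hidden_size] visual features from the ViT
    """
    image_mask = input_ids == IMAGE_TOKEN_ID
    num_image_tokens = int(image_mask.sum())
    assert num_image_tokens == vit_embeds.shape[0], \
        "one placeholder token per visual embedding"

    # pre_text_ids / post_text_ids: the text before and after the image span.
    positions = image_mask.nonzero(as_tuple=False).squeeze(-1)
    start, end = positions[0].item(), positions[-1].item() + 1
    pre_text_ids, post_text_ids = input_ids[:start], input_ids[end:]

    # prompt_table_ids: "fake" ids >= VOCAB_SIZE, which tell the runtime to
    # read these positions from the prompt table rather than the vocab.
    prompt_table_ids = torch.arange(
        VOCAB_SIZE, VOCAB_SIZE + num_image_tokens, dtype=input_ids.dtype)

    merged_ids = torch.cat([pre_text_ids, prompt_table_ids, post_text_ids])
    prompt_table = vit_embeds  # [num_image_tokens, hidden_size]
    return merged_ids, prompt_table
```

The merged_ids and prompt_table can then go through something like the ptuning_setup shown earlier before being passed to the runtime.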