NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Does TensorRT-LLM support passing input_embeds directly? #2104

Open Hukongtao opened 3 months ago

Hukongtao commented 3 months ago

For multimodal models, we usually need to combine visual features and input_embeds as final input_embeds and send them to the model for inference. Currently, this combination method may be different for different multimodal models. For example, for InternVL2 we have: https://huggingface.co/OpenGVLab/InternVL2-26B/blob/72496452c5525ba579fdd87d62bb958bfa59020e/modeling_internvl_chat.py#L318-L334 Therefore, can TensorRT-LLM support directly passing the final input_embeds? Although TensorRT-LLM provides us with the prompt_table parameter, in some cases, prompt_table cannot meet our needs.
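For context, here is a minimal PyTorch sketch of the kind of merge the linked InternVL2 code performs; the shapes, ids, and names are illustrative assumptions, not the actual modeling_internvl_chat.py code:

```python
import torch

# Hypothetical sizes; the real values come from the tokenizer and vision tower.
vocab_size, hidden_size, img_context_token_id = 32000, 4096, 31999
embedding = torch.nn.Embedding(vocab_size, hidden_size)

input_ids = torch.randint(0, vocab_size - 1, (1, 64))   # text token ids
input_ids[0, 8:24] = img_context_token_id                # 16 <IMG_CONTEXT> slots
vit_embeds = torch.randn(16, hidden_size)                # visual features

with torch.no_grad():
    # Embed the text, then overwrite the <IMG_CONTEXT> positions with visual features.
    input_embeds = embedding(input_ids)
    input_embeds[input_ids == img_context_token_id] = vit_embeds

# `input_embeds` is exactly what this issue asks to pass to TensorRT-LLM directly.
```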

DefTruth commented 3 months ago

same question

Hukongtao commented 3 months ago

> same question

Wow, an expert! I even follow you on Zhihu.

Oldpan commented 3 months ago

I'm also curious how input_embeds could be passed in directly; I'm not sure whether your use case for passing input_embeds directly is the same as mine. That said, InternVL2 can already be run with trt-llm by assembling the prompt as pre + img + post: the token ids are fixed before anything reaches trt-llm, and at inference time they are fed into the decoder engine together with the image's visual features. Inside the decoder engine the input_ids are embedded and concatenated with the visual features. This is workable.
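The pre + img + post scheme boils down to "virtual" token ids that index a prompt table instead of the word embedding. Below is an illustrative PyTorch sketch of that routing under assumed ids and sizes, not the actual trt-llm source:

```python
import torch

vocab_size, hidden_size, num_img_tokens = 32000, 4096, 16
word_emb = torch.nn.Embedding(vocab_size, hidden_size)
prompt_table = torch.randn(num_img_tokens, hidden_size)   # visual features

# pre + img + post: image slots carry virtual ids >= vocab_size (hypothetical ids).
pre, post = [1, 15043], [29901, 2]
img = list(range(vocab_size, vocab_size + num_img_tokens))
input_ids = torch.tensor([pre + img + post])

with torch.no_grad():
    is_virtual = input_ids >= vocab_size
    safe_ids = torch.where(is_virtual, torch.zeros_like(input_ids), input_ids)
    embeds = word_emb(safe_ids)                            # text embeddings
    # Virtual positions are filled from the prompt table instead.
    embeds[is_virtual] = prompt_table[input_ids[is_virtual] - vocab_size]
```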

DefTruth commented 3 months ago

Yes, at the moment that is the only workable implementation. It still has a problem, though: with input embeddings, when you enable a penalty such as repetition_penalty, transformers only penalizes the output ids, whereas trtllm, which runs inference from input ids, penalizes input ids + output ids together. So with penalties enabled on both sides, the outputs cannot be aligned with trn.
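The divergence described here can be sketched as follows, using the CTRL-style repetition-penalty rule that transformers implements; which set of ids gets penalized is the only difference between the two sides:

```python
import torch

def repetition_penalty(logits, penalized_ids, penalty=1.2):
    # CTRL-style rule: divide positive logits, multiply negative ones.
    scores = logits[penalized_ids]
    logits[penalized_ids] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

logits = torch.randn(32000)
input_ids, output_ids = torch.tensor([5, 9]), torch.tensor([7])

hf = repetition_penalty(logits.clone(), output_ids)                            # output ids only
trt = repetition_penalty(logits.clone(), torch.cat([input_ids, output_ids]))  # input + output ids
print(torch.equal(hf, trt))  # almost certainly False: the two runtimes diverge
```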

Hukongtao commented 3 months ago

> That said, InternVL2 can already be run with trt-llm by assembling the prompt as pre + img + post: [...] Inside the decoder engine the input_ids are embedded and concatenated with the visual features.

Doesn't that only work for a single image? What about the multi-image case? If the prompt contains several <image> placeholders, how do you pass them with the pre + img + post scheme?

DefTruth commented 3 months ago

> That said, InternVL2 can already be run with trt-llm by assembling the prompt as pre + img + post: [...]

> Doesn't that only work for a single image? What about the multi-image case? [...]

It would really be much simpler if passing input embeddings directly were supported.

Oldpan commented 3 months ago

> Doesn't that only work for a single image? What about the multi-image case? If the prompt contains several <image> placeholders, how do you pass them with the pre + img + post scheme?

Multiple images work too; see the VILA implementation in trt-llm. But yes, it would indeed be much simpler if you could just pass input embeddings, haha. @DefTruth by the way, what is the "trn" you mentioned?
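The multi-image case is the same virtual-token idea with several image segments interleaved. A hypothetical sketch (the `interleave` helper is invented for illustration, not trt-llm or VILA API):

```python
import torch

vocab_size, num_img_tokens = 32000, 16

def interleave(text_segments, num_images):
    """Hypothetical helper: seg0 <img0> seg1 <img1> seg2 ... using virtual ids."""
    ids, nxt = [], vocab_size
    for i, seg in enumerate(text_segments):
        ids.extend(seg)
        if i < num_images:
            ids.extend(range(nxt, nxt + num_img_tokens))
            nxt += num_img_tokens
    return torch.tensor([ids])

# Two <image> placeholders -> three surrounding text segments.
input_ids = interleave([[1, 306], [322], [29973, 2]], num_images=2)
# The prompt table stacks both images' features in the same order.
prompt_table = torch.randn(2 * num_img_tokens, 4096)
```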

DefTruth commented 3 months ago

> Multiple images work too; see the VILA implementation in trt-llm. [...] @DefTruth by the way, what is the "trn" you mentioned?

trn -> transformers; I was being lazy and abbreviated it.

qism commented 3 months ago

@Oldpan I got internvl2-2B running, but inference always generates the full max_tokens. Why is that?

Oldpan commented 3 months ago

> @Oldpan I got internvl2-2B running, but inference always generates the full max_tokens. Why is that?

My guess is that end_id is not set correctly.

zengrh3 commented 3 months ago

@Oldpan @qism I ran into the same issue with my own Llama variant: I passed eos_id to the runner.generate function, but it still generates tokens until it reaches max_new_tokens.

DefTruth commented 3 months ago

> My guess is that end_id is not set correctly.

I'd guess your guess is right.

qism commented 3 months ago

end_id = 2 is correct. After testing, run.py --stop_words "<|im_end|>" solves it.
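This is consistent with the terminator string "<|im_end|>" not being tokenized as the single token with id end_id, so the runtime never observes end_id and runs to max_new_tokens. A minimal sketch of a text-level stop check, which is what a stop-words option effectively does (illustrative, not the run.py implementation):

```python
def hit_stop_word(text: str, stop_words=("<|im_end|>",)) -> bool:
    # Match on decoded text, not a single token id: robust when the terminator
    # is tokenized into several ids, or into an id different from end_id.
    return any(s in text for s in stop_words)

assert hit_stop_word("The answer is 42.<|im_end|>")
assert not hit_stop_word("The answer is 42.")
```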

bnuzhanyu commented 3 months ago

> For multimodal models, we usually need to combine visual features and input_embeds as final input_embeds and send them to the model for inference. [...] Therefore, can TensorRT-LLM support directly passing the final input_embeds?

We patched qwen2 on top of 0.11.0 to add multimodal support; it passed our tests, so it may help you. The changes: https://github.com/bnuzhanyu/trtllm-mmodal/pull/1 The core idea is to additionally pass an mmodal_embedding matrix of shape [bs, seq_len, hidden_size], plus blending masks. The hidden state fed to the transformer is hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding, where input_mask and mmodal_mask are 0/1 float16 matrices you set by inspecting input_ids according to your application (see the sketch after the limitations list below).

It has the following limitations:

  1. Only the Python runtime works; there is currently no way to adapt it to the tritonserver trtllm backend.
  2. It can only generate ordinary token ids; generating multimodal-specific token ids is not supported.
  3. The changes presumably break speculative decoding and beam search (untested).
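A minimal PyTorch sketch of the masked blend described above; the placeholder id 0 and all sizes are assumptions for illustration (the PR uses float16 masks; float32 is kept here for simplicity):

```python
import torch

bs, seq_len, vocab_size, hidden_size = 1, 8, 32000, 4096
word_emb = torch.nn.Embedding(vocab_size, hidden_size)

# Hypothetical convention: placeholder id 0 marks multimodal positions in input_ids.
input_ids = torch.randint(1, vocab_size, (bs, seq_len))
input_ids[0, 2:6] = 0
mmodal_embedding = torch.randn(bs, seq_len, hidden_size)

# 0/1 masks derived from input_ids, as the PR description says.
mmodal_mask = (input_ids == 0).float().unsqueeze(-1)
input_mask = 1.0 - mmodal_mask

with torch.no_grad():
    hidden_state = input_mask * word_emb(input_ids) + mmodal_mask * mmodal_embedding
```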

Hukongtao commented 3 months ago

> We patched qwen2 on top of 0.11.0 to add multimodal support; it passed our tests, so it may help you. The changes: bnuzhanyu/trtllm-mmodal#1 [...]

Awesome work!

amukkara commented 2 months ago

input_embeds cannot be accessed directly; prompt_table should be used to pass visual features as input. The position of the visual features within the prompt varies from model to model.

For multiple images, see https://github.com/NVIDIA/TensorRT-LLM/issues/2144#issuecomment-2330175706

OswaldoBornemann commented 2 months ago

@amukkara So is there any way to pass input_embeds into TensorRT-LLM directly?

scuizhibin commented 2 months ago

> end_id = 2 is correct. After testing, run.py --stop_words "<|im_end|>" solves it.

Hi, I can now run the language part of InternVL2 with trt-llm, but can the vision part be accelerated with trt-llm as well?

Hukongtao commented 2 months ago

Is there any plan to support this requirement? There seem to be many related application scenarios. @byshiue

scuizhibin commented 1 month ago

> @Oldpan I got internvl2-2B running, but inference always generates the full max_tokens. Why is that?

How can I pass input_embeds into the model for inference?

scuizhibin commented 1 month ago

> assembling the prompt as pre + img + post

What does "assembling the prompt as pre + img + post" refer to?

hello-11 commented 2 weeks ago

@Hukongtao If you have no further questions, we will close it in a week.

Hukongtao commented 2 weeks ago

My question is whether this feature will be supported in the future.