NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Does TensorRT-LLM support passing input_embeds directly? #2104

Open Hukongtao opened 3 months ago

Hukongtao commented 3 months ago

For multimodal models, we usually need to combine visual features and input_embeds as final input_embeds and send them to the model for inference. Currently, this combination method may be different for different multimodal models. For example, for InternVL2 we have: https://huggingface.co/OpenGVLab/InternVL2-26B/blob/72496452c5525ba579fdd87d62bb958bfa59020e/modeling_internvl_chat.py#L318-L334 Therefore, can TensorRT-LLM support directly passing the final input_embeds? Although TensorRT-LLM provides us with the prompt_table parameter, in some cases, prompt_table cannot meet our needs.
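For context, here is a minimal PyTorch sketch of the kind of merge the linked InternVL2 code performs; the shapes, ids, and names are illustrative assumptions, not the actual modeling_internvl_chat.py code:

```python
import torch

# Hypothetical sizes; the real values come from the tokenizer and vision tower.
vocab_size, hidden_size, img_context_token_id = 32000, 4096, 31999
embedding = torch.nn.Embedding(vocab_size, hidden_size)

input_ids = torch.randint(0, vocab_size - 1, (1, 64))   # text token ids
input_ids[0, 8:24] = img_context_token_id                # 16 <IMG_CONTEXT> slots
vit_embeds = torch.randn(16, hidden_size)                # visual features

with torch.no_grad():
    # Embed the text, then overwrite the <IMG_CONTEXT> positions with visual features.
    input_embeds = embedding(input_ids)
    input_embeds[input_ids == img_context_token_id] = vit_embeds

# `input_embeds` is exactly what this issue asks to pass to TensorRT-LLM directly.
```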

DefTruth commented 3 months ago

same question

Hukongtao commented 3 months ago

> same question

Wow, an expert! I even follow you on Zhihu.

Oldpan commented 3 months ago

I'm also curious how input_embeds could be passed in directly; I'm not sure whether your use case for passing input_embeds directly is the same as mine. That said, InternVL2 can already be run with trt-llm by assembling the prompt as pre + img + post: the token ids are fixed before anything reaches trt-llm, and at inference time they are fed into the decoder engine together with the image's visual features. Inside the decoder engine the input_ids are embedded and concatenated with the visual features. This is workable.
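The pre + img + post scheme boils down to "virtual" token ids that index a prompt table instead of the word embedding. Below is an illustrative PyTorch sketch of that routing under assumed ids and sizes, not the actual trt-llm source:

```python
import torch

vocab_size, hidden_size, num_img_tokens = 32000, 4096, 16
word_emb = torch.nn.Embedding(vocab_size, hidden_size)
prompt_table = torch.randn(num_img_tokens, hidden_size)   # visual features

# pre + img + post: image slots carry virtual ids >= vocab_size (hypothetical ids).
pre, post = [1, 15043], [29901, 2]
img = list(range(vocab_size, vocab_size + num_img_tokens))
input_ids = torch.tensor([pre + img + post])

with torch.no_grad():
    is_virtual = input_ids >= vocab_size
    safe_ids = torch.where(is_virtual, torch.zeros_like(input_ids), input_ids)
    embeds = word_emb(safe_ids)                            # text embeddings
    # Virtual positions are filled from the prompt table instead.
    embeds[is_virtual] = prompt_table[input_ids[is_virtual] - vocab_size]
```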

DefTruth commented 3 months ago

Yes, at the moment that is the only workable implementation. It still has a problem, though: with input embeddings, when you enable a penalty such as repetition_penalty, transformers only penalizes the output ids, whereas trtllm, which runs inference from input ids, penalizes input ids + output ids together. So with penalties enabled on both sides, the outputs cannot be aligned with trn.
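The divergence described here can be sketched as follows, using the CTRL-style repetition-penalty rule that transformers implements; which set of ids gets penalized is the only difference between the two sides:

```python
import torch

def repetition_penalty(logits, penalized_ids, penalty=1.2):
    # CTRL-style rule: divide positive logits, multiply negative ones.
    scores = logits[penalized_ids]
    logits[penalized_ids] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

logits = torch.randn(32000)
input_ids, output_ids = torch.tensor([5, 9]), torch.tensor([7])

hf = repetition_penalty(logits.clone(), output_ids)                            # output ids only
trt = repetition_penalty(logits.clone(), torch.cat([input_ids, output_ids]))  # input + output ids
print(torch.equal(hf, trt))  # almost certainly False: the two runtimes diverge
```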

Hukongtao commented 3 months ago

> That said, InternVL2 can already be run with trt-llm by assembling the prompt as pre + img + post: [...] Inside the decoder engine the input_ids are embedded and concatenated with the visual features.

Doesn't that only work for a single image? What about the multi-image case? If the prompt contains several <image> placeholders, how do you pass them with the pre + img + post scheme?

DefTruth commented 3 months ago

> That said, InternVL2 can already be run with trt-llm by assembling the prompt as pre + img + post: [...]

> Doesn't that only work for a single image? What about the multi-image case? [...]

It would really be much simpler if passing input embeddings directly were supported.

Oldpan commented 3 months ago

> Doesn't that only work for a single image? What about the multi-image case? If the prompt contains several <image> placeholders, how do you pass them with the pre + img + post scheme?

Multiple images work too; see the VILA implementation in trt-llm. But yes, it would indeed be much simpler if you could just pass input embeddings, haha. @DefTruth by the way, what is the "trn" you mentioned?
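The multi-image case is the same virtual-token idea with several image segments interleaved. A hypothetical sketch (the `interleave` helper is invented for illustration, not trt-llm or VILA API):

```python
import torch

vocab_size, num_img_tokens = 32000, 16

def interleave(text_segments, num_images):
    """Hypothetical helper: seg0 <img0> seg1 <img1> seg2 ... using virtual ids."""
    ids, nxt = [], vocab_size
    for i, seg in enumerate(text_segments):
        ids.extend(seg)
        if i < num_images:
            ids.extend(range(nxt, nxt + num_img_tokens))
            nxt += num_img_tokens
    return torch.tensor([ids])

# Two <image> placeholders -> three surrounding text segments.
input_ids = interleave([[1, 306], [322], [29973, 2]], num_images=2)
# The prompt table stacks both images' features in the same order.
prompt_table = torch.randn(2 * num_img_tokens, 4096)
```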

DefTruth commented 3 months ago

> Multiple images work too; see the VILA implementation in trt-llm. [...] @DefTruth by the way, what is the "trn" you mentioned?

trn -> transformers; I was being lazy and abbreviated it.

qism commented 3 months ago

@Oldpan I got internvl2-2B running, but inference always generates the full max_tokens. Why is that?

Oldpan commented 3 months ago

> @Oldpan I got internvl2-2B running, but inference always generates the full max_tokens. Why is that?

My guess is that end_id is not set correctly.

zengrh3 commented 3 months ago

@Oldpan @qism I ran into the same issue with my own Llama variant: I passed eos_id to the runner.generate function, but it still generates tokens until it reaches max_new_tokens.

DefTruth commented 3 months ago

> My guess is that end_id is not set correctly.

I'd guess your guess is right.

qism commented 3 months ago

end_id = 2 is correct. After testing, run.py --stop_words "<|im_end|>" solves it.
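This is consistent with the terminator string "<|im_end|>" not being tokenized as the single token with id end_id, so the runtime never observes end_id and runs to max_new_tokens. A minimal sketch of a text-level stop check, which is what a stop-words option effectively does (illustrative, not the run.py implementation):

```python
def hit_stop_word(text: str, stop_words=("<|im_end|>",)) -> bool:
    # Match on decoded text, not a single token id: robust when the terminator
    # is tokenized into several ids, or into an id different from end_id.
    return any(s in text for s in stop_words)

assert hit_stop_word("The answer is 42.<|im_end|>")
assert not hit_stop_word("The answer is 42.")
```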

bnuzhanyu commented 3 months ago

> For multimodal models, we usually need to combine visual features and input_embeds as final input_embeds and send them to the model for inference. [...] Therefore, can TensorRT-LLM support directly passing the final input_embeds?

We patched qwen2 on top of 0.11.0 to add multimodal support; it passed our tests, so it may help you. The changes: https://github.com/bnuzhanyu/trtllm-mmodal/pull/1 The core idea is to additionally pass an mmodal_embedding matrix of shape [bs, seq_len, hidden_size], plus blending masks. The hidden state fed to the transformer is hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding, where input_mask and mmodal_mask are 0/1 float16 matrices you set by inspecting input_ids according to your application (see the sketch after the limitations list below).

It has the following limitations:

  1. Only the Python runtime works; there is currently no way to adapt it to the tritonserver trtllm backend.
  2. It can only generate ordinary token ids; generating multimodal-specific token ids is not supported.
  3. The changes presumably break speculative decoding and beam search (untested).
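A minimal PyTorch sketch of the masked blend described above; the placeholder id 0 and all sizes are assumptions for illustration (the PR uses float16 masks; float32 is kept here for simplicity):

```python
import torch

bs, seq_len, vocab_size, hidden_size = 1, 8, 32000, 4096
word_emb = torch.nn.Embedding(vocab_size, hidden_size)

# Hypothetical convention: placeholder id 0 marks multimodal positions in input_ids.
input_ids = torch.randint(1, vocab_size, (bs, seq_len))
input_ids[0, 2:6] = 0
mmodal_embedding = torch.randn(bs, seq_len, hidden_size)

# 0/1 masks derived from input_ids, as the PR description says.
mmodal_mask = (input_ids == 0).float().unsqueeze(-1)
input_mask = 1.0 - mmodal_mask

with torch.no_grad():
    hidden_state = input_mask * word_emb(input_ids) + mmodal_mask * mmodal_embedding
```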

Hukongtao commented 3 months ago

> We patched qwen2 on top of 0.11.0 to add multimodal support; it passed our tests, so it may help you. The changes: bnuzhanyu/trtllm-mmodal#1 [...]

Awesome work!

amukkara commented 2 months ago

input_embeds cannot be accessed directly; prompt_table should be used to pass visual features as input. The position of the visual features within the prompt varies from model to model.

For multiple images, see https://github.com/NVIDIA/TensorRT-LLM/issues/2144#issuecomment-2330175706

OswaldoBornemann commented 2 months ago

@amukkara So is there any way to pass input_embeds into TensorRT-LLM directly?

scuizhibin commented 2 months ago

> end_id = 2 is correct. After testing, run.py --stop_words "<|im_end|>" solves it.

Hi, I can now run the language part of InternVL2 with trt-llm, but can the vision part be accelerated with trt-llm as well?

Hukongtao commented 2 months ago

Is there any plan to support this requirement? There seem to be many related application scenarios. @byshiue

scuizhibin commented 1 month ago

> @Oldpan I got internvl2-2B running, but inference always generates the full max_tokens. Why is that?

How can I pass input_embeds into the model for inference?

scuizhibin commented 1 month ago

> assembling the prompt as pre + img + post

What does "assembling the prompt as pre + img + post" refer to?

hello-11 commented 2 weeks ago

@Hukongtao If you have no further questions, we will close it in a week.

Hukongtao commented 2 weeks ago

My question is whether this feature will be supported in the future.