Hukongtao opened this issue 3 months ago
same question
same question
Wow, you're amazing! I actually follow you on Zhihu.
I'm also curious about how input_embeds could be passed in directly. I'm not sure what your specific need for passing input_embeds directly is, or whether it's the same as mine. That said, InternVL2 can be run with trt-llm by concatenating the prompt as pre + img + post. The token ids are determined before they are fed to trt-llm; when the trt-llm decoder engine actually runs, they are passed in together with the image's visual_feature, the input_ids are embedded inside the engine and then concatenated with the visual_feature. This can be implemented.
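For illustration, here is a minimal sketch of the pre + img + post construction, assuming the fake-prompt-id convention used in TensorRT-LLM's multimodal examples (image positions get ids >= vocab_size that index rows of a prompt table); the function name, arguments, and shapes are illustrative, not actual library code.

```python
# Sketch only: build input_ids with placeholder ids for the image segment,
# plus the prompt table holding the projected visual features.
import torch

def build_multimodal_input(tokenizer, pre_text, post_text, visual_feature, vocab_size):
    """visual_feature: [num_img_tokens, hidden_size], already projected to the LLM hidden size."""
    pre_ids = tokenizer.encode(pre_text, add_special_tokens=False)
    post_ids = tokenizer.encode(post_text, add_special_tokens=False)

    num_img_tokens = visual_feature.shape[0]
    # Fake ids beyond the text vocab; the engine looks them up in the prompt table
    img_ids = list(range(vocab_size, vocab_size + num_img_tokens))

    input_ids = torch.tensor(pre_ids + img_ids + post_ids, dtype=torch.int32)
    prompt_table = visual_feature  # one row per fake id, in the same order
    return input_ids, prompt_table
```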
Yes, this is the only implementation available right now. But this approach still has a problem: when you pass input embeddings and need a penalty such as repetition_penalty, transformers only considers the output ids for the penalty, whereas trtllm runs inference on input ids and actually applies the penalty over input ids + output ids together. As a result, when a penalty is enabled on both sides, the outputs cannot be aligned with trn.
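To make that mismatch concrete, here is a toy sketch (not transformers or trtllm code) of a repetition penalty applied over the two different token sets; the token ids and penalty value are made up for illustration.

```python
# Toy illustration of the alignment issue described above.
import torch

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Penalize logits of tokens that already appeared in seen_token_ids."""
    for tok in set(seen_token_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

logits = torch.randn(32000)
prompt_ids = [1, 42, 77]   # prompt tokens (only visible when input_ids are passed)
generated_ids = [5, 9]     # tokens generated so far

# inputs_embeds path: only the generated ids get penalized
logits_a = apply_repetition_penalty(logits.clone(), generated_ids)
# input_ids path: prompt + generated ids get penalized
logits_b = apply_repetition_penalty(logits.clone(), prompt_ids + generated_ids)
# logits_a and logits_b differ at positions 1, 42, 77, so sampling can diverge.
```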
The case you describe only works for a single image, right? What about multiple images, where the prompt contains several <image> placeholders? How would you pass those with the pre + img + post approach?
Honestly, it would be much simpler if passing input embeddings directly were supported.
Multiple images work too; you can refer to the VILA implementation in trt-llm. But yes, it would indeed be much simpler if input embeddings could be passed directly, haha. @DefTruth, by the way, what is the "trn" you mentioned?
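For reference, a hypothetical sketch of how several images could be interleaved with text under the same fake-id convention as above; the function and the segment format are illustrative only, not the actual VILA code.

```python
# Sketch only: interleave text segments and per-image fake-id blocks,
# concatenating all visual features into one prompt table.
import torch

def build_interleaved_input(tokenizer, segments, vocab_size):
    """segments: list of ("text", str) or ("image", Tensor[num_tokens, hidden])."""
    input_ids, tables, next_fake_id = [], [], vocab_size
    for kind, payload in segments:
        if kind == "text":
            input_ids += tokenizer.encode(payload, add_special_tokens=False)
        else:  # image: reserve a block of fake ids for this image's tokens
            n = payload.shape[0]
            input_ids += list(range(next_fake_id, next_fake_id + n))
            next_fake_id += n
            tables.append(payload)
    prompt_table = torch.cat(tables, dim=0)  # rows align with the fake ids above
    return torch.tensor(input_ids, dtype=torch.int32), prompt_table
```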
trn -> transformers; I was being lazy and abbreviated it.
@Oldpan I got InternVL2-2B running, but inference always generates up to max_token. Why is that?
My guess is that end_id is not set correctly.
@Oldpan @qism I ran into the same issue with my own Llama setup. I passed the eos_id to the runner.generate function, but it still generates tokens until it reaches max_new_tokens.
I'd guess your guess is right.
end_id = 2 is indeed correct. After testing, running run.py with --stop_words "<|im_end|>" solves it.
For multimodal models, we usually need to combine visual features and input_embeds as final input_embeds and send them to the model for inference. Currently, this combination method may be different for different multimodal models. For example, for InternVL2 we have: https://huggingface.co/OpenGVLab/InternVL2-26B/blob/72496452c5525ba579fdd87d62bb958bfa59020e/modeling_internvl_chat.py#L318-L334 Therefore, can TensorRT-LLM support directly passing the final input_embeds? Although TensorRT-LLM provides us with the prompt_table parameter, in some cases, prompt_table cannot meet our needs.
We modified qwen2 on top of 0.11.0 to add multimodal support; it works fine in our tests, see if it helps you. The changes: https://github.com/bnuzhanyu/trtllm-mmodal/pull/1. The core idea is to additionally pass in an mmodal_embedding matrix of shape [bs, seq_len, hidden_size], plus weighting masks. The hidden_state finally fed to the transformer is hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding. The input mask and mmodal mask can be set to 0/1 float16 matrices by checking the input_ids, according to your use case (a rough sketch of this masking idea follows the list below).
It has the following limitations:
- Only the Python runtime can be used; there is currently no way to adapt this for the trtllm backend of tritonserver.
- It can only generate ordinary token ids; generating multimodal token_ids is not supported.
- The current changes presumably cannot be used with speculative decoding or beam search (untested).
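Below is a minimal sketch of the masking idea described above, not the actual code from the PR; the image-token-id convention and function name are assumptions made for illustration.

```python
# Sketch only: combine text and visual embeddings with 0/1 float16 masks.
import torch

def merge_embeddings(input_ids, word_emb, mmodal_embedding, img_token_id):
    """
    input_ids:        [bs, seq_len]           token ids, image slots filled with img_token_id
    word_emb:         [bs, seq_len, hidden]   embeddings of input_ids from the LLM's table
    mmodal_embedding: [bs, seq_len, hidden]   visual features scattered into the image positions
    """
    is_image = (input_ids == img_token_id)                  # [bs, seq_len] bool
    mmodal_mask = is_image.unsqueeze(-1).to(torch.float16)  # 1.0 at image positions
    input_mask = 1.0 - mmodal_mask                          # 1.0 at text positions
    # hidden_state = input_mask * word_emb + mmodal_mask * mmodal_embedding
    return input_mask * word_emb + mmodal_mask * mmodal_embedding
```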
That's really impressive.
input_embeds cannot be accessed directly. prompt_table should be used to pass visual features as input. The specific position of the visual features within the prompt changes from one model to another.
For multiple images, see https://github.com/NVIDIA/TensorRT-LLM/issues/2144#issuecomment-2330175706
@amukkara So is there any way to pass input_embeds into TensorRT-LLM directly?
Hi, I can now run the language part of InternVL2 with trt-llm, but can the vision part also be accelerated with trt-llm?
Is there any plan to support this requirement? It seems that there are many related application scenarios. @byshiue
How can I pass input_embeds into the model for inference?
What does "concatenating the prompt as pre + img + post" refer to?
@Hukongtao If you have no further questions, we will close it in a week.
My question is whether this feature will be supported in the future.