Open irexyc opened 2 months ago
model.chat only supports passing in a new image when history is None.
def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
         IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    self.img_context_token_id = img_context_token_id
    if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
        eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
    else:
        eos_token_id = tokenizer.eos_token_id
    from .conversation import get_conv_template
    template = get_conv_template(self.template)
    image_bs = pixel_values.shape[0]
    print(f'dynamic ViT batch size: {image_bs}')
    if history is None:
        history = []
        image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
        question = image_tokens + '\n' + question
    else:
        for (old_question, old_answer) in history:
            template.append_message(template.roles[0], old_question)
            template.append_message(template.roles[1], old_answer)
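To make the limitation concrete, here is a minimal standalone sketch of the question-building branch, based only on the lines quoted above. `build_question` is a hypothetical helper written for illustration, not part of InternVL, and the token counts used with it are made-up small values.

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'


def build_question(question, image_bs, num_image_token, history=None):
    """Hypothetical helper mirroring the branch on `history` in the quoted chat()."""
    if history is None:
        # First round: one placeholder block covering every tile in pixel_values
        # is prepended to the question.
        image_tokens = (IMG_START_TOKEN
                        + IMG_CONTEXT_TOKEN * num_image_token * image_bs
                        + IMG_END_TOKEN)
        return image_tokens + '\n' + question
    # Later rounds: in the quoted lines only the old turns are replayed,
    # so no new image placeholders are added for this question.
    return question


first = build_question('Describe this image.', image_bs=1, num_image_token=2)
later = build_question('How many sheep are in the picture?', image_bs=1,
                       num_image_token=2, history=[('q', 'a')])
```

In this sketch `later` carries no image placeholders at all, which is why an image passed in a later round has nowhere to go in the prompt.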
You can wrap the generate method yourself, following the chat method. Alternatively, you could try the swift framework: https://github.com/OpenGVLab/InternVL/issues/129
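@hjh0119 The current code does not support the usage I described. As you said, it may impose some restrictions on the input.
@czczup My question is whether InternVL-chat is capable of interleaved image-text conversation, i.e. whether I can provide an image in any round (similar to the example figure given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.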
Interleaved image-text conversation is possible; you can refer to this.
@hjh0119
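I looked at your code. The way the prompt is assembled seems to be the same as in internvl-demo: all image tokens are placed in the first-round user message, which is not quite what I mean by "interleaved". What I mean by interleaved is the way you handle deepseek-vl, where the image tokens sit in the user message of each round rather than being concentrated in the first-round user message.
So I would still like to confirm with the internvl authors what the correct way is to handle multi-round conversations that include images.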
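The two layouts being contrasted here can be sketched as follows. This only illustrates the difference in prompt structure; which layout InternVL expects is exactly the open question in this thread. The helper names and the `(question, has_image)` round format are assumptions made for the example, and each image is assumed to be a single tile.

```python
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'


def image_block(num_image_token, num_tiles=1):
    """Hypothetical helper: placeholder string covering `num_tiles` image tiles."""
    return IMG_START + IMG_CONTEXT * num_image_token * num_tiles + IMG_END


def first_round_layout(rounds, num_image_token):
    """All image placeholders go in front of the first user message."""
    questions = [q for q, _ in rounds]
    n_images = sum(1 for _, has_image in rounds if has_image)
    if n_images:
        questions[0] = image_block(num_image_token, n_images) + '\n' + questions[0]
    return questions


def interleaved_layout(rounds, num_image_token):
    """Each image placeholder stays in the user message of its own round."""
    return [
        image_block(num_image_token) + '\n' + q if has_image else q
        for q, has_image in rounds
    ]


rounds = [('Describe this image.', True),
          ('How many sheep are in the picture?', True)]
stacked = first_round_layout(rounds, num_image_token=2)
interleaved = interleaved_layout(rounds, num_image_token=2)
```

With `first_round_layout`, the second user message contains no image tokens even though it introduces a new image. With `interleaved_layout`, every round carries its own block.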
@irexyc If I understand correctly, by "interleaved" you mean that every input can come with a new image? Like in this example:
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.
I see. The main issue is how the image tokens from earlier rounds are handled. I indeed have not seen an official way of handling that.
According to the demo code in the README, the images are all passed in the first round of the chat, and the image tokens are placed in front of the question.
I would like to know whether InternVL-chat supports interleaved text-and-image conversations like DeepSpeed-VisualChat. If so, where should the image tokens be inserted in each round of the conversation? An example would be appreciated.