OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source multimodal dialogue model approaching GPT-4V performance.
https://internvl.github.io/
MIT License

Does InternVL support multi-image interleaved conversations? #153

Open · irexyc opened this issue 2 months ago

irexyc commented 2 months ago

According to the demo code in the README, the images are passed in the first round of the chat, and the image tokens are placed at the front of the question.

# Demo code from the README

# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# prompt looks like this:
# <|im_start|>system\n{system_message}<|im_end|><|im_start|>user\n<img>placeholder ... </img>\n{question}<|im_end|><|im_start|>assistant\n
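To make the layout concrete, here is a small string-level sketch (not InternVL source code; `num_image_token=256` is assumed for illustration) of how the `<IMG_CONTEXT>` placeholder block is expanded and prepended to the first question, following the pattern in `model.chat`:

```python
# Illustrative sketch only: expand the image placeholder the way
# model.chat does for the first round. The per-tile token count
# (num_image_token) is an assumed value here.
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'

def build_first_turn(question: str, num_tiles: int, num_image_token: int = 256) -> str:
    # One <IMG_CONTEXT> per visual token, for every tile in pixel_values
    image_tokens = IMG_START + IMG_CONTEXT * num_image_token * num_tiles + IMG_END
    return image_tokens + '\n' + question

prompt = build_first_turn('Describe the two pictures in detail', num_tiles=2)
```

All tiles from both images end up in this single block at the start of the first user turn.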

I would like to know whether InternVL-Chat supports interleaved image-text conversations like DeepSpeed-VisualChat. If so, how should the image tokens be inserted in each round? An example would be appreciated.

# Does InternVL support something like this? (I know pixel_values must be passed,
# but I can't find demo code that passes pixel_values in an interleaved text-and-image conversation.)

pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=None, return_history=True)
print(question, response)

pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values2, question, generation_config, history=history, return_history=True)
print(question, response)

question = "What is the difference between the two images?"
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(question, response)
hjh0119 commented 2 months ago

`model.chat` only supports passing in new images when `history` is `None`:

    def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
             IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):

        img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
        self.img_context_token_id = img_context_token_id
        if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
            eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
        else:
            eos_token_id = tokenizer.eos_token_id

        from .conversation import get_conv_template

        template = get_conv_template(self.template)
        image_bs = pixel_values.shape[0]
        print(f'dynamic ViT batch size: {image_bs}')
        if history is None:
            history = []
            image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
            question = image_tokens + '\n' + question
        else:
            for (old_question, old_answer) in history:
                template.append_message(template.roles[0], old_question)
                template.append_message(template.roles[1], old_answer)

You could wrap your own generate call modeled on the chat method, or you could try the swift framework: https://github.com/OpenGVLab/InternVL/issues/129
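A minimal string-level sketch of what such a wrapper might assemble (hypothetical, not an InternVL API; the helper names `image_placeholder` and `interleaved_prompt` are made up, and `num_image_token=256` is assumed). Each user turn carries its own image tokens instead of concentrating them in the first turn:

```python
# Hypothetical sketch: build an interleaved prompt where each round's
# images stay inside that round's user turn.
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'

def image_placeholder(num_tiles: int, num_image_token: int = 256) -> str:
    return IMG_START + IMG_CONTEXT * num_image_token * num_tiles + IMG_END

def interleaved_prompt(turns, system_message='You are a helpful assistant.'):
    """turns: list of (question, num_tiles, answer_or_None)."""
    parts = [f'<|im_start|>system\n{system_message}<|im_end|>']
    for question, num_tiles, answer in turns:
        user = question if num_tiles == 0 else image_placeholder(num_tiles) + '\n' + question
        parts.append(f'<|im_start|>user\n{user}<|im_end|>')
        if answer is None:
            parts.append('<|im_start|>assistant\n')  # generation starts here
        else:
            parts.append(f'<|im_start|>assistant\n{answer}<|im_end|>')
    return ''.join(parts)

p = interleaved_prompt([('Describe image 1', 1, 'A cat.'), ('And this one?', 1, None)])
```

Note this only shows prompt assembly; whether the released checkpoints were trained on this layout is exactly the open question in this issue.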

irexyc commented 2 months ago

@hjh0119 The current code doesn't support the usage I described; as you said, it seems to place some restrictions on the input.

@czczup My question is whether InternVL-Chat is capable of interleaved image-text conversation, i.e. whether I can supply an image in any round (like the diagram given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.

hjh0119 commented 2 months ago

> @hjh0119 The current code doesn't support the usage I described; as you said, it seems to place some restrictions on the input.
>
> @czczup My question is whether InternVL-Chat is capable of interleaved image-text conversation, i.e. whether I can supply an image in any round (like the diagram given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.

Interleaved image-text conversation does work; you can refer to here.

irexyc commented 2 months ago

@hjh0119

I took a look at your code, and the way the prompt is assembled seems the same as the internvl demo: everything is placed in the first round's user turn, which is not what I mean by "interleaved". What I mean by interleaved is what you do for deepseek-vl, where the image tokens sit in each round's user turn rather than being concentrated in the first round's user turn.

So I would still like to confirm with the InternVL authors what the correct way is to handle multi-round conversations with images.

hjh0119 commented 2 months ago

@irexyc By "interleaved" do you mean that every input can come with a new image? Like this example:

<<< Describe this image.
Input a media path or URL <<<  http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.
irexyc commented 2 months ago

@hjh0119

For internvl: in your code the input looks interleaved, with a new image every round, but you are actually maintaining an image list, and the final prompt is still assembled by this function into the very first user turn.

For deepseek-vl you don't maintain an image_list; instead the image embeddings are inserted per round, inside each round's user turn.

With the former, if a new round of the conversation contains an image, the historical prompt changes (the kv-cache cannot be reused and has to be recomputed). With the latter it does not change. I don't think the two are equivalent.
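A toy illustration of this kv-cache point (placeholder strings only, with real token counts and assistant turns omitted). "Scheme A" concentrates all image tokens in the first user turn; "Scheme B" keeps each image in the round that introduced it:

```python
def round_prompts_scheme_a(rounds):
    """All images concentrated in the first user turn (demo style).
    rounds: list of (question, image_or_None)."""
    prompts = []
    for k in range(1, len(rounds) + 1):
        images = [img for _, img in rounds[:k] if img is not None]
        header = ''.join(f'<img>{img}</img>' for img in images)
        body = '||'.join(q for q, _ in rounds[:k])
        prompts.append(header + '\n' + body)
    return prompts

def round_prompts_scheme_b(rounds):
    """Each image stays in the round that introduced it (per-turn style)."""
    prompts = []
    for k in range(1, len(rounds) + 1):
        turns = [(f'<img>{img}</img>\n{q}' if img else q) for q, img in rounds[:k]]
        prompts.append('||'.join(turns))
    return prompts

rounds = [('Describe image 1', 'A'), ('Describe image 2', 'B'), ('Compare them', None)]
a = round_prompts_scheme_a(rounds)
b = round_prompts_scheme_b(rounds)
# Scheme A: round 2's prompt is not an extension of round 1's prompt,
# so the cached prefix must be recomputed when a new image arrives.
# Scheme B: every prompt strictly extends the previous one, so the
# kv-cache for the shared prefix can be reused.
```

Under Scheme A the new image tokens are spliced into the beginning of the history, invalidating everything after them; under Scheme B only new tokens are appended.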

hjh0119 commented 2 months ago

I see now. The main issue is how the image tokens in the history are handled, and indeed I haven't seen an official way of handling this here.