InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Output of `interleav_wrap_chat` #301

Open wlin-at opened 1 month ago

wlin-at commented 1 month ago

Hi, thanks for the great work. I tried the following code snippet with the internlm-xcomposer2-vl-7b model for a QA task with two input images.

```python
import os.path as osp
import torch

# Encode each image separately, then concatenate the embeddings.
images = [osp.join(image_folder_dir, "COCO_val2014_000000143961.jpg"),
          osp.join(image_folder_dir, "COCO_val2014_000000274538.jpg")]
image1 = model.encode_img(images[0])
image2 = model.encode_img(images[1])
image = torch.cat((image1, image2), dim=0)

# One <ImageHere> placeholder per image in the query.
query = "First picture: <ImageHere>, second picture: <ImageHere>. Describe the subject of these two pictures."
response, _ = model.interleav_wrap_chat(tokenizer, query, image, history=[], meta_instruction=True)
```

(Here `meta_instruction` is a required positional argument; I'm not sure whether it should be set to `True` or `False`.) However, I realized that the returned `response` is actually `{'inputs_embeds': wrap_embeds}`, i.e. input embeddings rather than text. How should I proceed from here to get the decoded text output? Thanks in advance!
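My current guess (not verified against the repo's own `chat()` implementation) is to feed the returned embeddings to the standard Hugging Face `model.generate()` via its `inputs_embeds` argument and decode the result with the tokenizer. The helper names below are mine, and the `[UNUSED_TOKEN_145]` end-of-turn marker is an assumption based on the InternLM2 chat format:

```python
def decode_from_embeds(model, tokenizer, wrapped, max_new_tokens=512):
    """Generate text from the dict returned by interleav_wrap_chat.

    `wrapped` is assumed to be {'inputs_embeds': wrap_embeds}; `model` and
    `tokenizer` are the loaded internlm-xcomposer2-vl-7b objects. This is a
    sketch of the generic transformers generation path, not the repo's API.
    """
    out_ids = model.generate(
        inputs_embeds=wrapped['inputs_embeds'],
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    text = tokenizer.decode(out_ids[0], skip_special_tokens=True)
    return clean_response(text)

def clean_response(text):
    # InternLM2-style chats appear to end a turn with this marker;
    # strip it and anything after it (assumption, see lead-in).
    return text.split('[UNUSED_TOKEN_145]')[0].strip()
```

If this is the intended path, `response = decode_from_embeds(model, tokenizer, response)` after the snippet above should give the answer string, but confirmation from the maintainers would be great.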