InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Does XComposer2-VL support multi-image input now? #297

Closed · zhourax closed this issue 2 months ago

zhourax commented 2 months ago

model = AutoModelForCausalLM.from_pretrained('your model path').cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('your model path')

images = ["./a.png", "./b.png"]
image1 = model.encode_img(images[0])
image2 = model.encode_img(images[1])
image = torch.cat((image1, image2), dim=0)

query = """First picture:, second picture:. Describe the subject of these two pictures?"""

response, _ = model.interleav_wrap_chat(tokenizer, query, image, history=[])
print(response)

The model is InternLM-XComposer2-VL-7B. Running the code above raises the following error:

    image1 = model.encode_img(images[0])
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/modeling_internlm_xcomposer2.py", line 118, in encode_img
    img_embeds, atts_img, img_target = self.img2emb(image)
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/modeling_internlm_xcomposer2.py", line 122, in img2emb
    img_embeds = self.vision_proj(self.vit(image.to(self.device)))
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype

yuhangzang commented 2 months ago

You may try:
(1) image = image.to('cuda')
(2) prompt = 'First picture: <ImageHere>, second picture: <ImageHere>. Describe the subject of these two pictures?'
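
For reference, a minimal sketch (not an official example) of how these two suggestions could fit into the snippet above; the Hub id used for model_path and the fp16 choice are assumptions, not something stated in this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'internlm/internlm-xcomposer2-vl-7b'   # assumed id; replace with your local path
# Loading every sub-module in a single dtype avoids fp16/fp32 mismatches inside vision_proj.
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

image1 = model.encode_img('./a.png')    # encode_img is part of the repo's custom modeling code
image2 = model.encode_img('./b.png')
image = torch.cat((image1, image2), dim=0).to('cuda')    # suggestion (1): keep the tensor on GPU

# Suggestion (2): one <ImageHere> placeholder per image in the prompt.
query = 'First picture: <ImageHere>, second picture: <ImageHere>. Describe the subject of these two pictures?'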

zhourax commented 2 months ago

It seems that the issue lies in the line image1 = model.encode_img(images[0]). My images[0] is the path to a regular PNG image, i.e. a string. The only change I made was to modify build_vision_tower() in build_mlp.py:

def build_vision_tower():
    vision_tower = 'openai/clip-vit-large-patch14-336'
    vision_tower = '/home/xxx/model/clip-vit-large-patch14-336'
    return CLIPVisionTower(vision_tower)

Could this be the reason for the issue? The full traceback is:

Traceback (most recent call last):
  File "/home/xxx/test.py", line 79, in <module>
    image1 = model.encode_img(images[0])
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/modeling_internlm_xcomposer2.py", line 118, in encode_img
    img_embeds, atts_img, img_target = self.img2emb(image)
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/modeling_internlm_xcomposer2.py", line 122, in img2emb
    img_embeds = self.vision_proj(self.vit(image.to(self.device)))
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype

yuhangzang commented 2 months ago

You can check whether the image tensor is of 'float32' dtype while the model weights are of 'float16' dtype.
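
A quick way to inspect those dtypes (a sketch; vit and vision_proj are the attribute names that appear in the traceback above):

# Compare the weight dtypes of the vision modules with the image tensor fed to them.
print(next(model.vit.parameters()).dtype)          # vision tower weights, e.g. torch.float16
print(next(model.vision_proj.parameters()).dtype)  # projection weights
# If they do not match, casting the whole model to a single dtype is one option:
# model = model.half()   # or model.float()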

yuhangzang commented 2 months ago

Kindly re-open if you still have any questions.

wlin-at commented 1 month ago

Hi, thanks for the great work. I tried the following code snippet with the internlm-xcomposer2-vl-7b model.

images = [osp.join( image_folder_dir, "COCO_val2014_000000143961.jpg"),
          osp.join( image_folder_dir, "COCO_val2014_000000274538.jpg")]
image1 = model.encode_img(images[0])
image2 = model.encode_img(images[1])
image = torch.cat((image1, image2), dim=0)
query = """First picture:<ImageHere>, second picture:<ImageHere>. Describe the subject of these two pictures?"""
response, _ = model.interleav_wrap_chat(tokenizer, query, image, history=[], meta_instruction= True)

(Here meta_instruction is a required positional argument; I am not sure whether it should be set to True or False.) However, I realized that the returned response is actually {'inputs_embeds': wrap_embeds}. How should I proceed to get the decoded text output? Thanks in advance!
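
Not an official answer, just a sketch of one possible way to continue from the returned dict: pass the wrapped embeddings to generate() and decode the output. The generation settings (max_new_tokens, do_sample) are placeholders, not values taken from this thread.

# Sketch: 'response' is the dict returned by interleav_wrap_chat above.
output_ids = model.generate(
    inputs_embeds=response['inputs_embeds'],
    max_new_tokens=256,   # placeholder value
    do_sample=False,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))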