how to realize multi-image correlation in vqa task?

X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

https://www.modelscope.cn/studios/damo/mPLUG-Owl

MIT License

2.25k stars, 171 forks

how to realize multi-image correlation in vqa task? #200

Open fansticOne opened 8 months ago

fansticOne commented 8 months ago

In a VQA task, I want to input two images and ask a question about both of them. How can I do this?

LukeForeverYoung commented 8 months ago

You can pass a list of images and place the same number of "<|image|>" in your prompt.
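A minimal sketch of the prompt side of this advice. `build_multi_image_prompt` is a hypothetical helper, not part of mPLUG-Owl, and the "USER: ... ASSISTANT:" template is an assumption modeled on the prompt quoted later in this thread. It only shows how to emit one "<|image|>" placeholder per image.

```python
IMAGE_TOKEN = "<|image|>"  # placeholder string quoted in the reply above


def build_multi_image_prompt(question: str, num_images: int) -> str:
    """Return a prompt with one image placeholder per image.

    Hypothetical helper for illustration; the template is an assumption.
    """
    if num_images < 1:
        raise ValueError("need at least one image")
    # Repeat the placeholder so its count matches the number of images.
    placeholders = IMAGE_TOKEN * num_images
    return f"USER: {placeholders}{question} ASSISTANT:"


prompt = build_multi_image_prompt("Are the two dogs the same color?", 2)
print(prompt)
```

The list of images would then be passed to the model's own preprocessing in the same order as the placeholders appear in the prompt.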

fansticOne commented 8 months ago

I passed a list of two images and modified the prompt accordingly. After preprocessing, image_tensor has a batch size of 2, while input_ids has a batch size of 1. When I run model.generate(), I do get a result, but it is wrong. Am I misunderstanding something?
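One quick sanity check for this situation, sketched under the assumption that the first dimension of the image tensor counts images rather than prompts. If so, two images with a single prompt is consistent as long as the prompt contains two placeholders. `check_alignment` is a hypothetical helper and takes a plain shape tuple so it does not depend on any particular tensor library.

```python
from typing import Sequence

IMAGE_TOKEN = "<|image|>"


def check_alignment(image_shape: Sequence[int], prompt: str) -> int:
    """Check that the prompt has one placeholder per image.

    Assumes image_shape[0] is the number of images (an assumption,
    not confirmed by the thread). Returns that count on success.
    """
    num_images = image_shape[0]
    num_placeholders = prompt.count(IMAGE_TOKEN)
    if num_images != num_placeholders:
        raise ValueError(
            f"{num_images} image(s) but {num_placeholders} placeholder(s) in prompt"
        )
    return num_images


# Example shape only: 2 images, 3 channels, 448x448 pixels (sizes are illustrative).
n = check_alignment((2, 3, 448, 448), "USER: <|image|><|image|>Same color? ASSISTANT:")
print(n)
```

If this check passes and the answer is still wrong, the mismatch in batch sizes is probably not the cause.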

LukeForeverYoung commented 7 months ago

Could you provide an example and the incorrect response generated by the owl? Btw, the owl has not been trained on SFT data that includes multiple images. Therefore, it is reasonable to expect that it might fail in some cases.

fansticOne commented 7 months ago

Here are the two images I passed: 1664356777209_m_11 and 1664356777209_m_17.

The prompt is:

'USER: <|image|><|image|>{}\nAnswer the question using a single word or phrase. ASSISTANT:'.format('Does the dog in the first picture have same color with the dog in the second picture?')

The response generated by the owl is 'Yes'.