I am following the LLaVa-NeXT-Image example and have been able to reproduce its results exactly. However, I would like to give the model two or more images in the same conversation prompt, so I am now passing two images as the image_tensor and modifying the prompt part like this:
image1 = Image.open("1.png")
image2 = Image.open("2.png")
# process_images, image_processor, model, conv, and DEFAULT_IMAGE_TOKEN are
# all set up exactly as in the LLaVa-NeXT-Image example
image_tensor = process_images([image1, image2], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
text1 = DEFAULT_IMAGE_TOKEN + "\nThis is image1"
text2 = DEFAULT_IMAGE_TOKEN + "\nThis is image2"
question = "tell me how many images do you see?"
# three consecutive user turns, then an empty assistant turn to start generation
conv.append_message(conv.roles[0], text1)
conv.append_message(conv.roles[0], text2)
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
...
The output I get is ['\nI see one image of a person with pink hair and glasses.']
The interesting thing is that image1 is a photo of a woman with pink hair, and image2 is a photo of a man with glasses, so the model seems to see both images but treats them as a single one. I don't know whether something is wrong with how I build the conversation or whether this is the expected behavior of llavanext-llama3-8B. I would also like to know whether there is a way to let the model see the two images at the same time but treat them as distinct images, so that I can ask it to compare them.
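For reference, the prompt layout I am trying to achieve would put both image placeholders into a single user turn. The snippet below is only a plain-string sketch of that layout (build_two_image_turn is a hypothetical helper, not a LLaVA-NeXT API; the placeholder string matches the DEFAULT_IMAGE_TOKEN used in the example):

```python
# Sketch of the intended prompt: two <image> placeholders in one user turn,
# so the model should receive two distinct image slots.
DEFAULT_IMAGE_TOKEN = "<image>"  # same placeholder string as in the example

def build_two_image_turn(caption1: str, caption2: str, question: str) -> str:
    # One user message containing both image tokens plus the question.
    return (
        f"{DEFAULT_IMAGE_TOKEN}\n{caption1}\n"
        f"{DEFAULT_IMAGE_TOKEN}\n{caption2}\n"
        f"{question}"
    )

prompt = build_two_image_turn(
    "This is image1", "This is image2",
    "tell me how many images do you see?",
)
print(prompt.count(DEFAULT_IMAGE_TOKEN))  # expect 2 placeholders
```

Whether the conversation template then maps each placeholder to the corresponding entry of image_tensor is exactly what I am unsure about.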