OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

Support multi-picture input #270

AuroraZengfh closed this issue 2 weeks ago

AuroraZengfh commented 2 weeks ago

Hi, I want to provide more than one picture during training/inference. Does MiniCPM-Llama3-V 2.5 support multi-picture input? If it does, how can I achieve that?

emanuelevivoli commented 2 weeks ago

Hi, I have seen this discussion on Hugging Face about multi-image inputs; it may help you with providing the images: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/discussions/2#6650908065ca369b64a301b4

Pay attention to the size of the images, because LLaMA3 has an 8k context window, which limits the resolution and the number of images you can feed into it. Hope it helps, Lele
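To make the context-window point concrete, here is a rough back-of-the-envelope sketch. The function name `max_images` and every number in the usage line (tokens per image, prompt length, answer reserve) are assumptions for illustration only, not values confirmed in this thread; the real per-image token cost depends on the image resolution and on how the model slices it.

```python
def max_images(context_tokens: int,
               prompt_tokens: int,
               tokens_per_image: int,
               reserve_for_answer: int = 512) -> int:
    """Estimate how many images fit in the context window.

    All arguments are caller-supplied estimates (hypothetical helper,
    not part of MiniCPM-V): the budget left after the text prompt and
    the space reserved for the generated answer is divided by the
    per-image token cost.
    """
    if tokens_per_image <= 0:
        raise ValueError("tokens_per_image must be positive")
    available = context_tokens - prompt_tokens - reserve_for_answer
    # Floor division; clamp to zero when the prompt alone overflows.
    return max(0, available // tokens_per_image)


# Illustrative numbers only: 8k window, 200-token prompt,
# and an assumed cost of 960 tokens per high-resolution image.
print(max_images(8192, 200, 960))
```

Lowering the resolution reduces the per-image cost, so more images fit, which is the trade-off described above.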

emanuelevivoli commented 2 weeks ago

I have tried the example with the two images in the assets (airplane and hk_OCR), and the model hallucinates a lot. I used the following setting:

msgs = [{'type': 'text', 'value': 'Answer in detail.'},
 {'type': 'text', 'value': 'This is the first image:'},
 {'type': 'image', 'value': './assets/airplane.jpeg'},
 {'type': 'text', 'value': 'And this the second one:'},
 {'type': 'image', 'value': './assets/hk_OCR.jpg'},
 {'type': 'text',
  'value': 'Please describe what you see in detail. Focus on one at a time.'}]

The images are the following: [image: airplane] [image: hk_OCR]

The model answer is the following:

"The image presents a dynamic urban scene, likely from a major city known for its vibrant nightlife and bustling commercial activity. The central focus is an airplane in mid-flight, which suggests the image captures a moment of travel or transit. The plane's livery is predominantly white with blue accents, and it bears the logo of China Southern Airlines, indicating its origin or destination. The aircraft's design is modern and large, characteristic of long-haul commercial flights.\n\nBelow the plane, the street below is alive with neon lights, creating a colorful backdrop that contrasts with the clear sky. The multitude of signs in various colors and scripts suggests a densely populated area with a high concentration of businesses, possibly a shopping or entertainment district. The presence of multiple vehicles, including buses, implies a well-connected transportation network typical of a metropolitan area.\n\nThe image evokes a sense of movement and energy, capturing the essence of urban life and the interconnectedness of different modes of transportation. It also hints at the economic and cultural vibrancy of the location depicted."

This makes me think the model is not able to distinguish the two images, nor does it understand the "first image" and "second image" labels I added in the text. I'm surprised by this model's capabilities on single images, but since it is not trained for interleaved or multi-image inputs, this drop in performance is expected.

AuroraZengfh commented 2 weeks ago

@emanuelevivoli thanks a lot for your kind suggestion!

Cuiunbo commented 2 weeks ago

@emanuelevivoli Yes, without training, it's possible that the model will hallucinate and not understand what multiple images are.

emanuelevivoli commented 1 week ago

Well, I have tried fusing the images into one, as is done here (a variation of the LLaVA project: https://github.com/mapluisch/LLaVA-CLI-with-multiple-images ), and it is worth a try 😄 I got better results with vertical concatenation 👍
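For anyone who wants to try the same idea, here is a minimal sketch of vertical concatenation with Pillow. It is not the code from the linked project; the helper names `vertical_layout` and `concat_vertical`, the white background, and the left alignment are my own assumptions.

```python
from typing import List, Sequence, Tuple


def vertical_layout(sizes: Sequence[Tuple[int, int]],
                    gap: int = 0) -> Tuple[Tuple[int, int], List[int]]:
    """Compute the canvas size and the y offset of each image.

    `sizes` holds (width, height) pairs in top-to-bottom order.
    Hypothetical helper, kept separate so the geometry is easy to check.
    """
    if not sizes:
        raise ValueError("need at least one image size")
    width = max(w for w, _ in sizes)  # canvas is as wide as the widest image
    offsets = []
    y = 0
    for _, h in sizes:
        offsets.append(y)
        y += h + gap
    height = y - gap  # no gap after the last image
    return (width, height), offsets


def concat_vertical(paths: Sequence[str], gap: int = 0,
                    background=(255, 255, 255)):
    """Stack the images at `paths` top to bottom on a single canvas."""
    from PIL import Image  # Pillow, imported lazily

    images = [Image.open(p).convert("RGB") for p in paths]
    canvas_size, offsets = vertical_layout([im.size for im in images], gap)
    canvas = Image.new("RGB", canvas_size, background)
    for im, y in zip(images, offsets):
        canvas.paste(im, (0, y))  # left-aligned; narrower images leave padding
    return canvas


# Example with the two asset images mentioned earlier in the thread:
# fused = concat_vertical(["./assets/airplane.jpeg", "./assets/hk_OCR.jpg"])
# fused.save("fused.jpg")
```

The fused image is then passed as a single image, and the prompt can refer to "the top image" and "the bottom image" instead of "first" and "second". Keep in mind that the fused image is larger, so the context-window limit mentioned earlier still applies.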