eric-ai-lab / MiniGPT-5

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
https://eric-ai-lab.github.io/minigpt-5.github.io/
Apache License 2.0

About image comprehension task #30

Closed by MajorDavidZhang 7 months ago

MajorDavidZhang commented 8 months ago

Firstly, thank you for your contributions to the multi-modal large language model (MLLM) research with MiniGPT-5. I'm experiencing an issue while testing the model's image comprehension capabilities.

Issue Description: The model consistently generates meaningless images and text when provided with an image input.

Reproduction Steps:

  1. Running playground.py with the same example produces expected outputs.
  2. Text-only inputs result in reasonable responses. For example:
    • Text Input: "###Human: Can you tell me a joke? ###Assistant:"
    • Generated Text: "Sure! What did the snowman say to his wife? Can we go in circles around a little longer, hon?) ###"
  3. However, with image inputs, the responses are not meaningful. Here is an example:
    • Text Input: "Give the following images in ImageContent format. You will be able to see the images once I provide it to you. ###Human: Can you describe the imageImageContent? ###Assistant:"
    • Image Used: image
    • Generated Output: Text: "yes i can [IMG0] ###"; Image: image
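For reference, the two prompts above can be assembled as in the sketch below. `build_prompt` and `IMAGE_PREFIX` are hypothetical names introduced only for illustration and are not part of the MiniGPT-5 code; the strings are copied from the examples in this report exactly as they appear here.

```python
# Hypothetical helper, not part of the MiniGPT-5 codebase. It only assembles
# prompt strings in the "###Human: ... ###Assistant:" layout quoted above.
IMAGE_PREFIX = (
    "Give the following images in ImageContent format. "
    "You will be able to see the images once I provide it to you."
)


def build_prompt(user_text: str, with_image_prefix: bool = False) -> str:
    """Return a single-turn prompt, optionally preceded by the image prefix."""
    turn = f"###Human: {user_text} ###Assistant:"
    if with_image_prefix:
        return f"{IMAGE_PREFIX} {turn}"
    return turn


# Text-only prompt (step 2) and image prompt (step 3) from this report.
text_only = build_prompt("Can you tell me a joke?")
with_image = build_prompt(
    "Can you describe the imageImageContent?", with_image_prefix=True
)
print(text_only)
print(with_image)
```

Building both prompts with the same helper makes it easier to confirm that the only difference between the working and failing cases is the image prefix and the image placeholder.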

In most other cases, the model fails to produce a meaningful text response and emits only meaningless text.

Given that MiniGPT-5 builds upon MiniGPT-4, which handled similar tasks effectively, I am curious about your insights on this issue. Have you encountered or tested this scenario during development?

Thank you for your time and assistance.

KzZheng commented 8 months ago

I think this issue is caused by the data format in VIST. During VIST training, the model is usually asked to generate a new image, so after fine-tuning it may forget some of the original prompts from MiniGPT-4. You can try the weights trained on MMDialog, 'stage2_mmdialog.ckpt', which should be able to describe the image correctly. I just tried your image example, and the model output "There are 2 boys playing with sand at the beach."
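A quick way to compare checkpoints on this behaviour is to check whether a response contains an image marker such as `[IMG0]` instead of a plain description. The sketch below is only an illustration: the marker pattern `[IMG<number>]` is an assumption inferred from the single `[IMG0]` example in this thread, and `split_response` is a hypothetical helper, not a function from the MiniGPT-5 code.

```python
import re
from typing import List, Tuple

# Assumption: image markers look like "[IMG0]", "[IMG1]", ... as in the
# single example quoted in this thread.
IMAGE_MARKER = re.compile(r"\[IMG\d+\]")


def split_response(generated: str) -> Tuple[str, List[str]]:
    """Separate plain text from image markers in a generated string."""
    markers = IMAGE_MARKER.findall(generated)
    text = IMAGE_MARKER.sub("", generated)
    # Drop the "###" turn separator and collapse leftover whitespace.
    text = " ".join(text.replace("###", "").split())
    return text, markers


# Output reported with the VIST-tuned weights.
vist_text, vist_markers = split_response("yes i can [IMG0] ###")
# Output reported with 'stage2_mmdialog.ckpt'.
mm_text, mm_markers = split_response(
    "There are 2 boys playing with sand at the beach."
)
print(vist_text, vist_markers)
print(mm_text, mm_markers)
```

A response with a non-empty marker list and very little text suggests the model fell back to image generation rather than describing the input.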