About image comprehension task

First, thank you for your contributions to multi-modal large language model (MLLM) research with MiniGPT-5. I am running into an issue while testing the model's image comprehension capabilities.

Issue Description: The model consistently generates meaningless images and text when provided with an image input.

Reproduction Steps:

Running playground.py with the same example produces the expected outputs.
Text-only inputs result in reasonable responses. For example:
- Text Input: "###Human: Can you tell me a joke? ###Assistant:"
- Generated Text: "Sure! What did the snowman say to his wife? Can we go in circles around a little longer, hon?) ###"
However, with image inputs, the responses are not meaningful. Here is an example:
- Text Input: "Give the following images in ImageContent format. You will be able to see the images once I provide it to you. ###Human: Can you describe the imageImageContent? ###Assistant:"
- Image Used:
- Generated Output: Text: "yes i can [IMG0] ###"; Image:

In many other cases, the model does not produce any usable text response at all and only outputs meaningless text.
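
For anyone trying to reproduce this, here is a minimal sketch of how the two prompts above could be assembled so the text-only and image cases can be compared side by side. `build_prompt` and `SYSTEM_PROMPT` are hypothetical names used for illustration only; they are not part of the MiniGPT-5 code base, and the sketch does not call the model.

```python
# Hypothetical helper for illustration only; not part of the MiniGPT-5 code base.
# The prompt strings are copied verbatim from the examples in this issue.
SYSTEM_PROMPT = (
    "Give the following images in ImageContent format. "
    "You will be able to see the images once I provide it to you."
)


def build_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a user message in the ###Human / ###Assistant turn format."""
    prefix = f"{system_prompt} " if system_prompt else ""
    return f"{prefix}###Human: {user_message} ###Assistant:"


# Text-only case: produces a reasonable response.
text_only_prompt = build_prompt("Can you tell me a joke?")

# Image case: produces meaningless output.
image_prompt = build_prompt("Can you describe the imageImageContent?", SYSTEM_PROMPT)

print(text_only_prompt)
print(image_prompt)
```

The only difference between the two cases is the image-related prefix and the image reference in the user turn, which is why I suspect the image pathway rather than the text generation itself.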

Given that MiniGPT-5 builds upon MiniGPT-4, which handled similar tasks effectively, I am curious about your insights on this issue. Have you encountered or tested this scenario during development?

Thank you for your time and assistance.
