Closed: MajorDavidZhang closed this issue 7 months ago.
I think this issue is caused by the data format in VIST. During VIST training, the model is usually asked to generate a new image. After fine-tuning, it may forget some of the original prompting behavior from MiniGPT-4. You can try the MMDialog weights, 'stage2_mmdialog.ckpt'; with those, it should be able to describe the image correctly. I just tried your image example, and the model outputs "There are 2 boys playing with sand at the beach."
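The exact loading command depends on the repo's configuration, so the snippet below is only a generic sketch of the idea of swapping stage-2 checkpoints: point the loader at a different weights file. All file names and dictionary contents here are hypothetical stand-ins (the real '.ckpt' files are PyTorch state dicts loaded with `torch.load`), not the MiniGPT-5 CLI.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the two stage-2 checkpoints; in the real repo
# these would be PyTorch state dicts saved during fine-tuning.
checkpoints = {
    "stage2_vist.ckpt": {"finetune_data": "VIST"},
    "stage2_mmdialog.ckpt": {"finetune_data": "MMDialog"},
}

# Write the stand-in checkpoint files to a temporary directory.
ckpt_dir = tempfile.mkdtemp()
for name, state in checkpoints.items():
    with open(os.path.join(ckpt_dir, name), "wb") as f:
        pickle.dump(state, f)

# Switching checkpoints is just a matter of loading the other file
# (with torch, this line would be: state = torch.load(path)).
path = os.path.join(ckpt_dir, "stage2_mmdialog.ckpt")
with open(path, "rb") as f:
    state = pickle.load(f)

print(state["finetune_data"])  # → MMDialog
```

The point is simply that the VIST and MMDialog fine-tunes are separate weight files, so testing image description with the MMDialog checkpoint requires no code changes beyond the checkpoint path.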
First, thank you for your contributions to multi-modal large language model (MLLM) research with MiniGPT-5. I'm experiencing an issue while testing the model's image comprehension capabilities.
Issue Description: The model consistently generates meaningless images and text when provided with an image input.
Reproduction Steps: Running `playground.py` with the same example produces the expected outputs. In other cases, the model will simply refuse to generate any coherent text output and produce only meaningless text.
Given that MiniGPT-5 builds upon MiniGPT-4, which handled similar tasks effectively, I am curious about your insights on this issue. Did you encounter or test this scenario during development?
Thank you for your time and assistance.