The main difference between MiniGPT-4 and BLIP-2 is the training strategy. We notice that BLIP-2's training strategy is not enough to align the vision module well with powerful LLMs like Vicuna, and it seriously impacts Vicuna's text generation ability. Therefore, we propose a novel way to collect a small yet high-quality image-description pair dataset created by the model itself and polished by ChatGPT. After the traditional image-text training stage, as in BLIP-2, we further fine-tune MiniGPT-4 on this dataset together with conversation prompts, so MiniGPT-4 can generate coherent text to answer users' questions, which improves its usability. This fine-tuning stage is very efficient and can be finished in 7 minutes on a single A100, yet its effect is significant.
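To make "fine-tune with conversation prompts" concrete, here is a minimal sketch of how a stage-2 training sample could be assembled. The template string, placeholder names, and example text are illustrative assumptions, not the exact ones used in our repo:

```python
# Hypothetical prompt assembly for second-stage fine-tuning.
# The template and placeholder names below are illustrative assumptions.
PROMPT_TEMPLATE = (
    "###Human: <Img><ImageHere></Img> {instruction} ###Assistant: {description}"
)

def build_stage2_sample(instruction: str, description: str) -> str:
    """Wrap a curated image description in a conversational prompt so the
    model learns to answer like a chat assistant when an image is given."""
    return PROMPT_TEMPLATE.format(instruction=instruction, description=description)

# Example usage (the instruction and description strings are made up):
print(build_stage2_sample(
    "Describe this image in detail.",
    "A golden retriever is lying on a wooden porch in the afternoon sun.",
))
```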
Another important finding is that we don't fine-tune the Q-Former as BLIP-2 does; we directly reuse the Q-Former that was already aligned with FlanT5 and only train a single projection layer. We show that such a simple linear layer is enough to let Vicuna see the image, which makes our training very efficient.
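For clarity, here is a minimal sketch of that trainable piece; the class name and the dimensions are assumptions for illustration, not the exact values in our code:

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Illustrative sketch: the only trainable component is one linear layer
    that maps frozen Q-Former output tokens into the LLM's embedding space."""

    def __init__(self, qformer_dim: int = 768, llm_embed_dim: int = 4096):
        super().__init__()
        # Single trainable projection (dimensions are assumed for illustration).
        self.proj = nn.Linear(qformer_dim, llm_embed_dim)

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        # qformer_tokens: (batch, num_query_tokens, qformer_dim), produced by the
        # frozen, FlanT5-aligned Q-Former. The output lives in the LLM embedding
        # space and is fed to the frozen LLM alongside the text token embeddings.
        return self.proj(qformer_tokens)

# Everything else (vision encoder, Q-Former, Vicuna) stays frozen;
# only `proj` receives gradients during both training stages.
```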
Can you provide more samples for Stage-1 training to verify that Stage-2 is needed?
We plan to update our paper in 2 days to provide some qualitative and quantitative comparisons between stage-1 and stage-2. Stay tuned!
Excellent job! I have some inquiries regarding the model:
@Pilot-LH Thanks for your interest!

A1. Yes, you are correct: we directly use the Q-Former aligned with FlanT5 XXL in our model.

A2. Here I mean the second stage of BLIP-2, as our first-stage pretraining is quite similar to BLIP-2's second-stage training. The difference is that we only train one linear layer.

A3. This is a good question. We didn't try this, but I think the reason it works in our case is that Vicuna alone is already a close-to-ChatGPT-level model with strong conversation ability. The second-stage fine-tuning reactivates this ability when visual input is given, so the training is light. In contrast, FlanT5's conversation ability is weak, so I guess FlanT5 would first need to learn how to chat well with humans, and our small dataset doesn't have the capacity to teach it that. I expect fully open-sourced LLMs that work like Vicuna will appear soon, as it is clear how Vicuna is built, and I think our training method can be applied directly once such an LLM is ready.
Thank you for your response. I now have a clear understanding of the model. I agree with your point that this approach can be applied to other large language models (LLMs). In my opinion, one of the major challenges for the open-source community is to reproduce LLaMA. Once this is accomplished, there will likely be models available that are much more advanced than the current Vicuna.
It seems that MiniGPT-4 is just BLIP-2 with the LLM swapped for an open-source GPT-style model?