Vision-CAIR / MiniGPT-4

Open-sourced code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
BSD 3-Clause "New" or "Revised" License

The difference with BLIP-2? #7

Closed · hello451 closed this issue 1 year ago

hello451 commented 1 year ago

It seems that MiniGPT-4 is just BLIP-2 with the LLM swapped for an open-source GPT-style model?

TsuTikgiau commented 1 year ago

The main difference between MiniGPT-4 and BLIP-2 is the training strategy. We noticed that BLIP-2's training strategy is not enough to align the vision module well with a powerful LLM like Vicuna, and it seriously hurts Vicuna's text generation ability. Therefore, we propose a novel way to collect a small yet high-quality image-description pair dataset, created by the model itself and polished by ChatGPT. After the traditional image-text training stage like the one BLIP-2 uses, we further fine-tune MiniGPT-4 on this dataset together with conversation prompts, so MiniGPT-4 can generate coherent text to answer users' questions, which improves its usability. This fine-tuning stage is very efficient: it can be finished in about 7 minutes on a single A100. However, its effectiveness is significant.
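
As a rough illustration of this second stage (a sketch, not our actual training code), the snippet below wraps an image-description pair in a conversation-style prompt before fine-tuning; the template string, placeholder token, and function name are illustrative assumptions.

```python
# Sketch of second-stage data formatting: each (image, polished description)
# pair is wrapped in a chat-style prompt so the model learns to answer in a
# conversational format. The exact template is an assumption for illustration.
def build_stage2_sample(image_placeholder: str, instruction: str, description: str) -> str:
    # The image placeholder is replaced by projected visual tokens at training time.
    return (
        f"###Human: <Img>{image_placeholder}</Img> {instruction} "
        f"###Assistant: {description}"
    )

sample = build_stage2_sample(
    image_placeholder="<ImageHere>",
    instruction="Describe this image in detail.",
    description="A golden retriever runs across a grassy field at sunset.",
)
print(sample)
```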

Another important finding is that we don't fine-tune the Q-Former as BLIP-2 does; instead, we directly reuse the Q-Former that was already aligned with FlanT5 and only train a single projection layer. We show that such a simple linear layer is enough to let Vicuna see the image. This makes our training very efficient.
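
A minimal PyTorch sketch of this setup (illustrative, not the repository's actual code; the 768/4096 dimensions and 32 query tokens are assumptions): the vision encoder and Q-Former stay frozen, and only a single linear projection into the LLM's embedding space is trained.

```python
# Minimal sketch of the alignment: a frozen vision encoder + frozen Q-Former
# produce query embeddings; only one linear projection into the LLM's
# embedding space is trained. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    def __init__(self, qformer_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)  # the only trainable parameters

    def forward(self, qformer_output: torch.Tensor) -> torch.Tensor:
        # qformer_output: (batch, num_query_tokens, qformer_dim)
        return self.proj(qformer_output)  # (batch, num_query_tokens, llm_dim)

projector = VisionToLLMProjection()
with torch.no_grad():
    # Stand-in for the output of the frozen ViT + Q-Former (32 query tokens).
    qformer_out = torch.randn(1, 32, 768)
visual_tokens = projector(qformer_out)  # fed to the frozen LLM as soft visual prompts
print(visual_tokens.shape)              # torch.Size([1, 32, 4096])
```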

vateye commented 1 year ago

Can you provide more samples from Stage-1 training to verify that Stage-2 is needed?

TsuTikgiau commented 1 year ago

We plan to update our paper in 2 days to provide some qualitative and quantitative comparisons between stage-1 and stage-2 training. Stay tuned!

Pilot-LH commented 1 year ago

Excellent job! I have some inquiries regarding the model:

  1. Based on my understanding, there are two stages in BLIP-2, and you are directly using the pre-trained Q-Former from the second stage of BLIP-2, which was aligned with FlanT5, in this model. Please correct me if I am mistaken.
  2. It is intriguing that only a linear layer needs to be adjusted, rather than the Q-Former as in BLIP-2. When you mentioned "after the traditional image-text training stage like BLIP-2 did," were you referring to the first stage, the second stage, or both stages of BLIP-2? As far as I know, the first stage of BLIP-2 is not traditional, and it is the key to the success of BLIP-2 (as shown in Figure 5 of the BLIP-2 paper).
  3. Fine-tuning on a small but high-quality dataset appears to be quite effective. Could BLIP-2 also benefit from this approach? I ask because Vicuna is not entirely open source.

TsuTikgiau commented 1 year ago

@Pilot-LH Thanks for your interest!

A1. Yes, you are correct: we directly use the Q-Former aligned with FlanT5 XXL in our model.

A2. Here I mean the second stage of BLIP-2, as our first-stage pre-training is quite similar to BLIP-2's second-stage training. The difference is that we only train one linear layer.

A3. This is a good question. We haven't tried this, but I think the reason it works in our case is that Vicuna alone is already a close-to-ChatGPT-level model with powerful conversation ability. The second-stage fine-tuning reactivates this ability when visual input is given, which is why the training is light. In contrast, Flan-T5's conversation ability is weak, so I guess Flan-T5 would first need to learn how to chat well with humans, and our small dataset does not have the capacity to teach it that. I expect there will soon be fully open-sourced LLMs that work like Vicuna, since the way Vicuna is built is clear, and I think our training method can be applied directly once such an LLM is ready.

Pilot-LH commented 1 year ago

Thank you for your response. I now have a clear understanding of the model. I agree that this approach can be applied to other large language models (LLMs). In my opinion, one of the major challenges for the open-source community is to reproduce LLaMA. Once this is accomplished, there will likely be much more advanced models available than the current Vicuna model.