Vision-CAIR / MiniGPT-4

Open-sourced code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
BSD 3-Clause "New" or "Revised" License

Why is the stage-1 freeze_qformer setting set to True? #360

Open sujaly opened 1 year ago

sujaly commented 1 year ago

While preparing for the first stage of pre-training, I found that the Q-Former is frozen (freeze_qformer). The purpose of the first stage of pre-training is for the Q-Former to learn the representation of the image, so why is it frozen here?

model:
  arch: mini_gpt4
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True

TsuTikgiau commented 1 year ago

We don't train the Q-Former in our model. Only the linear layer that connects the Q-Former and the LLM is trained. In addition, in our LLaMA-2 version we removed the Q-Former entirely; there, the linear layer connects the CLIP ViT directly to the LLM.
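
For reference, here is a minimal PyTorch sketch of that setup: the vision encoder and Q-Former are frozen, and only the linear projection into the LLM's embedding space receives gradients. Module and argument names here are illustrative placeholders, not the exact MiniGPT-4 class names.

```python
import torch
import torch.nn as nn

class FrozenBackboneProjector(nn.Module):
    """Frozen ViT + frozen Q-Former; only the linear projection is trainable."""

    def __init__(self, vit: nn.Module, qformer: nn.Module,
                 qformer_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.vit = vit
        self.qformer = qformer
        # The only trainable piece: project Q-Former query outputs to the LLM dimension.
        self.proj = nn.Linear(qformer_hidden_size, llm_hidden_size)

        # Corresponds to freeze_vit: True / freeze_qformer: True in the config.
        for p in self.vit.parameters():
            p.requires_grad = False
        for p in self.qformer.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                        # frozen backbones
            patch_feats = self.vit(images)           # (B, N_patches, D_vit)
            query_feats = self.qformer(patch_feats)  # (B, N_query, D_qformer)
        return self.proj(query_feats)                # (B, N_query, D_llm), fed to the LLM
```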

sujaly commented 1 year ago

This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email in person. I will get back to you as soon as possible after the vacation ends.

sujaly commented 1 year ago

I now understand the training process. Thank you very much for your reply.

hhnqqq commented 1 year ago

> We don't train the Q-Former in our model. Only the linear layer that connects the Q-Former and the LLM is trained. In addition, in our LLaMA-2 version we removed the Q-Former entirely; there, the linear layer connects the CLIP ViT directly to the LLM.

I wonder why the Q-Former was removed. If there is no Q-Former, the model is the same as LLaVA, right?

TsuTikgiau commented 1 year ago

@hhnqqq Yup, it shares some similarities with LLaVA in the architecture. The reason is that we think the Q-Former's job should also be doable by the LLM. In that case, instead of using a Q-Former to 'translate' the image first, letting the LLM see the original image patches directly is more 'elegant'. And we don't find obvious differences with or without the Q-Former in our exploration.
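
A minimal sketch of that Q-Former-free variant, assuming a frozen CLIP/EVA ViT whose patch embeddings are mapped by a single linear layer straight into the LLM's input embedding space. The dimensions and module names are assumptions for illustration, not the exact MiniGPT-v2 code.

```python
import torch
import torch.nn as nn

class PatchToLLMProjector(nn.Module):
    """No Q-Former: ViT patch tokens go through one linear layer into the LLM space."""

    def __init__(self, vit: nn.Module, vit_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit                      # frozen vision encoder
        self.proj = nn.Linear(vit_dim, llm_dim)
        for p in self.vit.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.vit(images)      # (B, N_patches, vit_dim)
        # Each patch token becomes one "visual token" in the LLM's input sequence.
        return self.proj(patches)           # (B, N_patches, llm_dim)
```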

hhnqqq commented 1 year ago

> @hhnqqq Yup, it shares some similarities with LLaVA in the architecture. The reason is that we think the Q-Former's job should also be doable by the LLM. In that case, instead of using a Q-Former to 'translate' the image first, letting the LLM see the original image patches directly is more 'elegant'. And we don't find obvious differences with or without the Q-Former in our exploration.

I trained LLaVA on my downstream classification dataset. The performance of LLaVA was terrible. However, MiniGPT-4 with Vicuna v1.5 performs flawlessly. I suspect that the difference in pretraining data could be the reason. Additionally, the Q-Former may also contribute to the improved performance. (I trained both models with LoRA on the attention layers of the large language model and kept the projection layer frozen.)
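
For concreteness, a minimal sketch of that kind of fine-tuning setup using HuggingFace `peft`: LoRA adapters on the LLM's attention projections, with the vision-to-LLM projection kept frozen. The checkpoint name, hyperparameters, and the `llama_proj` attribute are illustrative assumptions, not the exact settings used in the experiment above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed base LLM checkpoint for illustration.
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA-style attention layers
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the LoRA adapters are trainable

# The vision-to-LLM linear projection would stay frozen, e.g.:
# for p in model.llama_proj.parameters():   # `llama_proj` is a hypothetical attribute name
#     p.requires_grad = False
```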

Davidwhw commented 5 months ago

> Yup, it shares some similarities with LLaVA in the architecture. The reason is that we think the Q-Former's job should also be doable by the LLM. In that case, instead of using a Q-Former to 'translate' the image first, letting the LLM see the original image patches directly is more 'elegant'. And we don't find obvious differences with or without the Q-Former in our exploration.

Does this mean that BLIP-2's work on the Q-Former is redundant? As long as the LLM is strong enough, is there no need to do much explicit alignment of images and text?