Open · sujaly opened this issue 1 year ago
We don't train the Q-former in our model. Only the linear layer that connects the Q-former and the LLM is trained. In addition, in our llama2 version we further removed the Q-former; the linear layer connects the CLIP ViT directly to the LLM.
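For readers asking what exactly is trained, here is a minimal sketch of that trainable projection, assuming illustrative feature sizes (e.g. a ViT hidden size of 1408 and an LLM hidden size of 4096); the class and variable names are hypothetical, not the repository's actual module names:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """The only trainable piece: a linear map from visual features into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        # Input features are either Q-former outputs (vicuna version) or
        # raw CLIP ViT patch features (llama2 version, Q-former removed).
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(visual_feats)

# Everything else stays frozen; only the projector receives gradients, e.g.:
# vit.requires_grad_(False); qformer.requires_grad_(False); llm.requires_grad_(False)
```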
I understand the training process now. Thank you very much for your reply.
I wonder why you removed the Q-former. If there is no Q-former, the model is the same as LLaVA, right?
@hhnqqq yup, it will share some similarities with LLaVA in the architecture. The reason is that we think the Q-former's job should also be doable by the LLM. In this case, instead of using a Q-former to 'translate' the image first, directly letting the LLM see the original image patches is more 'elegant'. And we didn't find obvious differences with or without the Q-former in our exploration.
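To make the design difference concrete, here is a rough sketch of the two paths; the shapes (257 ViT patch tokens, 32 Q-former queries, 1408/4096 hidden sizes) are illustrative assumptions, not the repository's exact values:

```python
import torch
import torch.nn as nn

# Hypothetical shapes used only to illustrate the two designs.
batch, num_patches, vit_dim, llm_dim = 1, 257, 1408, 4096

patch_feats = torch.randn(batch, num_patches, vit_dim)  # frozen CLIP ViT output

# (a) Q-former path: compress the patches into a small set of query tokens
#     first, then project those queries into the LLM embedding space.
# query_tokens = qformer(patch_feats)        # (batch, 32, qformer_dim)
# llm_inputs   = linear_q(query_tokens)      # (batch, 32, llm_dim)

# (b) Direct path (llama2 version): skip the 'translation' step and hand the
#     LLM every patch token through a single linear projection.
linear_direct = nn.Linear(vit_dim, llm_dim)
llm_inputs = linear_direct(patch_feats)      # (batch, num_patches, llm_dim)
```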
I trained LLaVA on my downstream classification dataset. The performance of LLaVA was terrible. However, MiniGPT-4 with Vicuna v1.5 performs flawlessly. I suspect that the difference in pretraining data could be the reason. Additionally, the Q-former may also contribute to the improved performance. (I trained both models with LoRA on the attention layers of the large language model and kept the projection layer frozen.)
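For reference, a minimal sketch of that fine-tuning setup using the `peft` library, assuming a Hugging Face-style LLaMA/Vicuna model; the LoRA rank and the attention module names are illustrative and depend on the actual model implementation:

```python
from peft import LoraConfig, get_peft_model

# LoRA adapters on the LLM attention layers only; rank/alpha values are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    bias="none",
    task_type="CAUSAL_LM",
)

# llm = ...                                # load the language model here
# llm = get_peft_model(llm, lora_config)   # only the LoRA adapters are trainable

# Keep the vision-to-LLM projection layer frozen, as described above:
# for p in projection_layer.parameters():
#     p.requires_grad = False
```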
Does this mean that BLIP-2's work on the Q-former is redundant? As long as the LLM is strong enough, is there no need to do much explicit image-text alignment?
When I was preparing for the first stage of pre-training, I found that `freeze_qformer` was set to True. I thought the purpose of the first stage of pre-training was for the Q-former to learn image representations, so why is it frozen here?
```yaml
model:
  arch: mini_gpt4
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
```
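For context, a rough sketch of what flags like `freeze_vit` / `freeze_qformer` typically do inside a model constructor; this is an illustrative assumption, not the repository's exact code:

```python
def freeze_module(module):
    """Disable gradients and keep the module in eval mode so it is never updated."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()

# if cfg.freeze_vit:
#     freeze_module(visual_encoder)
# if cfg.freeze_qformer:
#     freeze_module(qformer)
```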