AlignGPT-VL / AlignGPT

Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"
https://aligngpt-vl.github.io/

Questions about the Adaptive Alignment-based Instruction-tuning #1

Closed friedrichor closed 3 months ago

friedrichor commented 3 months ago

Hello. Thanks for your excellent work!

I find your work very interesting and the motivation very sound. The experimental results also show that the approach is effective, but I have some confusion about the model architecture.

I can clearly understand the architecture and strategy of the pre-training phase: based on the similarity between the image and the text, a corresponding alignment embedding is prepended to the input. Put another way, I can treat each alignment embedding as a task-related soft prompt (similar to P-Tuning), ranging from weak to strong alignment, representing tasks that require local features (e.g. image captioning) and tasks that require global features (e.g. VQA). However, I am confused by the series of operations in the instruction-tuning phase.
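To make sure my reading of the pre-training phase is right, here is a minimal sketch of it as I understand it; the number of levels, the bucketing scheme, and all the names here are my own assumptions, not the paper's actual implementation:

```python
import numpy as np

def alignment_level(img_emb, txt_emb, n_levels=8):
    """Bucket image-text cosine similarity into one of n_levels discrete
    alignment levels (my reading of the pre-training setup; n_levels and
    the uniform bucketing are assumptions)."""
    cos = float(img_emb @ txt_emb /
                (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)))
    # map cosine in [-1, 1] onto bucket indices 0 .. n_levels-1
    idx = int((cos + 1.0) / 2.0 * n_levels)
    return min(idx, n_levels - 1)

rng = np.random.default_rng(0)
d, n_levels = 16, 8
align_embeds = rng.normal(size=(n_levels, d))  # one learnable embedding per level

img, txt = rng.normal(size=d), rng.normal(size=d)
lvl = alignment_level(img, txt, n_levels)
tokens = rng.normal(size=(5, d))               # stand-in input token embeddings
# prepend the chosen level's alignment embedding, soft-prompt style
inputs = np.vstack([align_embeds[lvl:lvl + 1], tokens])
```

So during pre-training, if I understand correctly, the only thing the similarity decides is *which* single embedding gets prepended.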

  1. What is the physical significance of $H_I \otimes H_T$ in Equation (2) of the paper? $H_I$ is the image embedding (image_embeds) and $H_T$ is the average embedding of the text, so it looks like $H_I$ is scaled element-wise by $H_T$. What is the practical meaning of the result? I cannot understand why this result, after an MLP and a softmax, becomes a weight vector $\alpha$: each of the N_IMAGE_TOKEN tokens in $H_I$ is scaled by the same $H_T$, so no particular region is scaled differently from the others.
  2. I am also confused by Equation (3). Following the pre-training phase, $H_{align}$ should play the role of an alignment embedding. It would be understandable if it were $H_N$ alone or $\sum_i \alpha_i H_i$ alone, but why is it the sum of the two, and how does this differ from the pre-training phase?
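For concreteness, here is a minimal numerical sketch of how I currently read Equations (2) and (3); the hidden size, the number of alignment levels, the stand-in single-layer MLP, and the mean-pooling over image tokens are all my assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_img, n_align = 8, 4, 6          # hidden size, image tokens, alignment levels (assumed)

H_I = rng.normal(size=(n_img, d))    # image token embeddings
H_T = rng.normal(size=d)             # average text embedding

# Eq. (2) as I read it: H_I (x) H_T scales every image token element-wise
# by the same text vector, then an MLP + softmax yields the weights alpha.
gated = H_I * H_T                    # (n_img, d), broadcast over tokens
W = rng.normal(size=(d, n_align))    # stand-in for the MLP
logits = (gated @ W).mean(axis=0)    # pool over image tokens (assumption)
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()                 # softmax: weights over alignment levels

# Eq. (3) as written in my question: H_align = H_N + sum_i alpha_i * H_i,
# i.e. the strongest alignment embedding plus a weighted mix of all of them.
H_levels = rng.normal(size=(n_align, d))   # pre-trained alignment embeddings H_1..H_N
H_align = H_levels[-1] + alpha @ H_levels  # (d,)
```

My confusion is exactly about the last line: every image token contributes through the same pooled gate, and $H_N$ is added on top of the weighted sum rather than being selected as in pre-training.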

I would appreciate it if you could answer my confusion.

friedrichor commented 3 months ago

To add some context: I asked my question after having already read the code. The shapes of the input and output vectors in both formulas look reasonable to me, but I still don't understand why the vectors, after these operations, carry the physical meaning claimed in the paper.

1429904852 commented 3 months ago

> To supplement, my question was asked after having already looked at the code, and I think the shapes of the input and output vectors for both formulas are reasonable, but I still don't understand why the vectors after being so operated make the physical sense that you claim in your paper.

Sorry for the confusion. Let me first describe the workflow of AlignGPT.

Then we will answer your questions step by step:

friedrichor commented 3 months ago

Thanks very much for your answer. I think I have understood your method.