Fantasyele / LLaVA-KD


Confusion about the vision tokens #1

Open Zhaoyi-Yan opened 4 weeks ago

Zhaoyi-Yan commented 4 weeks ago

Usually, an LLM only generates text tokens. Typically, a [cls] token is passed to the lm_head to generate the logits of the predicted token, and perhaps only the response tokens are predicted. What is the meaning of taking the prompt, vision tokens, and response tokens as the distillation target? Can you elaborate on this, especially on using the vision tokens as the prediction target?

youngwanLEE commented 2 weeks ago

+1, same question. On page 6 of the main paper:

"Relation Distillation (RDist). To enable the student model to capture the complex relationships in visual information, we construct a self-correlation matrix from the visual tokens output by the LLM."

Fantasyele commented 2 weeks ago

Hi, thank you for your interest in our work.

In fact, in an MLLM, the output of the LLM (output['logits']) includes more than just the response. Typically, during training, the label mask is used to exclude the visual and prompt parts, leaving only the response part for the cross-entropy. However, our method leverages both the visual and response parts of output['logits']. (If you examine the dimensions of the LLM's input and output, you will notice that their lengths are consistent.)
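As a minimal sketch (not our exact training code), the label mask works roughly like this: the prompt and visual positions in the labels are set to an ignore index, so the cross-entropy only covers the response even though output['logits'] spans the whole sequence:

```python
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by F.cross_entropy

def response_only_ce(logits, labels):
    """logits: (B, L, vocab); labels: (B, L) with IGNORE_INDEX on prompt/visual positions."""
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from position t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```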

Regarding how to obtain the visual tokens from output['logits']: when constructing the input data for the LLM, the visual tokens are preceded by a prompt. By using the index of this prompt and the length of the visual tokens, we can locate the position of the visual tokens within the output.
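In code, a hypothetical slicing would look like the following (function and argument names are placeholders, not the repo's API):

```python
def split_logits(logits, image_start, num_visual_tokens):
    """logits: (B, L, vocab). The LLM's output length equals its input length,
    so the visual span can be recovered from the prompt index and the number of visual tokens."""
    prompt_part = logits[:, :image_start]
    visual_part = logits[:, image_start:image_start + num_visual_tokens]
    rest = logits[:, image_start + num_visual_tokens:]  # remaining prompt + response tokens
    return prompt_part, visual_part, rest
```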

youngwanLEE commented 2 weeks ago

@Fantasyele Thanks for your explanation.

I have further questions.

Although the LLM output (output['logits']) includes visual tokens, those visual tokens (the full set) are already produced by the visual projector. However, in Equation 3 the loss term is defined in an autoregressive manner, and the visual tokens are not response tokens in the LLM.

Could you let me know if my thoughts are wrong?

Thanks in advance.

Fantasyele commented 2 weeks ago

Hi~ In an MLLM, visual tokens are concatenated with text tokens after passing through the projector to form the input sequence (the query), which is then fed into the LLM. The LLM autoregressively predicts over this query, and the predictions in output['logits'] therefore include visual tokens and response tokens. During training, typically only the cross-entropy between the response tokens and the ground-truth labels is computed. In Equation 3, we use a KLD constraint to align the representations of the visual-token part between the teacher model and the student model.
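As a rough sketch of such a KLD term over the visual-token span (temperature and reduction are illustrative placeholders, not necessarily what Eq.(3) uses):

```python
import torch.nn.functional as F

def visual_kld(student_logits, teacher_logits, temperature=1.0):
    """Both tensors: (B, num_visual_tokens, vocab), already sliced to the visual-token span."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```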

youngwanLEE commented 2 weeks ago

@Fantasyele Thanks. I have further questions. If my understanding is wrong, please correct me.

In an MLLM, visual tokens are concatenated with text tokens after passing through the projector to form the input sequence (the query), which is then fed into the LLM

I agree with it.

The LLM autoregressively predicts over this query, and the predictions in output['logits'] therefore include visual tokens and response tokens.

This is still a confusing point. In Eq.(3), the likelihood of the next visual token is conditioned on all previous visual tokens. However, my confusion is that the visual tokens are already produced by the visual encoder and visual projector. So I think the visual tokens output by the LLM are processed visual tokens, not newly generated tokens like the response tokens. Under this assumption, an autoregressive-style loss objective for visual tokens does not make sense.

In addition, I agree with the attempt to compare the representations between the teacher and the student using the output tokens from the LLM, but it could be implemented by simply computing the difference over all visual tokens at once, since both models have the same visual token length.

Zhaoyi-Yan commented 2 weeks ago

@youngwanLEE Only the response tokens are used as input to the autoregressive loss; for the distillation loss, the vision tokens, prompt tokens, and response tokens are used separately. This makes sense, since the vision tokens output by the LLM are affected by the text tokens and can therefore be regarded as "context-guided vision tokens", which provide valuable guidance for distillation.
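For illustration only, reusing the hypothetical split_logits and visual_kld helpers sketched earlier in this thread, the per-segment distillation could be assembled like this (the weights are placeholders, not the paper's values):

```python
def distillation_loss(student_logits, teacher_logits, image_start, num_visual_tokens,
                      w_prompt=1.0, w_vis=1.0, w_resp=1.0):
    """Sums separate KLD terms over the prompt, vision, and response spans of the logits."""
    s_prompt, s_vis, s_resp = split_logits(student_logits, image_start, num_visual_tokens)
    t_prompt, t_vis, t_resp = split_logits(teacher_logits, image_start, num_visual_tokens)
    return (w_prompt * visual_kld(s_prompt, t_prompt)
            + w_vis * visual_kld(s_vis, t_vis)
            + w_resp * visual_kld(s_resp, t_resp))
```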

youngwanLEE commented 1 week ago

@Zhaoyi-Yan, I totally agree with your comment, but my point is that the KD loss for the visual tokens in Eq.(3) is defined in an autoregressive way.