OpenGVLab / VisionLLM

VisionLLM Series
https://arxiv.org/abs/2305.11175
Apache License 2.0

How to get the last-layer hidden states $H_{link}$ during testing? #11

Closed xushilin1 closed 3 months ago

xushilin1 commented 3 months ago

As mentioned in your paper, the Super-Link Queries are automatically added after the input embeddings of the routing token. However, during testing, users' input prompts do not contain any routing token. How do you send the Super-Link Queries to the MLLM and obtain the corresponding hidden states $H_{link}$?

wjn922 commented 3 months ago

Thanks for your question.

During testing, we rely on the LLM to interpret the user's input prompt and output the appropriate routing tokens when needed. That is why we construct instruction templates for the different tasks and finetune the LLM, as specified in Sec. 3.2 (1) and Appendix E.

Here is an example for detection:

USER: Where can we locate the dog in the image?
ASSISTANT: The detection results for dog [DET] are presented.
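
For illustration, here is a minimal sketch of what such an instruction template and its prompt construction might look like. The template strings, the `DET_TEMPLATES` name, and `build_detection_prompt` are assumptions made up for this example; the actual templates are the ones listed in Appendix E.

```python
# Hypothetical detection templates for illustration only; the real instruction
# templates are defined in Appendix E of the paper.
DET_TEMPLATES = [
    {
        "user": "Where can we locate the {category} in the image?",
        "assistant": "The detection results for {category} [DET] are presented.",
    },
]

def build_detection_prompt(category: str) -> tuple[str, str]:
    """Fill one template so the finetuned LLM learns to emit the [DET] routing token."""
    t = DET_TEMPLATES[0]
    return t["user"].format(category=category), t["assistant"].format(category=category)

print(build_detection_prompt("dog"))
```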

xushilin1 commented 3 months ago

During training, you input the [DET] token and the corresponding super-link queries $Q_{link}$ into the LLM to obtain $H_{link}$, which is then sent to the downstream decoder.

During testing, the input prompt does not include [DET] or $Q_{link}$, so how can you get $H_{link}$?

Is it correct that during training the downstream decoders receive $H_{link}$, while during testing they receive $Q_{link}$?

Is there any inconsistency between the inputs to the downstream decoders during training and testing?

wjn922 commented 3 months ago

During testing, the LLM outputs [DET], and we immediately append $Q_{link}$ after it. Then, in the current generation step, the input_embeds expand from [1, C] to [1 + num_embeds, C], so we can still obtain the last-layer hidden states $H_{link}$ during testing.

This is the part of the code that handles the super-link queries; it works for both training and testing: https://github.com/OpenGVLab/VisionLLM/blob/34a8144829c30361adccc15e6b2a25c80a62d1fd/VisionLLMv2/visionllmv2/model/modeling_visionllmv2.py#L421
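
As a conceptual sketch of the mechanism described above (not the repository's actual implementation, which is at the link; `q_link`, `embed_tokens`, the HuggingFace-style `output_hidden_states` call, and the omission of the KV cache are all simplifying assumptions):

```python
import torch

def step_with_super_link(llm, embed_tokens, next_token_id, det_token_id, q_link):
    """One generation step. If the token just produced is [DET], the super-link
    queries are appended so their last-layer hidden states H_link can be read out."""
    # Embed the newly generated token (a 0-dim id tensor): [1, C]
    tok_embed = embed_tokens(next_token_id).unsqueeze(0)
    if next_token_id.item() == det_token_id:
        # Append the learnable super-link queries Q_link ([num_embeds, C]) right
        # after [DET]; the step input grows from [1, C] to [1 + num_embeds, C].
        tok_embed = torch.cat([tok_embed, q_link], dim=0)
    # A real generation loop would also pass past_key_values; omitted here.
    out = llm(inputs_embeds=tok_embed.unsqueeze(0), output_hidden_states=True)
    last_hidden = out.hidden_states[-1][0]           # [1 + num_embeds, C] if [DET] fired
    h_link = last_hidden[1:] if last_hidden.size(0) > 1 else None
    return h_link                                    # sent to the downstream decoder when present
```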

pangzss commented 3 months ago

During testing, the LLM outputs [DET], and we immediately append $Q_{link}$ after it. Then, in the current generation step, the input_embeds expand from [1, C] to [1 + num_embeds, C], so we can still obtain the last-layer hidden states $H_{link}$ during testing.

This is the part of the code that handles the super-link queries; it works for both training and testing:

https://github.com/OpenGVLab/VisionLLM/blob/34a8144829c30361adccc15e6b2a25c80a62d1fd/VisionLLMv2/visionllmv2/model/modeling_visionllmv2.py#L421

Does this mean that during training, the earlier super-link embeddings do not attend to later ones due to the causal attention mask, whereas during testing the different super-link embeddings can attend to each other because a single forward pass is used to obtain all of their hidden states?

wjn922 commented 3 months ago

During both training and testing, the LLM always uses the causal mask.
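
To illustrate what that implies for the appended super-link queries (a toy example under my own assumptions about positions, not project code): with a causal mask, each query can attend to the prompt, [DET], and the queries before it, but never to later queries, in training and testing alike.

```python
import torch

# Toy example (assumed positions): prompt + [DET] occupy the first 5 positions,
# followed by 4 super-link queries.
prompt_len, num_embeds = 5, 4
seq_len = prompt_len + num_embeds
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # True = may attend

first_q = prompt_len                                   # index of the first super-link query
print(bool(causal[first_q, first_q + 1:].any()))       # False: it cannot see later queries
print(bool(causal[first_q + 3, :first_q + 4].all()))   # True: the last query sees everything before it
```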