Closed xushilin1 closed 3 months ago
Thanks for your question.
During testing, we rely on the LLM to interpret the users' input prompts and output the appropriate routing tokens when needed. That's why we construct instruction templates for different tasks and finetune the LLM, as specified in Sec. 3.2 (1) and Appendix E.
Here is an example for detection. USER: Where can we locate the dog in the image? ASSISTANT: The detection results for dog [DET] are presented.
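A minimal sketch of how such instruction templates could be constructed (the template strings and task keys here are illustrative placeholders, not the exact ones from Appendix E):

```python
# Hypothetical instruction templates pairing a user prompt with an
# assistant response that carries the routing token for that task.
# The actual wording used for finetuning is listed in Appendix E.
TEMPLATES = {
    "det": ("Where can we locate the {cls} in the image?",
            "The detection results for {cls} [DET] are presented."),
    "seg": ("Please segment the {cls} in the image.",
            "The segmentation results for {cls} [SEG] are presented."),
}

def build_conversation(task: str, cls: str) -> str:
    """Fill a task template with a class name, yielding one training turn."""
    user, assistant = TEMPLATES[task]
    return f"USER: {user.format(cls=cls)} ASSISTANT: {assistant.format(cls=cls)}"
```

Finetuning on such pairs teaches the LLM to emit the routing token itself at test time, even though the user's prompt never contains it.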
During training, you input [DET] and the corresponding super-link queries $Q_{link}$ into the LLM to obtain $H_{link}$, which is then sent to the downstream decoder.
During testing, the input prompt does not include [DET] or $Q_{link}$, so how can you get $H_{link}$?
Is it correct that during training the downstream decoders receive $H_{link}$, while during testing they receive $Q_{link}$?
Is there any inconsistency in the input of downstream decoders during training and testing?
During testing, the LLM will output [DET], and we immediately append $Q_{link}$ after it. Then, in the current generation step, the input_embeds expand from [1, C] to [1 + num_embeds, C], so we can still obtain the last-layer hidden states $H_{link}$ during testing.
This part of the code handles the super-link queries and works for both training and testing: https://github.com/OpenGVLab/VisionLLM/blob/34a8144829c30361adccc15e6b2a25c80a62d1fd/VisionLLMv2/visionllmv2/model/modeling_visionllmv2.py#L421
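The expansion step can be sketched roughly as follows (a pure-Python stand-in; the function name, shapes-as-lists, and embedding values are illustrative, while the real logic operates on tensors inside the linked generation code):

```python
C = 4           # toy hidden size
NUM_EMBEDS = 4  # toy number of super-link queries appended after [DET]

def step_inputs(new_token, token_embed, q_link):
    """Return the input embeddings for the current generation step.

    Normally this is just the embedding of the newly generated token,
    shape [1, C]. When the token is a routing token like [DET], the
    learnable super-link queries Q_link are appended, so the step input
    becomes [1 + NUM_EMBEDS, C]; the LLM's last-layer hidden states at
    those query positions then serve as H_link for the decoder.
    """
    embeds = [list(token_embed)]   # [1, C]
    if new_token == "[DET]":
        embeds.extend(q_link)      # -> [1 + NUM_EMBEDS, C]
    return embeds

q_link = [[0.0] * C for _ in range(NUM_EMBEDS)]
assert len(step_inputs("dog", [0.1] * C, q_link)) == 1
assert len(step_inputs("[DET]", [0.1] * C, q_link)) == 1 + NUM_EMBEDS
```

This is why no $Q_{link}$ needs to appear in the user's prompt: the queries are injected by the model itself the moment a routing token is generated.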
Does this mean that during training the earlier super-link embeddings do not attend to later ones because of the causal attention mask, but during testing the different super-link embeddings attend to each other, since a single forward pass is used to get all their hidden states?
During both training and testing, the LLM always uses the causal mask, so the super-link queries attend only to earlier positions in either case.
As mentioned in your paper, the super-link queries are automatically added after the input embedding of the routing token. However, during testing, users' input prompts do not include any routing token. How can you send the super-link queries to the MLLM and obtain the corresponding hidden states $H_{link}$?