dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Questions about Text Decoder and Text Query #80

Open xiaokj37 opened 3 months ago

xiaokj37 commented 3 months ago

Thank you very much for providing the code. I have some questions about the Text Decoder and the Text Query. The paper says the Text Decoder is implemented with Q-Former, but as far as I know, Q-Former is used to encode image features and to align images with text. The paper also states: “In this way, the text query Qt contains highlighted visual cues that are most related to the user instruction.” Are the features extracted by your proposed Text Query the same as those extracted by the original Q-Former when it is given text instructions? Also, could you provide the code to reproduce Figure 6 (high-response areas with top scores to the input question in Equation 1)? Looking forward to your reply. Thanks!
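
To make the Figure 6 request concrete, here is a rough sketch of what I imagine the visualization computes. It is only my guess, not the authors' code: I am assuming Equation 1 scores each visual token with a softmax over the dot products between the text query Qt and the visual embeddings, and that averaging those weights over the queries gives one response value per image patch. The names `response_map` and `top_k_patches`, the toy inputs, and the 2x2 grid are all hypothetical.

```python
import math

def response_map(text_query, visual_tokens, grid_hw):
    """Average softmax attention of the text queries over the visual tokens.

    text_query:    list of M query vectors (assumed stand-in for Qt)
    visual_tokens: list of N patch vectors, N = h * w (assumed stand-in for Xt)
    Returns an h x w grid of scores that sums to 1.
    """
    h, w = grid_hw
    assert len(visual_tokens) == h * w, "token count must match the patch grid"
    scores = [0.0] * len(visual_tokens)
    for q in text_query:
        # Dot product between this query and every visual token.
        logits = [sum(a * b for a, b in zip(q, x)) for x in visual_tokens]
        # Numerically stable softmax over the visual tokens.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        for i, e in enumerate(exps):
            scores[i] += e / total / len(text_query)
    # Reshape the flat scores into rows of the patch grid.
    return [scores[r * w:(r + 1) * w] for r in range(h)]

def top_k_patches(score_grid, k):
    """Return (row, col) positions of the k highest-scoring patches."""
    flat = [(s, r, c) for r, row in enumerate(score_grid) for c, s in enumerate(row)]
    flat.sort(reverse=True)
    return [(r, c) for _, r, c in flat[:k]]

# Toy example: one 2-d query and four patches on a 2x2 grid.
grid = response_map([[1.0, 0.0]],
                    [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [1.0, 1.0]],
                    (2, 2))
print(top_k_patches(grid, 2))  # [(0, 1), (1, 1)]
```

If the actual figure is produced differently (for example from a specific Q-Former layer, or with a different normalization), it would help a lot to know which tensors in the repo correspond to the query and visual embeddings above.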