Thank you very much for releasing the code.
I have some questions about the Text Decoder and the Text Query.
You mention in the paper that the Text Decoder is implemented with a Q-Former, but as far as I know, the Q-Former is used to encode image features and to align images with text.
You also state in the paper: "In this way, the text query Qt contains highlighted visual cues that are most related to the user instruction."
My questions: are the features extracted by your proposed Text Query the same as those produced by the original Q-Former when it is conditioned on text instructions? Also, could you provide the code needed to reproduce Figure 6 (the high-response areas with top scores for the input question in Equation 1)?
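For reference, here is my understanding of the standard Q-Former's role as a minimal NumPy sketch: a set of learnable queries cross-attends to image patch features, so each query pools the visual evidence it is most related to. All names and shapes here are illustrative, not taken from your code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, image_feats):
    """Queries attend over image patch features.

    queries:     (num_queries, d)  learnable query embeddings
    image_feats: (num_patches, d)  visual encoder outputs
    Returns pooled features (num_queries, d) and the
    attention map (num_queries, num_patches).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ image_feats.T / np.sqrt(d))
    return attn @ image_feats, attn

rng = np.random.default_rng(0)
queries = rng.standard_normal((32, 64))      # hypothetical learnable queries
image_feats = rng.standard_normal((196, 64)) # e.g. 14x14 ViT patch features
pooled, attn = cross_attention(queries, image_feats)
print(pooled.shape, attn.shape)  # (32, 64) (32, 196)
```

In this picture the queries are fixed learned parameters, so my question is whether your Text Query instead derives (or modulates) these queries from the user instruction.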
Looking forward to your reply! Thanks.