liziming5353 closed this issue 4 months ago
Hi, we use the BertLMHeadModel from BLIP-2; it mainly adds support for the pre-defined queries in the self-attention and cross-attention of BERT, like this.
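For reference, here is a minimal, self-contained sketch (not the actual BLIP-2 code) of what that query mechanism looks like: learnable query tokens share self-attention with the text and cross-attend to the image features. Names such as `QueryBertLayer`, `query_tokens`, and `num_queries` are illustrative assumptions, not identifiers from the repo.

```python
# Illustrative sketch only: pre-defined (learnable) queries join the text in
# self-attention, and only the query positions cross-attend to image features.
import torch
import torch.nn as nn

class QueryBertLayer(nn.Module):
    def __init__(self, hidden=768, heads=12, num_queries=32):
        super().__init__()
        # Pre-defined learnable queries (hypothetical name)
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden))
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, text_embeds, image_feats):
        b = text_embeds.size(0)
        queries = self.query_tokens.expand(b, -1, -1)
        # Self-attention over the concatenated [queries; text] sequence
        x = torch.cat([queries, text_embeds], dim=1)
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        n_q = queries.size(1)
        # Only the query positions cross-attend to the image features
        q = x[:, :n_q]
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        return torch.cat([q, x[:, n_q:]], dim=1)

# Usage: 32 queries, a 16-token text, 257 image patch features
layer = QueryBertLayer()
out = layer(torch.randn(2, 16, 768), torch.randn(2, 257, 768))
print(out.shape)  # torch.Size([2, 48, 768])
```

In BLIP-2's actual Q-Former this pattern runs across the BERT layers, with cross-attention inserted only at intervals rather than in every block.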
@yanwei-li From my perspective, the Text Decoder in your paper should actually be called a Text Encoder, but I don't understand why you call it a text decoder. Thank you.
Hi, because we adopt BERT in a decoder manner: it takes both the image and the text as input, with the image features consumed through cross-attention.
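To make the "decoder manner" concrete, here is a hedged sketch using the stock HuggingFace `BertLMHeadModel` (not the modified version from this repo): `is_decoder=True` gives causal, left-to-right self-attention over the text, and `add_cross_attention=True` lets the text attend to image features passed as `encoder_hidden_states`. The image features below are random stand-ins.

```python
# Sketch of BERT used as a decoder with cross-attention to visual features.
import torch
from transformers import BertConfig, BertLMHeadModel

config = BertConfig(
    is_decoder=True,            # causal self-attention over the text
    add_cross_attention=True,   # cross-attention over encoder states
)
model = BertLMHeadModel(config)  # randomly initialized, for illustration

input_ids = torch.randint(0, config.vocab_size, (1, 8))  # text tokens
image_feats = torch.randn(1, 32, config.hidden_size)     # stand-in visual features

out = model(
    input_ids=input_ids,
    encoder_hidden_states=image_feats,  # consumed by cross-attention
    labels=input_ids,                   # shifted internally for the LM loss
)
print(out.loss, out.logits.shape)  # logits: (1, 8, vocab_size)
```

So although the backbone is BERT, the causal masking plus the language-modeling head is what makes it a decoder rather than an encoder.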
May I ask what changes have been made to the BertLMHeadModel text encoder you are using, compared to the original BERT? I cannot fully understand the code, and there is no detailed introduction in the paper.