dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

About text encoder #51

Closed. liziming5353 closed this issue 4 months ago

liziming5353 commented 5 months ago

May I ask what changes were made to the text encoder BertLMHeadModel you are using, compared to the original BERT? I cannot fully understand the code, and the paper does not give a detailed description.

yanwei-li commented 5 months ago

Hi, we use the BertLMHeadModel from BLIP-2. The main change is support for the pre-defined query tokens in BERT's self-attention and cross-attention, like this.
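For readers trying to map this onto the code, below is a minimal PyTorch sketch of that mechanism, not the actual BLIP-2 or LLaMA-VID implementation: the module name, dimensions, and the single simplified block are assumptions. Roughly as in BLIP-2's Q-Former, learnable query tokens share self-attention with the text tokens, and only the queries cross-attend to the image features.

```python
import torch
import torch.nn as nn

class QueryTextBlock(nn.Module):
    """Sketch of one BERT-style block extended the BLIP-2 way:
    learnable query tokens and text tokens share self-attention,
    and the queries additionally cross-attend to image features."""
    def __init__(self, dim=768, num_heads=12, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_embeds, image_feats):
        B = text_embeds.size(0)
        queries = self.queries.expand(B, -1, -1)
        # Self-attention over [queries; text].
        x = torch.cat([queries, text_embeds], dim=1)
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Only the query tokens cross-attend to the image features.
        q, t = x[:, :queries.size(1)], x[:, queries.size(1):]
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats,
                                           need_weights=False)[0])
        return torch.cat([q, t], dim=1)

# Toy usage: batch of 2, 16 text tokens, 256 image patch features of dim 768.
block = QueryTextBlock()
out = block(torch.randn(2, 16, 768), torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 48, 768])
```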

dragen1860 commented 5 months ago

@yanwei-li From my perspective, the Text Decoder in your paper should actually be called a Text Encoder, but I don't know why you call it a text decoder. Thank you.

yanwei-li commented 5 months ago

Hi, it is because we adopt BERT in a decoder manner: it takes both images and text as input, with the image features attended to through cross-attention.
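To illustrate the direction of that cross-attention (assumed shapes only, not the repository's code): the text/query hidden states act as the attention queries, while the image features serve purely as keys and values.

```python
import torch
import torch.nn.functional as F

text_states = torch.randn(2, 48, 768)   # [queries; text] hidden states
image_feats = torch.randn(2, 256, 768)  # features from the image encoder

# Single-head scaled dot-product cross-attention for clarity:
# text attends to the image, not the other way around.
attended = F.scaled_dot_product_attention(text_states, image_feats, image_feats)
print(attended.shape)  # torch.Size([2, 48, 768])
```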