AILab-CVC / SEED

Official implementation of SEED-LLaMA (ICLR 2024).
https://ailab-cvc.github.io/seed

What is frozen text/image encoder? #13

Closed zheedong closed 4 months ago

zheedong commented 6 months ago

Great work. Thank you for your research results.

I'd like to know which text encoder you used in the training process. Did you use OpenCLIP ViT-H/14 as both the text encoder and the image encoder?

I would also like to know more details about the contrastive learning. How is contrastive learning applied to the causal embeddings and the text embeddings? Are their dimensions equal?

Thank you.

sijeh commented 6 months ago

Hi, thanks for your attention. During training of the SEED tokenizer, the Q-Former extracts text features and image features at the same time, as in BLIP-2, so we do not use an additional text encoder. We use ViT-g from EVA-CLIP as the image encoder. For the image features, we use the last of the causal queries for contrastive learning.
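
For illustration, here is a minimal sketch of how the last causal query could enter a BLIP-2-style image-text contrastive loss. The function name, tensor shapes, and temperature are assumptions for the example, not the actual SEED code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(causal_queries, text_feats, temperature=0.07):
    """Hypothetical sketch, not the official implementation.

    causal_queries: (batch, num_queries, dim) causal embeddings from the Q-Former
    text_feats:     (batch, dim) text features from the Q-Former text branch
    """
    # Use only the LAST causal query as the image feature, as described above.
    image_feats = causal_queries[:, -1, :]               # (batch, dim)

    # Normalize both sides so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Similarity logits: each image against every text in the batch.
    logits = image_feats @ text_feats.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because only the last causal query is used, the image feature naturally matches the text feature's dimension, which answers the dimension question above.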

zheedong commented 6 months ago

Hi, thank you for your comment. I have another question: in Stage II, how is the reconstruction of the causal embeddings conducted? I thought it maximizes the cosine similarity between the original causal embeddings from the Q-Former and the reconstructed embeddings after the multi-layer transformer. Is that right?

sijeh commented 5 months ago

Yes, you can use an L2 loss or maximize cosine similarity. When using cosine similarity, please note that the features should be normalized first.
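
A minimal sketch of the two options described above, assuming the embeddings are plain PyTorch tensors; the function name and shapes are hypothetical, not from the SEED codebase:

```python
import torch.nn.functional as F

def reconstruction_loss(original, reconstructed, use_cosine=False):
    """Hypothetical sketch of the two loss options; not the official code.

    original:      (batch, num_queries, dim) causal embeddings from the Q-Former
    reconstructed: (batch, num_queries, dim) output of the multi-layer transformer
    """
    if use_cosine:
        # Normalize first, as noted above, then maximize cosine
        # similarity by minimizing its negative.
        original = F.normalize(original, dim=-1)
        reconstructed = F.normalize(reconstructed, dim=-1)
        return -(original * reconstructed).sum(dim=-1).mean()
    # Plain L2 (MSE) loss on the unnormalized embeddings.
    return F.mse_loss(reconstructed, original)
```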