Hi, thanks for your attention. During the training of the SEED Tokenizer, the Q-Former extracts text features and image features at the same time, as in BLIP-2; we do not use an additional text encoder. We use ViT-g from EVA-CLIP as the image encoder. For the image feature, we use the last of the causal queries for contrastive learning.
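A minimal sketch of that objective, for anyone following along: the last causal query acts as the global image feature, both sides are projected to a shared dimension, normalized, and trained with a symmetric InfoNCE loss (BLIP-2 / CLIP style). The function and argument names (`image_proj`, `text_proj`, shapes, temperature) are illustrative assumptions, not the exact SEED code.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(causal_queries, text_cls,
                                image_proj, text_proj, temperature=0.07):
    """Illustrative image-text contrastive objective (not the exact SEED code).

    causal_queries: (B, N, D) causal query outputs from the Q-Former
    text_cls:       (B, D)    text feature from the Q-Former text branch
    image_proj / text_proj:   linear heads mapping both sides to a shared dim
    """
    # Use only the LAST causal query as the global image feature.
    img_feat = image_proj(causal_queries[:, -1, :])   # (B, d)
    txt_feat = text_proj(text_cls)                    # (B, d)

    # L2-normalize so the dot product becomes a cosine similarity.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)

    # Symmetric InfoNCE over the batch.
    logits = img_feat @ txt_feat.t() / temperature    # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

The projection heads also answer the dimension question below: the causal embedding and text embedding need not have the same native dimension, since each is mapped to the shared contrastive space.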
Hi, thank you for your comment. I have another question. In stage II, how is the reconstruction of the causal embeddings conducted? I thought it maximizes the cosine similarity between the original causal embedding from the Q-Former and the reconstructed embedding after the multi-layer transformer. Is that right?
Yes, you can use an L2 loss or maximize cosine similarity. When using cosine similarity, please note that the features should be normalized first.
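A small sketch of the two options mentioned above, with the normalization caveat for the cosine variant. The tensor names and shapes are assumptions for illustration, not the actual SEED implementation.

```python
import torch.nn.functional as F

def reconstruction_loss(causal_embed, recon_embed, mode="cosine"):
    """Illustrative stage-II reconstruction objective (not the exact SEED code).

    causal_embed: (B, N, D) original causal embeddings from the Q-Former
    recon_embed:  (B, N, D) embeddings reconstructed by the multi-layer transformer
    """
    if mode == "l2":
        # Plain L2 (MSE) between original and reconstructed embeddings.
        return F.mse_loss(recon_embed, causal_embed)
    # Cosine variant: normalize both sides first, then maximize similarity
    # by minimizing (1 - cos).
    a = F.normalize(causal_embed, dim=-1)
    b = F.normalize(recon_embed, dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```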
Great work, and thank you for sharing your research results.
I'd like to know which text encoder you used during training. Did you use OpenCLIP ViT-H/14 as both the text encoder and the image encoder?
I would also like to know more details about the contrastive learning. How is contrastive learning applied to the causal embeddings and the text embeddings? Are their dimensions equal?
Thank you.