Hi, I have a question about how Stage I training is conducted.
In the paper, you say 'We use contrastive loss to maximize the similarity between the final causal embeddings and text features of the corresponding caption.' What do you mean by the 'final' causal embedding? Does it mean the last token of the causal embeddings (dimension [batch, 1, 768]), or all causal embeddings from the last layer (dimension [batch, 32, 768])?
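To make the two readings concrete, here is a minimal sketch of what I mean, assuming a CLIP-style symmetric InfoNCE contrastive loss (the shapes, mean-pooling, and temperature are my assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes from my question: 32 causal query tokens, 768-dim features.
batch, num_queries, dim = 4, 32, 768
causal_embeds = torch.randn(batch, num_queries, dim)  # [batch, 32, 768]
text_feats = torch.randn(batch, dim)                  # one text feature per caption

# Reading A: "final" = last causal token only -> [batch, 768]
image_feats_a = causal_embeds[:, -1, :]

# Reading B: "final" = all 32 tokens of the last layer, pooled into
# a single vector (mean-pooling assumed here) -> [batch, 768]
image_feats_b = causal_embeds.mean(dim=1)

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives (CLIP-style, assumed)."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # [batch, batch]
    labels = torch.arange(image_feats.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

print(contrastive_loss(image_feats_a, text_feats).item())  # reading A
print(contrastive_loss(image_feats_b, text_feats).item())  # reading B
```

Which of these two (or some other pooling) matches your Stage I setup?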