Closed: ENDGU closed this issue 7 months ago
Thank you very much for your patient reply, which has helped me a lot!
Regarding text processing, your Figure 3 shows that the text input includes [CLS] and [MASK] special tokens, which do not seem to be part of the CLIP model's text processing. Are you using the BERT tokenizer to embed the text?
'[CLS] a photo of human [MASK] computer' represents the template triplet. To learn the predicate representation, we use the [MASK] special token to stand for unseen predicates; it is replaced with the corresponding predicate text. [CLS] represents the CLIP model's pooled output and does not need to appear in the input. So an example input would be 'a photo of human manipulation computer'.
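To illustrate the answer above (a hypothetical sketch, not the paper's code): the [MASK] slot is filled with the predicate text before the string reaches the tokenizer, and [CLS] never appears in the raw input since it corresponds to the model's pooled output.

```python
# Hypothetical template-filling sketch. The template string and function
# name are illustrative; only the filled example matches the thread above.
TEMPLATE = "a photo of {subject} {predicate} {object}"

def fill_template(subject, predicate, obj):
    """Replace the [MASK] slot with the predicate text; no special tokens remain."""
    return TEMPLATE.format(subject=subject, predicate=predicate, object=obj)

print(fill_template("human", "manipulation", "computer"))
# a photo of human manipulation computer
```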
The following text mentions that the predicate uses a GloVe vector for embedding. Does that mean only the predicate in the input sentence uses a GloVe vector, or the entire sentence?
It seems to use the CLIP tokenizer for the embedding. (Is that what you're asking?)
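For context on what the CLIP tokenizer produces: it pads/truncates every prompt to a fixed context length of 77 token ids, which is where the 77 in the [batch_size, 77, 512] text embedding (discussed later in this thread) comes from. A toy illustration (this is not CLIP's real BPE tokenizer or vocabulary):

```python
# Toy stand-in for CLIP-style tokenization: map words to ids, then pad
# with 0 up to CLIP's fixed context length of 77. The vocabulary here
# is made up purely for illustration.
CONTEXT_LENGTH = 77

def toy_tokenize(text, vocab):
    """Map whitespace-split words to ids and pad to the fixed context length."""
    ids = [vocab[w] for w in text.split()]
    return ids + [0] * (CONTEXT_LENGTH - len(ids))

vocab = {"a": 1, "photo": 2, "of": 3, "human": 4, "manipulation": 5, "computer": 6}
ids = toy_tokenize("a photo of human manipulation computer", vocab)
print(len(ids))  # 77
```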
My confusion is about how the GloVe vector mentioned here is used. This paragraph comes from Appendix F of the paper.
We have already updated several confusing passages in the newest version (v2, from Aug).
Thank you very much!!!!!
The CLIP text embedding that I previously learned about seems to be [batch_size, 77, 512], and the image embedding [batch_size, 50, 512]. What do the lengths 4 and 2 in the figure mean?
The embedding is shaped as [batch_size, seq_length, embedding_size], where the lengths 4 and 2 are incorporated into the seq_length dimension.
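A shape-only sketch of that answer (the tensors here are random stand-ins, not real CLIP outputs, and the length-4 and length-2 pieces are hypothetical extra token sequences):

```python
import numpy as np

# Stand-in tensors with the shapes discussed above: a [batch, 77, 512]
# text embedding plus hypothetical length-4 and length-2 pieces, all
# concatenated along the seq_length dimension (axis 1).
batch_size, embed_dim = 8, 512
rng = np.random.default_rng(0)

text_emb = rng.standard_normal((batch_size, 77, embed_dim))
extra_4 = rng.standard_normal((batch_size, 4, embed_dim))  # hypothetical length-4 piece
extra_2 = rng.standard_normal((batch_size, 2, embed_dim))  # hypothetical length-2 piece

combined = np.concatenate([text_emb, extra_4, extra_2], axis=1)
print(combined.shape)  # (8, 83, 512) -- 77 + 4 + 2 along seq_length
```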
Thank you, I'll try again.
Hello, sorry to bother you again. I am very interested in the EPIC module section of your paper and have been trying to reproduce it recently, but I encountered some difficulties. In the image input section, you used three images in Figure 3, representing the union region, subject, and object. May I ask: 1. Should the corresponding images be pre-cropped, and how are they obtained? 2. Assuming the images have been cropped, does the CLIP model process the three images and then concatenate the embeddings, or are they simply added together?
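For reference, the two combination strategies in question 2 can be sketched as follows. This is purely illustrative: `encode` is a random stand-in for CLIP's image encoder (which yields [50, 512] per image for ViT-B/32), and the crops would come from the union/subject/object bounding boxes; nothing here claims to be the paper's actual choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    """Stand-in for CLIP's image encoder: returns a [50, 512] embedding."""
    return rng.standard_normal((50, 512))

# The three crops (union region, subject box, object box) would each be
# obtained by slicing the image with its bounding box, e.g. img[y1:y2, x1:x2].
union_emb = encode(None)  # union-region crop
sub_emb = encode(None)    # subject crop
obj_emb = encode(None)    # object crop

# Option 1: concatenate along the sequence dimension -> [150, 512]
cat_emb = np.concatenate([union_emb, sub_emb, obj_emb], axis=0)
# Option 2: element-wise sum, keeping the shape -> [50, 512]
sum_emb = union_emb + sub_emb + obj_emb
print(cat_emb.shape, sum_emb.shape)  # (150, 512) (50, 512)
```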