I was wondering why you expect two outputs when calling word_features, sentence_features = self.clip_model.encode_text(clip_tokens) here.
As far as I understand, you are using a vanilla CLIP model, whose clip_model.encode_text() returns only one embedding.
Evidently that can't be the case, since you unpack two different embeddings. So where did you implement the custom functionality that makes encode_text() return two embeddings?
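
To make the question concrete, here is a toy sketch of the difference I mean. It is not the real CLIP code and not your implementation: the helper names (toy_token_features, eot_index, encode_text_single, encode_text_pair) and the fake features are my own assumptions, and it uses plain Python lists instead of tensors. The only thing it borrows from CLIP, as far as I understand it, is that the sentence embedding is read at the end-of-text token, which has the highest token id in the sequence.

```python
# Toy stand-in for a text encoder. NOT the real CLIP implementation;
# every function name here is hypothetical and only illustrates the idea.

def toy_token_features(token_ids, dim=4):
    """Return one fake feature vector per token (seq_len x dim)."""
    return [[float(tok + d) for d in range(dim)] for tok in token_ids]


def eot_index(token_ids):
    """Position of the largest token id (assumed to be the end-of-text token)."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])


def encode_text_single(batch):
    """Vanilla-style behaviour: one pooled vector per input sequence."""
    return [toy_token_features(ids)[eot_index(ids)] for ids in batch]


def encode_text_pair(batch):
    """Modified behaviour: per-token features AND the pooled vector."""
    word_features = [toy_token_features(ids) for ids in batch]
    sentence_features = [
        feats[eot_index(ids)] for feats, ids in zip(word_features, batch)
    ]
    return word_features, sentence_features


# Illustrative ids: start token, two word tokens, end-of-text token, padding.
clip_tokens = [[49406, 320, 1929, 49407, 0, 0]]

word_features, sentence_features = encode_text_pair(clip_tokens)
print(len(word_features[0]), len(sentence_features[0]))
```

With encode_text_single, the same two-name unpacking would instead iterate over the batch and fail unless the batch happened to contain exactly two sequences, which is why I assume encode_text() was changed somewhere.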