hustzyj opened 2 months ago
Hi, I want to know how to calculate the metrics for the emotion transfer task in this paper, especially CLIP-img and CLIP-txt. I couldn't find the calculation method in the related literature.

Apologies for the oversight; we did not describe this in detail in the text. CLIP-txt uses the CLIP text encoder and image encoder to encode the input emotionless prompt (e.g., "skirt") and the generated emotion image, respectively, and then computes the cosine similarity between the two features. CLIP-img, on the other hand, computes the cosine similarity between the features of the images the model generates from the emotionless prompt ("skirt") and from the emotional prompt ("
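For reference, a minimal sketch of the two metrics as described in this thread. It assumes the CLIP feature vectors have already been extracted (e.g., via `encode_text` / `encode_image` in open_clip, or `get_text_features` / `get_image_features` in Hugging Face `transformers`); the helper names below are illustrative and not from the paper's code.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D feature vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def clip_txt(text_feat: np.ndarray, image_feat: np.ndarray) -> float:
    # CLIP-txt: similarity between the CLIP text embedding of the
    # emotionless prompt (e.g., "skirt") and the CLIP image embedding
    # of the generated emotion image.
    return cosine_sim(text_feat, image_feat)

def clip_img(neutral_img_feat: np.ndarray, emotion_img_feat: np.ndarray) -> float:
    # CLIP-img: similarity between the CLIP image embeddings of the
    # image generated from the emotionless prompt and the image
    # generated from the emotional prompt.
    return cosine_sim(neutral_img_feat, emotion_img_feat)
```

In practice, each score would presumably be averaged over all prompt/image pairs in the evaluation set.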