jokingww closed this issue 2 years ago
For this setting, we directly feed the last multi-modal feature maps into the classifier and compute the cross-entropy loss.
What about when the contrastive loss is used? Are both losses applied together? The provided code appears to use only the cross-entropy loss.
I have the same question. It would help if Mr. Derrick could point out where the contrastive loss is computed in the code.
In the paper's ablation study, one variant removes the text-to-pixel contrastive learning. I wonder which loss function replaces the text-to-pixel contrastive loss in that case.
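To make the two settings in this thread easier to compare, here is a minimal sketch in plain Python. It is not taken from the repository. It assumes that the text-to-pixel contrastive loss is a per-pixel binary cross entropy on the sigmoid of the dot product between a text feature and each pixel feature, and that the ablation setting replaces the text feature with a learned per-pixel linear classifier. The names `classifier_ce_loss`, `text_to_pixel_loss`, `weight`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, label, eps=1e-12):
    # Binary cross entropy for a single pixel; clamp to avoid log(0).
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classifier_ce_loss(pixel_feats, mask, weight, bias):
    # Assumed ablation setting: a learned linear classifier (hypothetical
    # weight and bias) scores each multi-modal pixel feature directly.
    losses = [bce(sigmoid(dot(weight, f) + bias), y)
              for f, y in zip(pixel_feats, mask)]
    return sum(losses) / len(losses)


def text_to_pixel_loss(pixel_feats, mask, text_feat):
    # Assumed contrastive setting: each pixel is scored by its similarity
    # to the text feature, then pulled toward 1 inside the mask and 0 outside.
    losses = [bce(sigmoid(dot(text_feat, f)), y)
              for f, y in zip(pixel_feats, mask)]
    return sum(losses) / len(losses)


# Toy input: two pixels with 2-D features; the first pixel is foreground.
pixel_feats = [[2.0, 0.0], [-2.0, 0.0]]
mask = [1, 0]
text_feat = [1.0, 0.0]

contrastive = text_to_pixel_loss(pixel_feats, mask, text_feat)
ablation = classifier_ce_loss(pixel_feats, mask, weight=[1.0, 0.0], bias=0.0)
```

Under these assumptions both settings end in the same cross-entropy form and differ only in what scores each pixel: a text-derived vector or a fixed classifier. That could explain why only a cross-entropy call is visible in the code, but only the author can confirm how the repository implements it.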