Hi, it looks like you're confusing the contrastive_align_loss with the contrastive_loss. In our paper and published results, we do not use the contrastive loss (which is akin to an image-text matching loss from other vision+language pre-training papers). We left it in the code only for completeness: it is something we tried at some point, and we thought it might be useful to other users of our code base who want to experiment with it. For the two losses that we do use, see the following:
Contrastive align loss: this is computed between the predictions of the decoder and the embedded representations of the text, taken from the output of the cross encoder. Relevant lines in the code: https://github.com/ashkamath/mdetr/blob/fdee8c50d7bcf2ad09cc0d6b783a8333720e4048/models/mdetr.py#L81 , https://github.com/ashkamath/mdetr/blob/fdee8c50d7bcf2ad09cc0d6b783a8333720e4048/models/mdetr.py#L203, https://github.com/ashkamath/mdetr/blob/fdee8c50d7bcf2ad09cc0d6b783a8333720e4048/models/mdetr.py#L496
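To make the mechanics concrete, here is a minimal sketch of a box-token contrastive alignment loss in this spirit. It is simplified: the tensor names, the `pos_map` target, and the temperature value are illustrative assumptions, not the exact code at the lines above.

```python
import torch
import torch.nn.functional as F

def contrastive_align_sketch(query_embs, token_embs, pos_map, temperature=0.07):
    """Simplified sketch of a box-token contrastive alignment loss.

    query_embs: (num_queries, d)  decoder outputs projected to a shared space
    token_embs: (num_tokens, d)   cross-encoder text outputs, same projection
    pos_map:    (num_queries, num_tokens) bool; True where query i is
                annotated as referring to token j (illustrative name)
    """
    # Normalize so the dot product is a cosine similarity
    query_embs = F.normalize(query_embs, dim=-1)
    token_embs = F.normalize(token_embs, dim=-1)
    logits = query_embs @ token_embs.t() / temperature  # (queries, tokens)

    # Box-to-token term: each matched query should put its probability
    # mass on the tokens it is aligned with
    log_p = logits.log_softmax(dim=-1)
    loss_b2t = -(log_p * pos_map).sum(-1)[pos_map.any(-1)].mean()

    # Token-to-box term: symmetric, with the softmax over the query axis
    log_p_t = logits.log_softmax(dim=0)
    loss_t2b = -(log_p_t * pos_map).sum(0)[pos_map.any(0)].mean()

    return (loss_b2t + loss_t2b) / 2
```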
Contrastive alignment -> loss_contrastive_align, which we just discussed above. Soft token prediction is loss_labels: https://github.com/ashkamath/mdetr/blob/fdee8c50d7bcf2ad09cc0d6b783a8333720e4048/models/mdetr.py#L464
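Soft token prediction can be sketched the same way: instead of predicting an object class, each query predicts a distribution over token positions in the sentence (plus a "no object" slot), trained against a soft target map. Again, the names and the `eos_coef` down-weighting here are illustrative assumptions rather than the exact repo code.

```python
import torch
import torch.nn.functional as F

def soft_token_prediction_sketch(pred_logits, target_map, eos_coef=0.1):
    """Simplified sketch of soft token prediction (loss_labels).

    pred_logits: (num_queries, max_text_len + 1) per-query logits over token
                 positions; the last slot stands for "no object"
    target_map:  (num_queries, max_text_len + 1) soft targets, e.g. uniform
                 mass over the span of tokens a matched box refers to, and
                 all mass on the last slot for unmatched queries
    """
    log_p = pred_logits.log_softmax(dim=-1)
    # Cross entropy against a soft target rather than a single class index
    per_query = -(target_map * log_p).sum(-1)
    # Down-weight queries assigned to "no object", as in DETR-style losses
    weights = torch.where(target_map[:, -1] > 0.5, eos_coef, 1.0)
    return (per_query * weights).sum() / weights.sum()
```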
Hope this makes it clearer! :)
Hello,
This is in relation to the losses described in the paper and implemented in the codebase. I need your help understanding the following:
"text_pooled_op": encoded_text.pooler_output if self.CLS is not None else None,
"img_pooled_op": img_memory[0] if self.CLS is not None else None, # Return the CLS token
This essentially means that the embedded representation of the text is derived from the classification token of the BERT-based text backbone encoder, while the embedded representation of the image is derived from the output of the transformer encoder. Is this genuinely a discrepancy? If not, could you kindly point me to the snippet for these loss calculations where you tap into the decoder output?
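For reference, my reading is that these pooled outputs would feed a sentence-level, CLIP-style contrastive loss roughly like the sketch below (the function name and the temperature are mine, not from the repo), which involves no decoder output at all:

```python
import torch
import torch.nn.functional as F

def pooled_contrastive_sketch(text_pooled_op, img_pooled_op, temperature=0.07):
    """Rough sketch of an image-text matching style contrastive loss over
    pooled embeddings: whole sentence vs. whole image, no decoder involved.

    text_pooled_op: (batch, d) pooled text embedding, e.g. the BERT [CLS]
    img_pooled_op:  (batch, d) pooled image embedding, e.g. the encoder CLS
    """
    t = F.normalize(text_pooled_op, dim=-1)
    v = F.normalize(img_pooled_op, dim=-1)
    logits = t @ v.t() / temperature                     # (batch, batch)
    targets = torch.arange(t.size(0), device=t.device)   # matches on diagonal
    # Symmetric InfoNCE over both retrieval directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```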
Thank you.