CrossmodalGroup / HREM

Learning Semantic Relationship among Instances for Image-Text Matching, CVPR, 2023
Apache License 2.0
89 stars 8 forks

Question about the loss function used in your code #10

Open TitleZ99 opened 1 month ago

TitleZ99 commented 1 month ago

First of all, I would like to express my respect and gratitude for accomplishing such inspiring work. In the paper, the loss function is described as a basic triplet loss, but the code uses a loss function that includes a distanceweightedminer from metric learning. I don't know much about this loss function; is it one you designed, or does it come from other work? I would very much like to learn more about it, but I haven't been able to find much information. Could you explain why this loss function is used and point me to the relevant papers? Thanks and salutations again; looking forward to your reply.

darkpromise98 commented 1 month ago

The distanceweightedminer comes from previous metric-learning work [1]. It samples negatives based on the similarity/distance of the negative pairs: the greater the similarity, the higher the probability of being selected. You can understand it as a mining strategy for negative samples.

Compared with the original triplet loss (which just selects the hardest negative sample), the distanceweightedminer selects soft negative samples and brings more robust training.

[1] Sampling Matters in Deep Embedding Learning, ICCV, 2017
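Roughly speaking, the sampling step looks like the sketch below. This is PyTorch pseudocode written just for this explanation, following the description in [1]; the function name, interface, and hyper-parameters are illustrative assumptions, not the actual code in this repo.

```python
import torch

def distance_weighted_sampling(anchors, candidates, pos_mask, cutoff=0.5, num_neg=1):
    """Minimal sketch of distance-weighted negative sampling (Wu et al. [1]).

    anchors:    (N, D) L2-normalized embeddings
    candidates: (M, D) L2-normalized embeddings
    pos_mask:   (N, M) bool, True where (anchor, candidate) is a positive pair
    Returns:    (N, num_neg) indices of sampled negative candidates per anchor.
    """
    dim = anchors.size(1)
    # clamp small distances so the density term below does not blow up
    dist = torch.cdist(anchors, candidates).clamp(min=cutoff)

    # Pairwise-distance density on the unit hypersphere:
    #   q(d) ∝ d^(dim-2) * (1 - d^2/4)^((dim-3)/2)
    # Negatives are drawn with weight ∝ 1/q(d), so the sampler covers a range
    # of difficulties instead of always taking the single hardest negative.
    log_w = (2.0 - dim) * dist.log() \
            - 0.5 * (dim - 3.0) * torch.log(torch.clamp(1.0 - 0.25 * dist.pow(2), min=1e-8))
    weights = torch.exp(log_w - log_w.max())        # stabilize before exponentiating
    weights = weights.masked_fill(pos_mask, 0.0)    # only true negatives can be drawn

    # fall back to uniform sampling over the negatives if a row ends up all-zero
    fallback = (~pos_mask).float()
    weights = torch.where(weights.sum(1, keepdim=True) > 0, weights, fallback)

    return torch.multinomial(weights, num_neg)
```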

darkpromise98 commented 1 month ago

Besides, the code implementation comes from the official repo [2] (for the image/single-modal retrieval task), and I made minor modifications for the cross-modal task.

[2] https://github.com/chaoyuaw/incubator-mxnet/blob/master/example/gluon/embedding_learning/model.py#L61
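As a rough illustration of the cross-modal adaptation, mined negatives can be plugged into a triplet loss in both retrieval directions. The sketch below reuses the sampling function from the previous comment; the exact loss form, direction weighting, and margin in the released code may differ, so treat this only as an assumed example.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Illustrative pairing of distance-weighted mining with a triplet loss
    for image-text matching (not the repo's actual implementation).

    img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each is a matched pair.
    """
    b = img_emb.size(0)
    # matched image-caption pairs sit on the diagonal
    pos_mask = torch.eye(b, dtype=torch.bool, device=img_emb.device)

    # mine negatives in both directions: captions for image anchors, images for caption anchors
    neg_txt = distance_weighted_sampling(img_emb, txt_emb, pos_mask).squeeze(1)
    neg_img = distance_weighted_sampling(txt_emb, img_emb, pos_mask).squeeze(1)

    d_pos = (img_emb - txt_emb).norm(dim=1)           # distance of matched pairs
    d_i2t = (img_emb - txt_emb[neg_txt]).norm(dim=1)  # image -> mined negative caption
    d_t2i = (txt_emb - img_emb[neg_img]).norm(dim=1)  # caption -> mined negative image

    loss = F.relu(margin + d_pos - d_i2t) + F.relu(margin + d_pos - d_t2i)
    return loss.mean()
```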

TitleZ99 commented 1 month ago

Thank you very much for the timely response. I will study the references you have given.