YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
BSD 2-Clause "Simplified" License

Where is contrastive loss implemented? How are the positive and negative samples defined? #23

Closed: ben2002chou closed this issue 7 months ago

ben2002chou commented 7 months ago

I have a question regarding the code and the paper. I can't identify where the contrastive loss code is or how the positive and negative samples are defined. Looking at your SSAST paper has given me a vague idea of how the contrastive loss might be implemented (seemingly by matching masked patches), but I would like to look further into the code and understand your implementation. Could you give me some more explanation of where and how this is implemented?

YuanGongND commented 7 months ago

hi there,

thanks so much for the question.

I can't identify where the contrastive loss code is

https://github.com/YuanGongND/cav-mae/blob/68fe8c2a3917dc2926e41f796bfdcb331a64b42c/src/models/cav_mae.py#L353-L373

The reason you cannot find it is likely that, for contrastive learning, the loss is defined inside the model rather than in the training pipeline, for implementation reasons.

and how the positive and negative samples are defined

Audio and video are naturally paired data. For a batch of, say, 64 samples, you have 64 audios and 64 videos; for each audio, only its own video forms a positive a-v pair, and the other 63 videos in the batch are negatives.
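Roughly, the idea is something like the following sketch of an in-batch audio-visual contrastive (InfoNCE-style) loss. This is simplified and not the exact code in cav_mae.py; the temperature value and the symmetric cross-entropy form here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InBatchContrastiveLoss(nn.Module):
    """Illustrative in-batch audio-visual contrastive loss (InfoNCE-style)."""

    def __init__(self, temperature=0.05):  # temperature value is an assumption
        super().__init__()
        self.temperature = temperature

    def forward(self, audio_emb, video_emb):
        # audio_emb, video_emb: (batch, dim) pooled audio / visual features;
        # row i of each tensor comes from the same naturally paired clip.
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)

        # (batch, batch) similarity matrix: the diagonal entries are the
        # positive pairs, all off-diagonal entries are in-batch negatives.
        sim = a @ v.t() / self.temperature

        # Cross-entropy against the diagonal, averaged over both retrieval
        # directions (audio-to-video and video-to-audio).
        targets = torch.arange(sim.size(0), device=sim.device)
        return 0.5 * (F.cross_entropy(sim, targets) +
                      F.cross_entropy(sim.t(), targets))
```

With a batch size of 64, each audio sees its own video as the single positive and the other 63 videos in the batch as negatives, and vice versa.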

-Yuan

ben2002chou commented 7 months ago

Thank you so much for your answer!