YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
BSD 2-Clause "Simplified" License

Some confusion about this paper and its implementation #17

Open skyzjsx opened 8 months ago

skyzjsx commented 8 months ago

I have some confusion about this paper.

   The first one is that the contrastive learning loss function usually consists of two parts (audio-to-visual and visual-to-audio similarity/distance). What is the reason for using only the single visual-to-audio direction in this paper?

   The second one is: what is the implementation of "frame aggregation"? In other words, how can I get the image frames from the whole video?

   The third one is: what is the design purpose of the modality type embeddings Ea and Ev?
YuanGongND commented 8 months ago

hi there,

For the first question:

The first one is that the contrastive learning loss function usually consists of two parts (audio-to-visual and visual-to-audio similarity/distance). What is the reason for using only the single visual-to-audio direction in this paper?

It was just for simplicity; a bi-directional loss would also be fine.

For the implementation, you can easily switch to the bi-directional loss by setting bidirect_contrast=True:

https://github.com/YuanGongND/cav-mae/blob/cd810cb54c020bcc1afbebaf2a57876e02ed6f7b/src/models/cav_mae.py#L361-L373
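For intuition, here is a minimal, hypothetical sketch of what a bi-directional (symmetric) contrastive loss looks like, written from the description above; the function name, temperature value, and details are assumptions for illustration, not the exact code in src/models/cav_mae.py:

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(audio_rep, video_rep, temperature=0.05):
    # audio_rep, video_rep: (batch, dim) pooled representations, one paired audio/video clip per row
    audio_rep = F.normalize(audio_rep, dim=-1)
    video_rep = F.normalize(video_rep, dim=-1)
    # similarity matrix: entry (i, j) compares audio i with video j
    logits = audio_rep @ video_rep.t() / temperature
    targets = torch.arange(audio_rep.size(0), device=audio_rep.device)
    # visual-to-audio direction only (the single-direction loss discussed above)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    # audio-to-visual direction; averaging both directions gives the symmetric loss
    loss_a2v = F.cross_entropy(logits, targets)
    return 0.5 * (loss_a2v + loss_v2a)
```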

For the checkpoint - we have released one model trained with the bi-directional loss: https://github.com/YuanGongND/cav-mae#cav-mae-pretrained-models-ablation-study, "CAV-MAE-Symc-Scale+", so you can compare the two designs yourself. I would expect only a minor change for joint classification, but it may make a difference for audio-visual retrieval.

For the second question:

The second one is: what is the implementation of "frame aggregation"? In other words, how can I get the image frames from the whole video?

This is described in the paper. To extract 10 frames from the video, use https://github.com/YuanGongND/cav-mae/blob/master/src/preprocess/extract_video_frame.py. In training, a random frame is used; in inference, we run the model once for each of the 10 frames and average the predictions (i.e., an ensemble).
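As a rough illustration of the idea (not the repo's extract_video_frame.py), a sketch of uniformly sampling 10 frames with OpenCV could look like this; the function name and the use of cv2 are assumptions:

```python
import cv2
import numpy as np

def extract_frames(video_path, num_frames=10):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # pick num_frames indices spread evenly across the video
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    # training: sample one frame at random; inference: run all 10 and average the predictions
    return frames
```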

For the third question:

The third one is: what is the design purpose of the modality type embeddings Ea and Ev?

This is based on a heuristic - basically, we want to let the unified Transformer layers know which tokens are audio and which are video input, but it might not be necessary. We haven't tested its impact.
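For concreteness, here is a hypothetical sketch of what modality type embeddings do: one learnable vector per modality (Ea, Ev) is added to every token of that modality before the shared Transformer layers. The class/attribute names are illustrative, not the repo's exact code:

```python
import torch
import torch.nn as nn

class ModalityTypeEmbedding(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.modality_a = nn.Parameter(torch.zeros(1, 1, embed_dim))  # Ea
        self.modality_v = nn.Parameter(torch.zeros(1, 1, embed_dim))  # Ev
        nn.init.trunc_normal_(self.modality_a, std=0.02)
        nn.init.trunc_normal_(self.modality_v, std=0.02)

    def forward(self, audio_tokens, video_tokens):
        # audio_tokens: (B, Na, D), video_tokens: (B, Nv, D)
        audio_tokens = audio_tokens + self.modality_a
        video_tokens = video_tokens + self.modality_v
        # concatenate into one sequence for the unified Transformer layers
        return torch.cat([audio_tokens, video_tokens], dim=1)
```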

-Yuan