YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".

Usage of audio-modality components for visual embeddings #7

Closed gchochla closed 1 year ago

gchochla commented 1 year ago

Hey! In the following line of code from the CAV-MAE decoder: https://github.com/YuanGongND/cav-mae/blob/6cb02fe785f1f8cd3529376074652db08733c674/src/models/cav_mae.py#L328, you are using audio-modality components to create the visual inputs to the decoder. Was this a deliberate choice? (To me, it looks like the code was copy-pasted from the audio modality and some changes were missed.) Thanks!

YuanGongND commented 1 year ago

Hi there,

Thanks for the question.

https://github.com/YuanGongND/cav-mae/blob/6cb02fe785f1f8cd3529376074652db08733c674/src/models/cav_mae.py#L323

and

https://github.com/YuanGongND/cav-mae/blob/6cb02fe785f1f8cd3529376074652db08733c674/src/models/cav_mae.py#L328

seem to be the same, but the audio slice is `x[:, 0 : self.patch_embed_a.num_patches - int(mask_a[0].sum()), :]` while the visual slice is `x[:, self.patch_embed_a.num_patches - int(mask_a[0].sum()) : (-1), :]`. In other words, the visual tokens start right where the audio tokens end, because in the encoder the audio and visual tokens are concatenated with audio first and visual second:

https://github.com/YuanGongND/cav-mae/blob/6cb02fe785f1f8cd3529376074652db08733c674/src/models/cav_mae.py#L299
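
To make the ordering concrete, here is a minimal, self-contained sketch of the slicing logic (the shapes, mask ratio, and variable names below are invented for illustration; only the audio-first / visual-second layout mirrors cav_mae.py):

```python
import torch

# Hypothetical sizes, purely for illustration; the real values come from
# patch_embed_a / patch_embed_v and the masking ratio used in cav_mae.py.
num_patches_a, num_patches_v, keep_ratio, dim = 512, 196, 0.25, 8

num_keep_a = int(num_patches_a * keep_ratio)  # unmasked audio tokens
num_keep_v = int(num_patches_v * keep_ratio)  # unmasked visual tokens

a = torch.randn(2, num_keep_a, dim)  # encoded (unmasked) audio tokens
v = torch.randn(2, num_keep_v, dim)  # encoded (unmasked) visual tokens

# Encoder side: audio tokens first, visual tokens second.
x = torch.cat([a, v], dim=1)

# Decoder side: split the joint sequence back into the two modalities.
# The visual slice starts where the audio slice ends, which is why its start
# index is computed from the *audio* patch count and the *audio* mask.
x_a = x[:, :num_keep_a, :]
x_v = x[:, num_keep_a:, :]

assert torch.equal(x_a, a) and torch.equal(x_v, v)
```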

Am I correct?

-Yuan

gchochla commented 1 year ago

Ah yes, you’re right, thanks for the quick response! I missed that the `:` changed place.