Closed: gchochla closed this issue 1 year ago.

**gchochla:** Hey! In the following line of code from the decoder of CAV-MAE, https://github.com/YuanGongND/cav-mae/blob/6cb02fe785f1f8cd3529376074652db08733c674/src/models/cav_mae.py#L328, you are using audio components to create the visual inputs to the decoder. Was this a deliberate choice? (To me, it looks like you copy-pasted the code from the audio modality and forgot to make some changes.) Thanks!
**YuanGongND:** Hi there,

Thanks for the question. The two lines seem to be the same, but the audio one is x[:, **(0):self.patch_embed_a.num_patches-int(mask_a[0].sum())**, :], while the visual one is x[:, **self.patch_embed_a.num_patches-int(mask_a[0].sum()):(-1)**, :]. That is, the visual tokens start right where the audio tokens end, because in the encoder, audio and visual tokens are concatenated with audio first and visual second.

Am I correct?

-Yuan
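For concreteness, here is a minimal sketch of that slicing with made-up sizes (`num_a` stands in for `self.patch_embed_a.num_patches - int(mask_a[0].sum())`, i.e. the number of kept audio tokens, and `num_v` for the number of kept visual tokens; neither name appears in the repo):

```python
import torch

# Made-up sizes for illustration only.
batch, embed_dim = 2, 768
num_a = 300  # kept audio tokens: patch_embed_a.num_patches - int(mask_a[0].sum())
num_v = 392  # kept visual tokens

# In the encoder, audio tokens come first and visual tokens second.
x = torch.cat([torch.zeros(batch, num_a, embed_dim),  # audio block
               torch.ones(batch, num_v, embed_dim)],  # visual block
              dim=1)

# The two slices differ only in which side of the ':' the boundary sits on.
x_audio = x[:, :num_a, :]   # tokens 0 .. num_a-1 -> audio
x_visual = x[:, num_a:, :]  # tokens num_a .. end -> visual

assert x_audio.shape == (batch, num_a, embed_dim) and (x_audio == 0).all()
assert x_visual.shape == (batch, num_v, embed_dim) and (x_visual == 1).all()
```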
**gchochla:** Ah yes, you're right, thanks for the quick response! I missed that the `:` changed place.