YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".

Cannot find the sample_video_extract_list.csv #22

Closed: JackieWang9811 closed this issue 8 months ago

JackieWang9811 commented 8 months ago

Hi, Dr. Gong!

Thanks for releasing the code of CAV-MAE. It's clearly great work! My only question is about this sentence in the README:

_Both scripts are simple, you will need to prepare a csv file containing a list of video paths (see for an example src/preprocess/sample_video_extract_list.csv)_

I haven't been able to find that file. For people who are new to this field, it may be difficult to follow without it.

Can you upload it? Thanks a lot!

YuanGongND commented 8 months ago

hi there,

Thanks so much, just updated.

Please see https://github.com/YuanGongND/cav-mae/blob/master/src/preprocess/sample_video_extract_list.csv

It is simple, but together with the attached video files you can have a quick try (i.e., you do not need to find external video files).
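If it helps, here is a minimal sketch of how such a list could be generated, assuming the file is simply one video path per row (the `sample_videos/` directory name is just a placeholder; check the linked sample csv for the exact format the preprocessing scripts expect):

```python
import csv
import glob

# Collect all .mp4 files under a local video directory (placeholder path)
# and write them into a one-column CSV, one video path per row.
video_paths = sorted(glob.glob("sample_videos/*.mp4"))

with open("sample_video_extract_list.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path in video_paths:
        writer.writerow([path])
```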

-Yuan

YuanGongND commented 8 months ago

please let me know if it does not work.

JackieWang9811 commented 8 months ago

> please let me know if it does not work.

Hi, Dr. Gong! Thanks a lot for the quick reply, it works. I have another question, about Table 16 in the paper:

[Screenshot of Table 16 from the paper]

What confuses me is that, after shuffling the matching pairs, the results of CAV-MAE and AV-MAE are the same. The mismatched pairs will inevitably collapse the contrastive learning objective to some extent, so from my point of view, the results of CAV-MAE should be worse. Why are they not?

Is there a good explanation for this result? Thanks a lot!

YuanGongND commented 8 months ago

thanks, it is a good catch.

Section J and Table 16 are in the appendix, but they took us a lot of time to produce. The results convey a lot of information, including the fact that AV-MAE itself does not learn audio-visual correspondence (with our architecture; it might work with other architectures).

> The mismatching pairs will inevitably collapse the objective of contrastive learning to a certain extent.

This is true. But the results are for joint audio-visual classification with full fine-tuning, and note that full fine-tuning can override a lot.
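To make the intuition concrete, here is a minimal sketch (not the exact implementation in the repo, and the names are illustrative) of a symmetric InfoNCE-style audio-visual contrastive loss. It assumes the i-th audio and i-th visual clip in a batch form the positive pair, so if the pairing is shuffled, the diagonal "positives" are no longer true pairs and the objective starts pulling together unrelated clips:

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss. audio_emb and visual_emb are
    (batch, dim) tensors; row i of each is assumed to come from the
    same video clip (the positive pair)."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives assumed on the diagonal
    # If the audio-visual pairing is shuffled, the diagonal no longer
    # contains true pairs, so this objective rewards matching unrelated clips.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```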

Everything we feel confident saying is in Section J. Table 16 honestly reports what we got from the experiments.

It would be nice to open a separate issue for each different question, as that would make it easier for people to search.

-Yuan