YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
BSD 2-Clause "Simplified" License

Finetune CAVMAE on ESC50 #8

Open kaiw7 opened 1 year ago

kaiw7 commented 1 year ago

Hi Yuan, did you finetune CAV-MAE on the ESC-50 dataset? Could you advise me on the training pipeline? Thank you very much.

YuanGongND commented 1 year ago

No, we didn't do ESC-50 experiments with CAV-MAE, but I expect it to have similar or better performance compared with AST.

In general, we cleaned and released all the code for the main manuscript and part of the appendix. It is hard for me to clean up the rest as I have limited time. The ESC-50 experiments are not in the main manuscript or appendix; we honestly don't have that code.

-Yuan

YuanGongND commented 1 year ago

You can refer to the audio-only recipe and the AST ESC-50 recipe to do it yourself.

kaiw7 commented 1 year ago

Hi Yuan, thanks for your suggestions. I tried ESC-50 but only got about 88% accuracy. In the implementation, when I load the checkpoint I get an error about a dimension mismatch for 'module.pos_embed_a'. I know this is caused by the different audio length; for ESC-50, the target length is set to 512. What I did was skip loading the 'module.pos_embed_a' parameter and instead train a new 'module.pos_embed_a' with the new sequence length from scratch during ESC-50 training. I am not sure if this affects the performance.

YuanGongND commented 1 year ago

It would be much better to trim module.pos_embed_a to the desired length instead of randomly initializing it.
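A hedged sketch of what that trimming could look like, assuming a time-major flattening of the audio patch grid (index = t * n_freq + f); that layout is my assumption, so verify it against the actual PatchEmbed in this repo before using:

```python
import torch

def trim_pos_embed_a(pos_embed, t_old=64, t_new=32, n_freq=8):
    """Trim an audio positional embedding along the time axis.

    pos_embed: [1, t_old * n_freq, dim], assumed flattened time-major
    (index = t * n_freq + f) -- check this repo's PatchEmbed to confirm.
    """
    _, n, dim = pos_embed.shape
    assert n == t_old * n_freq
    grid = pos_embed.reshape(1, t_old, n_freq, dim)  # e.g. [1, 64, 8, 768]
    grid = grid[:, :t_new]                           # keep the first t_new time steps
    return grid.reshape(1, t_new * n_freq, dim)      # e.g. [1, 256, 768]
```

With t_old=64, t_new=32, n_freq=8 this maps the pretrained [1, 512, 768] embedding to [1, 256, 768] for a 512-frame ESC-50 input, keeping the learned values for the retained time steps.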

Another method is to just pad all ESC-50 recordings to 10s; the script should automatically do that.
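If the script's padding needs to be replicated, a minimal sketch of the usual fbank padding step (target_length=1024 corresponds to roughly 10s at a 10ms frame shift; the function name is mine, not from this repo):

```python
import torch.nn.functional as F

def pad_or_trim_fbank(fbank, target_length=1024):
    # fbank: [n_frames, n_mels]; zero-pad (or trim) along the time axis
    n_frames = fbank.shape[0]
    if n_frames < target_length:
        # F.pad pads the last dim first: (left, right, top, bottom)
        fbank = F.pad(fbank, (0, 0, 0, target_length - n_frames))
    else:
        fbank = fbank[:target_length]
    return fbank
```

A ~5s ESC-50 clip (about 512 frames) would then come out as a [1024, n_mels] tensor with zeros in the second half, matching the pretrained input length.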

Btw, 88% isn't bad for a model without supervised AudioSet training. For better results, start with an AudioSet supervised pretrained checkpoint, e.g., https://github.com/YuanGongND/cav-mae#cav-mae-pretrainedfinetuned-models.

kaiw7 commented 1 year ago

Hi Yuan, many thanks for your patient response. I tried to trim 'module.pos_embed_a' to match the desired length. First, after loading the pretrained model, 'module.pos_embed_a' has a shape of [1, 512, 768]. Then, it is reshaped into [1, 768, 8, 64]. Because the length of ESC-50 audio is about half that of AudioSet, the desired positional embedding has a shape of [1, 768, 8, 32]. Finally, it is reshaped back into [1, 256, 768]. Do you think this is reasonable? In addition, I use a [16, 16] stride instead of [10, 10]; I am not sure whether the stride has an important influence on performance. What is more, I would like to check with you whether the positional embedding is learnable or fixed?

YuanGongND commented 1 year ago

Hi, I apologize, but I don't have time to follow up on issues about applying CAV-MAE to a new application/dataset, especially regarding competitive performance. Usually, some tuning is needed, e.g., learning rate, batch size, etc.

First, after loading the pretrained model, 'module.pos_embed_a' has a shape of [1, 512, 768]. Then, it is reshaped into [1, 768, 8, 64]. Because the length of ESC-50 audio is about half that of AudioSet, the desired positional embedding has a shape of [1, 768, 8, 32]. Finally, it is reshaped back into [1, 256, 768].

You need to check the order of the time and frequency dimensions. torch.reshape is different from torch.permute: when you convert [1, 512, 768] to [1, 768, 512], you need torch.permute instead of reshape. There might be other things you need to take care of.
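A tiny illustration of the difference (generic PyTorch, not code from this repo):

```python
import torch

x = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# reshape just re-reads the same memory in a new shape
print(x.reshape(3, 2))   # tensor([[0, 1], [2, 3], [4, 5]])

# permute actually swaps the axes (a true transpose)
print(x.permute(1, 0))   # tensor([[0, 3], [1, 4], [2, 5]])
```

Using reshape where a transpose is needed silently scrambles which positional vector belongs to which (time, frequency) patch, which would hurt performance without raising any error.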

In addition, I use a [16, 16] stride instead of [10, 10]; I am not sure whether the stride has an important influence on performance.

The code in this repo only supports a [16, 16] stride (no overlap). For other strides, you need to implement that yourself.

What is more, I would like to check with you whether the positional embedding is learnable or fixed?

Please check the code. I cannot recall, but I do recall that performance-wise they are very similar.

I might not be able to follow up on this further.