YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".

some problem about finetuning #27

thirteen-bears opened this issue 5 months ago

thirteen-bears commented 5 months ago

Dear Dr Gong,

I tried to fine-tune the audio encoder on the ESC-50 dataset using the pretrained CAV-MAE model, but the performance is far below my expectations. I have listed all the details and tricks I used below; I wonder if I missed anything during fine-tuning.

I trained your audio-MAE model (one branch of CAV-MAE, using ViT-B and batch size = 256) on the K400 training set for 200 epochs, then loaded the model and fine-tuned it on ESC-50.

The ESC-50 audio clips are 5 seconds long, so I use num_melbin = 128 and target length = 512, instead of the target length = 1024 used for K400 and AudioSet (where the audio is 10 s long). I do not load the positional embedding from the pretrained model, since there is a shape mismatch in the positional embedding.
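
Concretely, my feature extraction looks roughly like this (a sketch; the Kaldi-compatible fbank call mirrors the dataloaders in this family of repos, and the path is a placeholder):

```python
import torch
import torchaudio

# Sketch: 128 mel bins, target length 512 frames for 5 s audio at 16 kHz.
# The path is a placeholder.
waveform, sr = torchaudio.load("esc50_clip_16k.wav")
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

# 5 s at a 10 ms frame shift gives ~500 frames; pad (or trim) to 512.
n_frames = fbank.shape[0]
if n_frames < 512:
    fbank = torch.nn.functional.pad(fbank, (0, 0, 0, 512 - n_frames))
else:
    fbank = fbank[:512, :]
```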

During fine-tuning, I followed the same data augmentation (frequency-domain and time-domain masking) with the hyperparameters from your "SSAST" folder for ESC-50. I also use `scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=lr_patience, verbose=True)` to adjust the learning rate, and I tried several fixed learning rates and head-learning-rate ratios with learning-rate decay. However, the validation accuracy stays around 60% and hardly goes up.
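
The masking I apply looks roughly like this (a sketch using torchaudio's SpecAugment-style transforms; the mask widths here are placeholders, not the exact SSAST hyperparameters):

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking on a log-mel spectrogram.
# Mask widths (24, 96) are placeholders.
freq_mask = T.FrequencyMasking(freq_mask_param=24)
time_mask = T.TimeMasking(time_mask_param=96)

fbank = torch.randn(1, 128, 512)  # [channel, num_mel_bins, target_length]
augmented = time_mask(freq_mask(fbank))
```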

I also found another paper, VICReg. In its appendix, they use plain supervised learning on ESC-50 with a ResNet-18 backbone, without any pretrained model, and get 72.7% accuracy (see the image below). I just wonder if there is anything I missed during fine-tuning.

[Screenshot: results table from the VICReg appendix]
YuanGongND commented 5 months ago

hi there,

thanks for the question.

> I tried to fine-tune the audio encoder on the ESC-50 dataset using the pretrained CAV-MAE model, but the performance is far below my expectations. I have listed all the details and tricks I used below; I wonder if I missed anything during fine-tuning.

> I trained your audio-MAE model (one branch of CAV-MAE, using ViT-B and batch size = 256) on the K400 training set for 200 epochs, then loaded the model and fine-tuned it on ESC-50.

This does not sound straightforward to me. You can just take our pretrained model and fine-tune it on ESC-50 (no need for the K400 round). We provide many checkpoints.

> The ESC-50 audio clips are 5 seconds long, so I use num_melbin = 128 and target length = 512, instead of the target length = 1024 used for K400 and AudioSet (where the audio is 10 s long). I do not load the positional embedding from the pretrained model, since there is a shape mismatch in the positional embedding.

Not loading the positional embedding is certainly not correct. You can either 1/ pad the audio to 10 s, or 2/ trim the positional embedding to the first 5 s (but be careful when reshaping things). That is, if the pretrained pos_embed has shape [batch_size, time_length=1024, dim] and your audios are 5 s, do something like pos_embed = pos_embed[:, :512, :].
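
To make the "be careful when reshaping" point concrete, here is a hedged sketch of option 2/, assuming the positional embedding is laid out as a flattened 2D (time × frequency) patch grid rather than raw frames (with target_length=1024, num_mel_bins=128, and 16×16 patches, that grid is 64 × 8 = 512 patches); verify the actual layout in the checkpoint, including any cls tokens, before slicing:

```python
import torch

def trim_pos_embed(pos_embed, time_patches=64, freq_patches=8, keep_time=32):
    """Keep the positional embeddings of the first `keep_time` time patches
    (32 of 64 time patches corresponds to the first 5 s of a 10 s input).
    Assumes pos_embed is [1, time_patches * freq_patches, dim] with the time
    axis varying slowest; check the actual layout in the checkpoint."""
    b, n, d = pos_embed.shape
    assert n == time_patches * freq_patches
    grid = pos_embed.reshape(b, time_patches, freq_patches, d)
    return grid[:, :keep_time, :, :].reshape(b, keep_time * freq_patches, d)
```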

A practical solution is to start from https://github.com/YuanGongND/cav-mae/blob/master/egs/audioset/run_cavmae_ft_bal_audioonly.sh, keep most hyperparameters unchanged, skip the extra round of K400 training (which could have introduced a bug in your code), and make only the necessary changes, e.g., switch to CE loss and the accuracy metric for ESC-50, and try different learning rates.

https://github.com/YuanGongND/cav-mae/blob/68fe8c2a3917dc2926e41f796bfdcb331a64b42c/egs/audioset/run_cavmae_ft_bal_audioonly.sh#L68

For the first try, just pad the audio to 10 s (our script can handle this automatically); this minimizes the risk of introducing a bug. Also double-check that your audio is at 16 kHz (ESC-50 ships at 44.1 kHz).
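
For the 16 kHz check, something along these lines works (a sketch; the paths are placeholders):

```python
import torchaudio

# Resample an ESC-50 clip (shipped at 44.1 kHz) to the 16 kHz expected by
# the pretrained models. Paths are placeholders.
waveform, sr = torchaudio.load("ESC-50/audio/1-100032-A-0.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("ESC-50-16k/1-100032-A-0.wav", waveform, 16000)
```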

> I also found another paper, VICReg. In its appendix, they use plain supervised learning on ESC-50 with a ResNet-18 backbone, without any pretrained model, and get 72.7% accuracy (see the image below). I just wonder if there is anything I missed during fine-tuning.

We can only promise that all results in the paper can be reproduced; to back that promise, we release checkpoints, the actual training scripts, and logs. We cannot support requests on datasets that are not in the paper (and since I don't see your code, it is hard to judge).

But I agree that 60% on ESC-50 is very low. Below is a script that you can run in one click and get ~95% accuracy; you can start from it.

https://github.com/YuanGongND/ast?tab=readme-ov-file#esc-50-recipe