YuanGongND / cav-mae

Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
BSD 2-Clause "Simplified" License

what is the validation set for finetuning? #19

Open thirteen-bears opened 8 months ago

thirteen-bears commented 8 months ago

In your paper, you mentioned using Audioset-2M and Audioset-20k for fine-tuning during your experiments. However, I am curious about the process of splitting the training and validation data during the fine-tuning stage.

I know that in VGGsound dataset, they have already split training and validation samples.

Could you kindly elaborate on how this division was carried out for Audioset? Since in the fine-tuning stage, we always need to split the dataset into training and validation data. Do you use Audioset-2M or Audioset-20k to finetune and use Audioset-Eval to validate?

YuanGongND commented 8 months ago

hi there,

Thanks for the question. No, we didn't split a validation set from AudioSet. The reason is that some sound classes only have ~50 samples in the dataset, so it is hard to split a meaningful validation set without hurting training performance.

Nor do we select the model based on the test set mAP; instead, we train the model with a fixed training schedule and use either weight averaging or the last epoch as the final model. It isn't perfect, but in practice choosing any model after a certain epoch leads to very similar mAP on the evaluation set, particularly with the weight averaging strategy (i.e., averaging the weights of the last few checkpoints element-wise; please check our PSLA paper).
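
For intuition, here is a minimal sketch of the element-wise checkpoint weight averaging idea described above. It is not the repo's actual implementation, and the checkpoint file naming is a hypothetical assumption:

```python
# Minimal sketch of weight averaging (WA): element-wise average of the parameters
# stored in several fine-tuning checkpoints. The file naming scheme below is a
# hypothetical assumption, not the repo's convention.
import torch

def average_checkpoints(paths):
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg_state:
                avg_state[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# e.g., average the checkpoints saved at the last few fine-tuning epochs
wa_state = average_checkpoints([f"exp/models/audio_model.{e}.pth" for e in range(10, 16)])
# model.load_state_dict(wa_state)  # load the averaged weights before evaluation
```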

For the other question in the email, we have released the training / test ids at https://github.com/YuanGongND/cav-mae#audioset-and-vggsound-data-lists. I hope this helps. My personal recommendation: if you want to compare with CAV-MAE, the easiest way is to take our numbers from the paper (no experiment needed), followed by evaluating our checkpoint on your evaluation set (to remove the impact of eval set differences), followed by training a CAV-MAE on your training data with our recipe (to remove the impact of training and eval data differences). To facilitate reproduction, we have also released our training logs, which contain all the hyper-parameters you would need, at https://github.com/YuanGongND/cav-mae/tree/master/egs/audioset/training_logs and https://github.com/YuanGongND/cav-mae/tree/master/egs/vggsound/training_logs.

-Yuan

thirteen-bears commented 8 months ago

Thanks a lot for your detailed reply, Yuan. I have another direct question about fine-tuning validation. In Table 1 of your CAV-MAE paper, you report the mAP of the model fine-tuned on AudioSet-20K, which is pretrained on AudioSet-2M. Is that mAP computed at a certain epoch?

YuanGongND commented 8 months ago

hi there,

In Table 1, the AS-20K model is pretrained on AudioSet-2M (without labels) and fine-tuned on the AS-20K data (with labels).

As I mentioned before, a weight averaging (WA) strategy is used, i.e., we average the weights of the checkpoints starting from the 3rd epoch to the end (the 15th epoch). This makes the mAP less sensitive to the number of training epochs. You can verify this yourself: averaging epochs 3-15, 4-15, 5-15, etc. will lead to very similar results.
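
As a sanity-check sketch of that suggestion (not code from this repo), one could average from several start epochs and compare the resulting evaluation mAPs; `load_finetuned_model`, `evaluate_map`, `eval_loader`, and the checkpoint paths below are all hypothetical placeholders:

```python
# Hypothetical sanity check: average checkpoints from several start epochs up to
# epoch 15 and confirm the evaluation mAP barely changes. All helper names and
# paths here are placeholders, not functions from this repository.
import torch

def averaged_state(epochs, path_tmpl="exp/models/audio_model.{}.pth"):
    states = [torch.load(path_tmpl.format(e), map_location="cpu") for e in epochs]
    return {k: sum(s[k].float() for s in states) / len(states) for k in states[0]}

for wa_start in (3, 4, 5):
    model = load_finetuned_model()                     # placeholder: build the fine-tuned model
    model.load_state_dict(averaged_state(range(wa_start, 16)))
    print(wa_start, evaluate_map(model, eval_loader))  # placeholder eval; mAPs should be very close
```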

With time, my memory of the details has become blurry, but we documented and released all of these hyper-parameters, along with the training logs, at https://github.com/YuanGongND/cav-mae/tree/master/egs/audioset.

Specifically for the WA on AS-20K:

https://github.com/YuanGongND/cav-mae/blob/cd810cb54c020bcc1afbebaf2a57876e02ed6f7b/egs/audioset/run_cavmae_ft_bal.sh#L36-L37

https://github.com/YuanGongND/cav-mae/blob/master/egs/audioset/run_cavmae_ft_bal.sh

-Yuan

thirteen-bears commented 8 months ago

Thanks a lot for your detailed reply!

I think I found what I wanted in your "sh" file. You use "audio-eval" as the test dataset. (I should have used the term "test set" rather than "validation set".)

Also, thanks a lot for your explanation for "WA". It really helps me to better understand the code.

-Han

YuanGongND commented 8 months ago

Thanks.

For the training pipeline, please see this paper for details: https://arxiv.org/pdf/2102.01243.pdf. It should contain everything you need.

Whenever the dataset has an official validation set, you should use the validation set to select the model.
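
For datasets with an official validation split (e.g., VGGSound), a minimal sketch of validation-based model selection might look like the following; `train_one_epoch`, `evaluate_map`, and the data loaders are hypothetical placeholders, not code from this repository:

```python
# Hypothetical sketch of validation-based model selection: keep the checkpoint
# with the best validation mAP, then report results on the test set once.
best_map, best_state = -1.0, None
for epoch in range(1, num_epochs + 1):
    train_one_epoch(model, train_loader)          # placeholder training step
    val_map = evaluate_map(model, val_loader)     # mAP on the official validation set
    if val_map > best_map:
        best_map = val_map
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
model.load_state_dict(best_state)
test_map = evaluate_map(model, test_loader)       # final number reported on the test set
```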

thirteen-bears commented 8 months ago

Thanks a lot. I will read this paper. It seems that this paper will help me a lot!