thirteen-bears opened this issue 1 year ago
hi there,
Thanks for the question. No, we didn't split a validation set from AudioSet. The reason is that some sound classes have only ~50 samples in the dataset, so it is hard to split off a meaningful validation set without hurting training performance.
Nor do we select the model based on the test set mAP; instead, we train the model with a fixed training schedule and use either weight averaging or the last epoch as the final model. It isn't perfect, but in practice choosing any model after a certain epoch leads to very similar mAP on the evaluation set, particularly with the weight averaging strategy (i.e., averaging the weights of the last few checkpoints element-wise; please check our PSLA paper).
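If it helps, the element-wise averaging is conceptually very simple. Below is a minimal sketch (not the exact code from our repo; the checkpoint file names and the epoch range are placeholders you would adapt to your own setup):

```python
# Minimal sketch of element-wise checkpoint weight averaging (WA).
# The file naming scheme and epoch range are placeholders, not the repo's actual paths.
import torch

def average_checkpoints(ckpt_paths):
    """Element-wise average of model weights across a list of checkpoint files."""
    avg_state, n = None, len(ckpt_paths)
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # start the running sum from the first checkpoint
            avg_state = {k: v.clone().float() if v.is_floating_point() else v.clone()
                         for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    avg_state[k] += v.float()
                # integer buffers (e.g., BatchNorm counters) keep the first checkpoint's value
    for k, v in avg_state.items():
        if v.is_floating_point():
            avg_state[k] = v / n
    return avg_state

# e.g., average the checkpoints saved at the end of epochs 3 through 15
paths = [f"exp/models/audio_model.{e}.pth" for e in range(3, 16)]
torch.save(average_checkpoints(paths), "exp/models/audio_model_wa.pth")
```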
For the other question in the email, we have released the training / test ids at https://github.com/YuanGongND/cav-mae#audioset-and-vggsound-data-lists. I hope this helps. My personal recommendation: if you want to compare with CAV-MAE, the easiest way is to take our numbers from the paper (no experiment needed), followed by evaluating our checkpoint on your evaluation set (which removes the impact of eval-set differences), followed by training a CAV-MAE on your training data with our recipe (which also removes the impact of training-data differences). To facilitate reproduction, we also released our training logs, which contain all the hyper-parameters you would need, at https://github.com/YuanGongND/cav-mae/tree/master/egs/audioset/training_logs and https://github.com/YuanGongND/cav-mae/tree/master/egs/vggsound/training_logs.
-Yuan
Thanks a lot for your detailed reply, Yuan. I have another direct question about validation during fine-tuning. In Table 1 of your CAV-MAE paper, you report the mAP of the model fine-tuned on AudioSet-20K; the model is pretrained on AudioSet-2M. Is that mAP computed on AudioSet-20K at a certain epoch?
hi there,
In Table 1, the AS-20K model is pretrained on AudioSet-2M (without labels) and fine-tuned on the AS-20K data (with labels).
As I mentioned before, a weight averaging (WA) strategy is used, i.e., we average the weights of the checkpoints from the 3rd epoch through the last one (the 15th epoch). This makes the mAP less sensitive to the number of training epochs. You can verify this yourself: averaging epochs 3-15, 4-15, 5-15, etc. leads to very similar results.
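If you want to run that check, it is roughly the loop below (again just a sketch, not our exact evaluation code; `build_model`, `eval_loader`, and the checkpoint paths are placeholders for your own model constructor, AudioSet eval loader, and saved checkpoints; mAP here is the standard macro-averaged multi-label average precision):

```python
# Sketch: compare weight averaging over epochs 3-15, 4-15, and 5-15 on the eval set.
# build_model(), eval_loader, and the checkpoint paths are placeholders.
import numpy as np
import torch
from sklearn.metrics import average_precision_score

def eval_map(model, loader, device="cuda"):
    """Macro-averaged mean average precision for multi-label classification."""
    model.eval().to(device)
    scores, targets = [], []
    with torch.no_grad():
        for x, y in loader:
            scores.append(torch.sigmoid(model(x.to(device))).cpu().numpy())
            targets.append(y.numpy())
    scores, targets = np.concatenate(scores), np.concatenate(targets)
    return np.mean([average_precision_score(targets[:, c], scores[:, c])
                    for c in range(targets.shape[1])])

for start in (3, 4, 5):
    paths = [f"exp/models/audio_model.{e}.pth" for e in range(start, 16)]
    states = [torch.load(p, map_location="cpu") for p in paths]
    # element-wise average of the checkpoints from `start` through epoch 15
    avg = {k: torch.stack([s[k].float() for s in states]).mean(dim=0) for k in states[0]}
    model = build_model()                        # placeholder: your model constructor
    model.load_state_dict(avg)
    print(f"WA {start}-15 eval mAP: {eval_map(model, eval_loader):.4f}")
```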
With time, my memory of the details has become blurry, but we documented and released all these hyperparameters, along with the training logs, at https://github.com/YuanGongND/cav-mae/tree/master/egs/audioset.
Specifically for the WA on AS-20K:
https://github.com/YuanGongND/cav-mae/blob/cd810cb54c020bcc1afbebaf2a57876e02ed6f7b/egs/audioset/run_cavmae_ft_bal.sh#L36-L37
-Yuan
Thanks a lot for your detailed reply!
I think I found what I wanted in your "sh" file. You use "audio-eval" as the test dataset. (I should probably say "test set" rather than "validation set"~)
Also, thanks a lot for your explanation of "WA". It really helps me better understand the code.
-Han
Thanks.
For the training pipeline, please see this paper for details: https://arxiv.org/pdf/2102.01243.pdf; it should contain everything you need.
Whenever the dataset has an official validation set, you should use the validation set to select the model.
Thanks a lot. I will read this paper. It seems that this paper will help me a lot!
In your paper, you mention using AudioSet-2M and AudioSet-20K for fine-tuning in your experiments. However, I am curious about how the training and validation data were split during the fine-tuning stage.
I know that the VGGSound dataset already provides a training/validation split.
Could you kindly elaborate on how this split was done for AudioSet? In the fine-tuning stage, we always need to split the data into training and validation sets. Do you use AudioSet-2M or AudioSet-20K for fine-tuning and AudioSet-Eval for validation?