Closed zhangron013 closed 1 month ago
Sorry for my late reply.
Yes, you could try different hyper-parameters and also fine-tune the audio encoder.
thanks~
By the way, when I unfreeze HTSAT for fine-tuning, I find that as the epochs increase the CIDEr metric keeps decreasing, even though the loss also keeps decreasing. For example, epoch 1: CIDEr = 0.2289; epoch 11: CIDEr = 0.1166. And by epoch 19, many of the generated captions become meaningless, like:
'a sound is being recorded' or 'an object is playing a sound'
I think this may be overfitting. Have you encountered this in your previous training? My experimental configuration is as follows:
lr=5e-5
train datasets: the train splits of AudioCaps, Clotho, and WavCaps
test dataset: the evaluation split of Clotho
load model: pretrained HTSAT and pretrained BART
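One common mitigation when unfreezing a pre-trained encoder is to give it a much smaller learning rate than the rest of the model via optimizer parameter groups. A minimal PyTorch sketch; the `encoder`/`decoder` modules and the 5e-6 encoder LR are illustrative placeholders, not the repo's actual structure or a recommended value:

```python
import torch
from torch import nn

# Toy stand-ins for the audio encoder (e.g. HTSAT) and text decoder (e.g. BART).
model = nn.ModuleDict({
    "encoder": nn.Linear(16, 16),
    "decoder": nn.Linear(16, 16),
})

# Two parameter groups: a much smaller LR for the newly unfrozen encoder,
# the usual LR (5e-5, as in the config above) for the decoder.
optimizer = torch.optim.AdamW([
    {"params": model["encoder"].parameters(), "lr": 5e-6},
    {"params": model["decoder"].parameters(), "lr": 5e-5},
])

for group in optimizer.param_groups:
    print(group["lr"])
```

This keeps the pre-trained encoder close to its initialization while the decoder adapts, which often delays the kind of degradation described above.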
For pretraining on the whole dataset, I didn't encounter this issue. When fine-tuning on AudioCaps, overfitting usually occurs.
Also, the CIDEr score in your experiments looks a bit low.
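Since the loss keeps falling while CIDEr degrades, selecting the checkpoint by validation CIDEr rather than by loss (with early stopping) is a standard remedy. A minimal, framework-agnostic sketch; the class and its `patience` default are hypothetical, not part of the repo:

```python
class BestMetricTracker:
    """Track the best validation score (higher is better, e.g. CIDEr)
    and signal early stopping after `patience` epochs without improvement."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch: int, score: float) -> bool:
        """Record this epoch's score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0  # this is also where one would save the checkpoint
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# With CIDEr scores like those reported above, training stops long before epoch 19:
tracker = BestMetricTracker(patience=3)
for epoch, cider in enumerate([0.2289, 0.21, 0.19, 0.15, 0.1166], start=1):
    if tracker.update(epoch, cider):
        break
print(tracker.best_epoch, tracker.best_score)  # best checkpoint: epoch 1
```

The intermediate CIDEr values in the loop are made up for illustration; only epochs 1 and 11 were reported above.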
ok~ thanks for your reply!
Dear author, hi! Thank you for providing such wonderful work! I have tried using this framework with other audio encoders, such as BEATs, in place of HTSAT. As before, I used the pre-trained BEATs and froze it, and I did not change the training hyper-parameters (e.g. lr=1e-4). However, after ten epochs of training, the outputs were many identical, meaningless captions. I wonder whether this is due to an incorrect learning rate or other hyper-parameter settings?
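When swapping in a different encoder, a quick sanity check is to confirm it is actually frozen before training. A sketch with a stand-in module (the `nn.Sequential` here is illustrative, not the actual BEATs class):

```python
from torch import nn

# Stand-in for a pre-trained audio encoder such as BEATs (illustrative only).
audio_encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 64))

# Freeze every encoder parameter so only the decoder receives gradients.
for p in audio_encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in audio_encoder.parameters() if p.requires_grad)
print(trainable)  # 0 trainable encoder parameters
```

If this count is nonzero, or the encoder's parameters were accidentally passed to the optimizer, the frozen-encoder setup described above would silently differ from the intended one.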