YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Validation loss vs Training loss in AudioSet training #31

Open Tomlevron opened 3 years ago

Tomlevron commented 3 years ago

Hi!

First of all, I would like to thank you for sharing your amazing work with everyone! Truly inspiring and fascinating work you shared with us.

I have a question regarding the difference between the training loss and the validation loss. It seems that the validation loss is much higher than the training loss. Does that make sense? Isn't it overfitting?

I also tried to fine-tune the AudioSet-trained model on my data, and it showed the same difference (with and without augmentations).

Here is an example from the logs: test-full-f10-t10-pTrue-b12-lr1e-5/log_2090852.txt:

train_loss: 0.011128
valid_loss: 0.693989

I'm still new to deep learning so maybe I'm missing something.

Thank you!

YuanGongND commented 2 years ago

Thanks for your interest.

I don't think it is an overfitting issue: if the model were overfitting, you would also see a performance drop in mAP or accuracy on the validation set. I think the reason is that we add a Sigmoid function on top of the model output in the inference stage (but not in the training stage) before the loss computation, to make sure mAP/accuracy is calculated correctly. That changes the validation loss. See here.
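A minimal sketch of the effect (illustrative values, not the repo's actual traintest.py code): BCEWithLogitsLoss applies its own Sigmoid internally, so feeding it already-Sigmoided outputs at validation time gives a different loss value than feeding it raw logits at training time.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, 527)                      # raw model outputs (e.g., 527 AudioSet classes)
targets = torch.randint(0, 2, (4, 527)).float()   # multi-hot labels

train_loss = criterion(logits, targets)                  # training: loss on raw logits
valid_loss = criterion(torch.sigmoid(logits), targets)   # inference path: Sigmoid applied first

print(train_loss.item(), valid_loss.item())  # the two values differ for the same predictions
```

Since the Sigmoided outputs lie in [0, 1], the second Sigmoid inside the loss squashes them into roughly (0.5, 0.73); for well-predicted negative labels (which dominate AudioSet's sparse multi-hot targets) the per-element loss then sits near log 2 ≈ 0.693, which is roughly the validation loss in the log above.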

-Yuan

hbellafkir commented 2 years ago

Wouldn't it be wrong to train with Softmax and use Sigmoid for mAP? Using Softmax instead of Sigmoid gives a higher mAP value.

YuanGongND commented 2 years ago

Could you elaborate on this point?

I think we did not use softmax during training. The reason we added an extra Sigmoid in inference but not in training is that BCEWithLogitsLoss already includes a Sigmoid.
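For reference, a quick check (a minimal sketch, not repo code) that BCEWithLogitsLoss is just BCELoss with the Sigmoid folded in:

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 5)
targets = torch.rand(2, 5)

loss_a = nn.BCEWithLogitsLoss()(logits, targets)        # Sigmoid applied internally
loss_b = nn.BCELoss()(torch.sigmoid(logits), targets)   # explicit Sigmoid, then plain BCE

print(torch.allclose(loss_a, loss_b))  # True (up to numerical precision)
```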

hbellafkir commented 2 years ago

In the case of CrossEntropyLoss, the loss is calculated with Softmax (here), as it is included in the CrossEntropyLoss operation. To my understanding, it is not correct to use Sigmoid at inference when CrossEntropyLoss is used in training. On a custom dataset that I use, switching from Sigmoid to Softmax results in a higher mAP value during inference.
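A quick check of this point (a minimal sketch, not repo code): CrossEntropyLoss is LogSoftmax plus NLLLoss in one operation, so Softmax is the matching activation at inference.

```python
import torch
import torch.nn as nn

logits = torch.randn(3, 10)             # single-label setting: one class per example
targets = torch.randint(0, 10, (3,))

loss_a = nn.CrossEntropyLoss()(logits, targets)                   # LogSoftmax + NLLLoss in one op
loss_b = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)  # same computation, written out

print(torch.allclose(loss_a, loss_b))  # True
```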

hbellafkir commented 2 years ago

@YuanGongND any thoughts on this?

YuanGongND commented 2 years ago

Yes, I think you can skip the Sigmoid in inference. It was just there to keep training and inference consistent for multi-label classification tasks (i.e., one audio clip has more than one label).

When you use CrossEntropyLoss, I assume you have a single-label dataset. Using Softmax there might improve mAP, but it won't improve accuracy, and mAP is less important for single-label classification; that's why we use accuracy in the ESC-50 and SpeechCommands recipes.

For multi-label classification, adding the Sigmoid won't change mAP either, since Sigmoid is monotonic, so I think you can also remove it; that could, however, impact the ensemble performance.
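A small numeric illustration (made-up logits, not repo code) of why the element-wise Sigmoid preserves the per-class ranking of examples, and hence mAP, while Softmax can reorder examples within a class:

```python
import torch

# Two clips' logits for 3 classes; clip A has a higher raw class-0 logit than clip B.
logits = torch.tensor([[2.0, 5.0, 0.0],   # clip A
                       [1.5, 0.0, 0.0]])  # clip B

# Sigmoid is applied element-wise, so the class-0 ranking (A above B) is preserved.
print(torch.sigmoid(logits)[:, 0])          # tensor([0.8808, 0.8176]) -> A still above B

# Softmax normalizes across classes within each clip, so the class-0 ranking can flip.
print(torch.softmax(logits, dim=1)[:, 0])   # tensor([0.0471, 0.6914]) -> B now above A
```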

YuanGongND commented 2 years ago

After some investigation, this appears to be a logging issue: the gap between training and validation loss is overestimated in the code.

In traintest.py, the loss_meter is only reset once per epoch, but its average is printed every 1000 iterations, so the large loss values from the early iterations are still included in the printed average.

Changing loss_meter.avg to loss_meter.val here alleviates this problem. But I would suggest doing an offline loss evaluation (i.e., computing the training loss with the best checkpoint model after the training process finishes); that would be the most accurate solution.
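A minimal sketch of the pattern (the names here are illustrative, not copied verbatim from the repo's utilities): .avg is the running mean since the last reset(), so printing it mid-epoch still reflects the early high-loss steps, while .val is just the loss of the most recent batch.

```python
# A minimal AverageMeter in the spirit of the one used in traintest.py.
class AverageMeter:
    def __init__(self):
        self.reset()

    def reset(self):
        self.val, self.sum, self.count, self.avg = 0.0, 0.0, 0, 0.0

    def update(self, val, n=1):
        self.val = val                      # most recent value
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count    # running mean since the last reset()

loss_meter = AverageMeter()
# Simulate a training loss that drops quickly over an epoch:
for loss in [0.7, 0.5, 0.2, 0.05, 0.01, 0.01]:
    loss_meter.update(loss)

print(loss_meter.avg)  # ~0.245: still dominated by the early high-loss steps
print(loss_meter.val)  # 0.01:   the current training loss
```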