YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

The accuracy following esc50 Recipe is very low #17

Closed (nikhilbyte closed this issue 2 years ago)

nikhilbyte commented 2 years ago

There must be some mistake from my side. Can someone help me identify it?

[Screenshot 2021-08-29 at 5 42 13 PM]

This is how I'm training:

!python -W ignore /content/ast/src/run.py --model ast --dataset esc50 \
--data-train /content/data/datafiles/esc_train_data_1.json --data-val /content/data/datafiles/esc_eval_data_1.json --exp-dir /content/expdir/fold1 \
--label-csv /content/ast/egs/esc50/data/esc_class_labels_indices.csv --n_class 50 \
--lr 1e-5 --n-epochs 25 --batch-size 12 --save_model False \
--freqm 24 --timem 96 --mixup 8 --bal None \
--tstride 10 --fstride 10 --imagenet_pretrain True --audioset_pretrain True

YuanGongND commented 2 years ago

Hi there,

What I noticed is that mixup should be in the range [0, 1], so 8 is out of range. For ESC-50 specifically, we use the cross-entropy loss, so you should set mixup=0. If you do want to use mixup, change the loss in traintest.py to BCELoss, but we recommend setting mixup=0. You also used a smaller batch size, which might have some impact (in that case, you need to use a smaller learning rate). There might be other issues as well, since the model is not being trained at all.
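To illustrate why an out-of-range mixup value and a cross-entropy loss do not combine well, here is a minimal sketch. It assumes that mixup acts as a probability of blending two samples, which the [0, 1] range suggests. The helper names (`one_hot`, `mix_labels`, `validate_mixup_rate`, `should_mix`) are hypothetical and are not taken from the repository.

```python
import random


def one_hot(index, n_class):
    """Return a one-hot label vector of length n_class."""
    return [1.0 if i == index else 0.0 for i in range(n_class)]


def mix_labels(label_a, label_b, lam):
    """Blend two label vectors with weight lam, as mixup does for targets."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


def validate_mixup_rate(rate):
    """Reject mixup rates outside [0, 1] (assumed to be a probability)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"mixup must be in [0, 1], got {rate}")
    return rate


def should_mix(rate, rng):
    """Decide whether to mix this sample; any rate >= 1 always mixes."""
    return rng.random() < rate


# A mixed target has two non-zero entries, so it is no longer a single
# class index, which is what a plain cross-entropy loss expects.
mixed = mix_labels(one_hot(3, 5), one_hot(1, 5), lam=0.7)
print(mixed)

# With rate = 8, every sample would be mixed.
rng = random.Random(0)
print(all(should_mix(8, rng) for _ in range(1000)))
```

Under this assumption, a soft target such as `mixed` fits a loss that accepts per-class targets (for example BCELoss), whereas setting mixup=0 keeps every target one-hot and compatible with cross-entropy.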

Anyway, the easiest way to reproduce the result is to use the hyper-parameters in the recipe. If you don't have enough GPU memory (we use 4 × 12 GB GPUs), it is better to set tstride and fstride to 16 (which saves a lot of GPU memory) and keep everything else unchanged. You won't get the same result as in the paper with larger strides, but it should be similar.
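A rough way to see why larger strides save memory is to count the patches the transformer has to process. The sketch below assumes 16×16 patches on a spectrogram with 128 mel bins and 512 time frames. The 512-frame input length for ESC-50 is an assumption, and the function names are hypothetical.

```python
def patches_per_axis(size, patch=16, stride=10):
    """Number of patch positions along one axis (no padding)."""
    return (size - patch) // stride + 1


def num_patches(n_mels, n_frames, fstride, tstride, patch=16):
    """Total patches for a spectrogram of n_mels x n_frames."""
    freq_positions = patches_per_axis(n_mels, patch, fstride)
    time_positions = patches_per_axis(n_frames, patch, tstride)
    return freq_positions * time_positions


# Assumed input size: 128 mel bins x 512 frames.
overlapping = num_patches(128, 512, fstride=10, tstride=10)  # 12 * 50 = 600
non_overlapping = num_patches(128, 512, fstride=16, tstride=16)  # 8 * 32 = 256

print(overlapping, non_overlapping)
# Self-attention cost grows roughly with the square of the sequence length.
print((overlapping / non_overlapping) ** 2)
```

Under these assumptions the sequence shrinks from 600 to 256 patches, so the attention maps become several times smaller. That is consistent with the advice to use a stride of 16 on smaller GPUs.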

-Yuan

YuanGongND commented 2 years ago

Also, are you using our prep_data.py? It converts the audio to a 16 kHz sampling rate; the original is 44.1 kHz.
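One quick way to confirm that the prepared audio really is at 16 kHz is to read the sample rate from each WAV header. This is a minimal sketch using only Python's standard `wave` module. The function names and the idea of passing in a list of paths are assumptions for illustration, not part of prep_data.py.

```python
import wave


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def find_unexpected_rates(paths, expected_hz=16000):
    """Return {path: rate} for every file whose rate differs from expected_hz."""
    mismatches = {}
    for path in paths:
        rate = wav_sample_rate(path)
        if rate != expected_hz:
            mismatches[path] = rate
    return mismatches
```

An empty result from `find_unexpected_rates` over the prepared files would suggest the resampling step ran. Any 44100 entries would point to files that are still at the original rate.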

nikhilbyte commented 2 years ago

Hello Yuan, Thanks for the quick reply.

> Also, are you using our prep_data.py? It converts the audio to a 16 kHz sampling rate; the original is 44.1 kHz.

Yes, I used prep_data.py without changing anything.

Changing mixup to the default value helped raise the accuracy. Thanks a lot.

YuanGongND commented 2 years ago

Great to know. You can also compare your log with our log file to see if everything is going as expected.

nikhilbyte commented 2 years ago

Yes, I did compare it with the log files you provided. Thanks.