YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Overfitting? test AST on GTZAN #62

Open kelvinqin opened 2 years ago

kelvinqin commented 2 years ago

Hi Dear Gong Yuan, Thanks for your excellent work, I learnt a lot.

After understanding your code (partially), I tested it on GTZAN using my own training framework (I just copied your ast_model.py).

GTZAN has 2000 music clips of roughly 30 s each. My approach is to chop each clip into 2 s segments to construct my training and test sets (70% of the clips go into training, the remaining 30% into testing).

I tested both the ImageNet-pretrained and ImageNet + AudioSet-pretrained models; accuracy looks comparable, but both reach a troublesome situation --- the test loss keeps increasing while the training loss decreases. It looks like typical overfitting.

Not sure if you have met the same situation when working on ESC-50 or Speech Commands? Let me paste my loss curves and accuracy here.
1> test loss vs. training loss: [loss curve image]

2> test accuracy: [accuracy curve image]

This accuracy looks competitive but not SOTA, because with another CNN approach I can easily reach 92% accuracy on the same dataset.

My dataloader is very similar to yours: torchaudio to extract 128-dim fbanks, spec_augmentation, (0, 0.5) normalization, etc.
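For readers following along, here is a minimal sketch of this kind of fbank extraction, using torchaudio's Kaldi-compatible frontend as the AST dataloader does; the exact parameters Kelvin used are not shown in the thread, and the file path is a placeholder.

```python
import torchaudio

# Load one clip and compute a 128-dim log-Mel filterbank with a 10 ms frame shift.
waveform, sr = torchaudio.load('example.wav')   # placeholder path
waveform = waveform - waveform.mean()           # remove DC offset
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
print(fbank.shape)                              # [num_frames, 128]
```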

I am using the same configuration as you suggested: batch size 48, LR 1e-04 or 1e-05, etc.

I am not so worried about the accuracy; what worries me more is how to make the test loss decrease.

Thanks! Kelvin

kelvinqin commented 2 years ago

Dear Gong Yuan, I have a typo in my previous post: in "I can reach 92% accuracy easily on the same dataset", the 92% is the accuracy on whole 30 s clips. (With my AST model, transfer-learned from the ImageNet + AudioSet pretrained model, I got 90% at the clip level.)

Best regards, Kelvin

YuanGongND commented 2 years ago

Hi Kelvin,

First, you have 2000 clips of 30 s each and then crop each to 2 seconds? Why not crop to 10 s? Our model is trained on 10 s and should work well in that setting.

How did you instantiate your AST model? What is the t_dim? For 2 seconds it should be ~200; for 10 s it should be 1024.
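To make that concrete, here is a sketch of how input_tdim would differ between the two crop lengths, assuming the ASTModel class from this repo (the module name and import path may differ in your local copy; frame counts assume a 10 ms frame shift).

```python
from ast_models import ASTModel  # adjust to your local copy of the model file

# ~2 s of audio at a 10 ms frame shift -> roughly 200 frames
ast_2s = ASTModel(label_dim=10, input_tdim=200, input_fdim=128,
                  fstride=10, tstride=10,
                  imagenet_pretrain=True, audioset_pretrain=True)

# ~10 s of audio -> 1024 frames (the length used for AudioSet pretraining)
ast_10s = ASTModel(label_dim=10, input_tdim=1024, input_fdim=128,
                   fstride=10, tstride=10,
                   imagenet_pretrain=True, audioset_pretrain=True)
```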

When the input length is small, you will need to modify the SpecAugment parameters. But I feel it would be better to crop to 10 s.

And what is your learning rate scheduler? Have you tried using a smaller learning rate?

I'd suggest using your SOTA CNN hyperparameters (e.g., batch size) and recipe (e.g., clip-level eval), but keep the input normalization the same as ours and search over the initial learning rate and learning rate scheduler.

AST does converge faster (i.e., overfit faster) and needs a smaller learning rate. On ESC-50 (also 2000 samples), when the learning rate/learning rate scheduler is set correctly, the test performance always improves, see https://github.com/YuanGongND/ast/blob/master/egs/esc50/exp/test-esc50-f10-t10-pTrue-b48-lr1e-5/result.csv

-Yuan

YuanGongND commented 2 years ago

FYI - in my own experiments, I use exactly the same hyperparameters for CNN and AST except for the learning rate (AST uses a 10x smaller LR), see https://arxiv.org/pdf/2203.06760.pdf table 2.

kelvinqin commented 2 years ago

Dear Gong Yuan, Thanks so much for your lightning-fast response :-)

For 2 s, I initialized the AST model with input_tdim = 192. Yes, I will follow your suggestion and switch to 10 s.

About spec_aug: yes, I thought of that but did not try it due to my slow machine. Currently I am using a frequency mask of 24 and a time mask of 48, but I think you are right, I should use a smaller time mask (maybe 12?).

On LR and scheduler, I am trying 1e-06 now. My scheduler is the same as suggested in your paper --- keep the original LR for the first 5 epochs (I am using 1e-04 with the ImageNet-pretrained model, 1e-05 with the ImageNet + AudioSet-pretrained model), and after that decay it to 0.85 x the previous LR each epoch.
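That schedule can be sketched in PyTorch as follows (hold the initial LR for the first 5 epochs, then multiply by 0.85 each epoch); model, num_epochs, and train_one_epoch are placeholders for the user's own training code.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # or 1e-4 for ImageNet-only pretraining
# No decay for epochs 0-4; from epoch 5 onward the LR is multiplied by 0.85 every epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=list(range(5, 26)), gamma=0.85)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # placeholder for the actual training loop
    scheduler.step()
```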

On normalization: yes, I noticed the importance of normalization. I ran "get_stats" to calculate my norm_mean and norm_std (GTZAN gives norm_mean = -2.4281502, norm_std = 2.9490466).

Actually I have a silly question on your normalization implementation: as we know, each fbank frame is a 128-dimensional vector, so when calculating its mean and std, why not calculate a per-dimension mean and std? For example, cur_mean = torch.mean(audio_input, dim=0), cur_std = torch.std(audio_input, dim=0). Instead, you just calculate a scalar mean and scalar std?
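To make the two options concrete, here is a minimal sketch (audio_input is a placeholder [time, 128] fbank tensor; this is illustration, not code copied from the repo).

```python
import torch

audio_input = torch.randn(1024, 128)     # placeholder fbank: [num_frames, num_mel_bins]

# Scalar stats: a single mean/std over all time-frequency values
# (what the dataset-level norm_mean / norm_std above correspond to).
scalar_mean, scalar_std = audio_input.mean(), audio_input.std()
normed_scalar = (audio_input - scalar_mean) / scalar_std

# Per-dimension stats: one mean/std per mel bin, as asked about here.
vec_mean = audio_input.mean(dim=0)       # shape [128]
vec_std = audio_input.std(dim=0)         # shape [128]
normed_per_bin = (audio_input - vec_mean) / vec_std
```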

Yes, you are right, AST converged really fast; after one epoch I got 80% accuracy, as you can see from my log:

---------------AST Model Summary---------------
ImageNet pretraining: True, AudioSet pretraining: True
frequency stride=10, time stride=10
number of patches=216
Training ...
100%|██████████| 679/679 [05:17<00:00, 2.14it/s]
epoch: 0, elapsed: 363.70125126838684, loss tr: 319.794189453125, loss_dev: 26.83715057373047, lr: 1e-05, acc: 80.21939136588819
100%|██████████| 679/679 [05:18<00:00, 2.13it/s]
epoch: 1, elapsed: 365.1444020271301, loss tr: 110.05419921875, loss_dev: 25.748031616210938, lr: 1e-05, acc: 82.09483368719037
100%|██████████| 679/679 [05:18<00:00, 2.13it/s]
epoch: 2, elapsed: 365.36025762557983, loss tr: 69.80587768554688, loss_dev: 25.884685516357422, lr: 1e-05, acc: 81.52866242038216
100%|██████████| 679/679 [05:20<00:00, 2.12it/s]
epoch: 3, elapsed: 366.29729437828064, loss tr: 57.776466369628906, loss_dev: 26.530099868774414, lr: 1e-05, acc: 82.44869072894551

Thanks for your great work, and especially thanks for all the nice suggestions. I will report back my findings.

Kelvin

kelvinqin commented 2 years ago

Dear Gong Yuan, It seems you are correct: after switching to 1e-06, it seems to be on its way now :-) But I need to go to sleep now; I will check in the morning. Thanks so much.

Training ...
100%|██████████| 679/679 [05:21<00:00, 2.11it/s]
epoch: 0, elapsed: 368.61199736595154, loss tr: 762.400146484375, loss_dev: 30.097900390625, lr: 1e-06, acc: 75.9377211606511
100%|██████████| 679/679 [05:17<00:00, 2.14it/s]
epoch: 1, elapsed: 364.9519441127777, loss tr: 365.77099609375, loss_dev: 24.98992919921875, lr: 1e-06, acc: 79.15782024062278
100%|██████████| 679/679 [05:20<00:00, 2.12it/s]
epoch: 2, elapsed: 366.54775309562683, loss tr: 271.9261474609375, loss_dev: 23.39752197265625, lr: 1e-06, acc: 80.5378627034678
100%|██████████| 679/679 [05:24<00:00, 2.09it/s]
epoch: 3, elapsed: 370.4417667388916, loss tr: 217.48388671875, loss_dev: 24.134933471679688, lr: 1e-06, acc: 81.28096249115357
100%|██████████| 679/679 [05:27<00:00, 2.07it/s]
epoch: 4, elapsed: 374.67887783050537, loss tr: 179.55203247070312, loss_dev: 22.4140625, lr: 1e-06, acc: 81.17480537862703

Kelvin

YuanGongND commented 2 years ago

Thanks.

From the learning curve, 1e-6 seems not to lead to performance over 90%.

Trimming to 2 s vs. 10 s makes the task difficulty different. I would still suggest trying 10 s, getting the norm stats for the 10 s audios (likely different from the 2 s ones), and searching over the learning rate.

> As we know, each fbank frame is a 128-dimensional vector, so when calculating its mean and std, why not calculate a per-dimension mean and std? For example, cur_mean = torch.mean(audio_input, dim=0), cur_std = torch.std(audio_input, dim=0). Instead, you just calculate a scalar mean and scalar std?

I guess that would change the distribution and structure of the dataset.

-Yuan

YuanGongND commented 2 years ago

Or, you can test your CNN baseline model with the 2s audios and compare the accuracy. In general, I think keeping the training/eval pipeline consistent makes a more fair comparison.

kelvinqin commented 2 years ago

Dear Gong Yuan, Finally I saw this result. Just a quick update (1e-06 on 2 s): I am happy that I see a reasonable dev-loss curve, but as you predicted, it did not lead to great performance: [loss/accuracy curve image] I will report the 10 s result soon. Thanks so much, Kelvin

kelvinqin commented 2 years ago

Dear Gong Yuan, It took me some effort to do the 10 s trial. I followed most of your ESC-50 settings. Here is the current setup:
1> chop each 30 s utterance into 10 s segments
2> enable warm-up
3> use the ImageNet + AudioSet pretrained model
4> initial LR = 1e-04 (should I try 1e-05?)
5> scheduler: torch.optim.lr_scheduler.MultiStepLR(optimizer, list(range(5,26)), gamma=0.85)
6> norm stats: norm_mean = -3.2265673, norm_std = 3.1692033 (calculated on GTZAN 10 s data)
7> t_dim: 1024
8> AST initialization: ASTModel(input_tdim=1024, label_dim=10, imagenet_pretrain=True, audioset_pretrain=True, fstride=10, tstride=10)
9> mixup = 0.0 (no mixup)
10> spec_augmentation enabled: freqm = 48, timem = 192
11> no add_noise

Here is my result --- I still see overfitting (not sure if you experienced the same on small datasets?): [loss curve image]

Thanks a lot for your help and insight! Kelvin

kelvinqin commented 2 years ago

Dear Gong Yuan, In order to resolve the "overfitting" problem, I have two tiny things to try:
1> add dropout:
self.mlp_head = nn.Sequential(nn.LayerNorm(self.original_embedding_dim), nn.Dropout(0.5), nn.Linear(self.original_embedding_dim, label_dim))
2> use a smaller initial LR: 1e-05

Thanks for any suggestions, Kelvin

YuanGongND commented 2 years ago

I think it is worth trying a smaller learning rate.

Adding more time shifting as data augmentation might improve the performance a little bit (https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/dataloader.py#L205), but the gain wouldn't be dramatic.
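The kind of time shifting the linked dataloader line applies can be sketched roughly as follows (the shift range is illustrative, not necessarily what the repo uses).

```python
import numpy as np
import torch

def random_time_shift(fbank: torch.Tensor) -> torch.Tensor:
    """Randomly roll a [time, freq] fbank along the time axis (circular shift)."""
    num_frames = fbank.shape[0]
    shift = np.random.randint(-num_frames, num_frames)
    return torch.roll(fbank, shift, dims=0)
```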

You can also try https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L56 to find the optimal learning rate scheduler.

What is your loss function? If it is a single-label classification problem (each clip has only one label), you can also try CE loss instead of BCE.

In general, I am wondering how you test the CNN model that achieves 92%? Is it possible to train AST with a similar setting, except using a smaller learning rate and trimming the audio to 10 s?

kelvinqin commented 2 years ago

Dear Yuan, You never sleep :-)

Thanks so much for your guidance,

My loss function is BCEWithLogits, but previously I was using CE; I think they are equivalent for my 10-class, single-label music genre classification problem. To switch from CE to BCEWithLogits, I copied your code to convert scalar labels to label vectors, as follows:

label_indices = np.zeros(self.label_num)
label_indices[self.target[index]] = 1.0
label_indices = torch.from_numpy(label_indices).float()
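For a 10-class single-label problem like this, the two losses being compared would look roughly as follows (a sketch; the batch size and tensors are placeholders).

```python
import torch
import torch.nn as nn

logits = torch.randn(48, 10)                 # [batch, num_classes]
class_ids = torch.randint(0, 10, (48,))      # integer genre labels

# CE takes integer class indices directly.
ce_loss = nn.CrossEntropyLoss()(logits, class_ids)

# BCEWithLogits needs one-hot targets, hence the conversion shown above.
one_hot = torch.zeros(48, 10).scatter_(1, class_ids.unsqueeze(1), 1.0)
bce_loss = nn.BCEWithLogitsLoss()(logits, one_hot)
```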

The way I achieve 92% with the CNN is:
1> chop each 30 s utterance into 2 s segments
2> build a CNN to classify each 2 s segment
3> do majority voting over the segments when decoding a 30 s utterance at runtime (a sketch of this voting step follows below)
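A minimal sketch of that voting step (segment_logits is a placeholder tensor of per-segment model outputs for one utterance):

```python
import torch

def clip_prediction(segment_logits: torch.Tensor) -> int:
    """Majority vote over per-segment predictions; input shape [num_segments, num_classes]."""
    seg_preds = segment_logits.argmax(dim=1)           # predicted class per segment
    return int(torch.mode(seg_preds).values.item())    # most frequent class wins
```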

So for the AST solution I am following the same logic.

Have a nice day, Kelvin

kelvinqin commented 2 years ago

Dear Gong Yuan, It seems the 1e-05 LR works; I am crossing my fingers now ...

[loss/accuracy curve image]

Thanks! Kelvin

YuanGongND commented 2 years ago

Great to know, thanks.

I just thought CE is the more "correct" loss for single-label classification.

https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L71

Anyway, I won't have time to check the details. Good luck with your project.

The reason I suggest using the same pipeline is that in my experiments I almost always find AST is slightly better than CNNs (my PSLA models), so I would be surprised otherwise.

kelvinqin commented 2 years ago

Dear Gong Yuan, I will do more experiments to verify that with a head-to-head comparison. Actually there are some minor differences in my experiments:
1> for the CNN, I use torchlibrosa to extract the log-mel spectrogram and do spec-augmentation
2> for AST, I follow your approach and call torchaudio

One more question on my mind --- when building the 10 s AST model, I also tried mixup, but it seems to make the model hard to converge. I noticed that you only use mixup on AudioSet and Speech Commands; how do you decide whether to apply mixup in your research work in general?

I will make a fair comparison soon. Thanks so much, I learned a lot from your paper and from chatting with you :-) Kelvin

YuanGongND commented 2 years ago

Mixup can sometimes dramatically improve the performance (see my PSLA paper). I didn't use it with CE loss because I thought that might make the optimization harder, but I think it is worth a try since you have already moved to BCE.

kelvinqin commented 2 years ago

Dear Gong Yuan,

Finally I have made AST competitive with the CNN on the GTZAN dataset :-) but I do believe AST can outperform the CNN, because there are so many things I still want to try with AST.

As a summary, I think the key things to make AST work on 10 s segments are:
1> LR = 1e-05
2> disable spec-augmentation (it also works with spec-augmentation enabled, with a little degradation)
3> optimizer:
#self.optimizer = torch.optim.Adam(parameters(), lr=lr, weight_decay=1e-3)
self.optimizer = torch.optim.Adam(parameters(), lr=lr, weight_decay=5e-7, betas=(0.95, 0.999))
I had not noticed that you use a smaller weight_decay; previously I was always using 1e-3
4> normalization really matters
5> disable mixup
6> disable add_noise
7> self.scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, list(range(5,26)), gamma=0.85)
8> maybe more, which I cannot remember in full

It is a little strange that disabling spec-augmentation performs a little better, because data augmentation should usually help, but I think maybe it is because GTZAN's test set is too small (101 utterances of 30 s). I will test mixup and spec-augmentation on other datasets; otherwise I might draw a wrong conclusion from random effects. Do you have any suggestion on which music-related datasets I could play with? Thanks for your recommendation. (I have also heard that GTZAN has some labeling defects.)

Here is the result, thanks so much: [loss/accuracy curve image]

Thanks so much for all the kind help you have provided,

Have a nice day, Kelvin

kelvinqin commented 2 years ago

Dear Gong Yuan,

I forgot one thing --- with the above 10 s segment-based classifier, I tested accuracy at the 30 s utterance level using the majority voting approach (3 segments with a 10 s shift, or more segments with a smaller shift); the accuracy is 91% :-)

With the CNN I got 92%, but that may be due to a lot of tuning work. Anyway, it is really comparable, and I am very happy with all the effort because AST converges so quickly. Thanks for your work!

Have a nice day, Kelvin

YuanGongND commented 2 years ago

Thanks, I learned a lot from your experience.

There are certainly many things to tune, but one thing I noticed is that you can actually input the entire 30 s to AST without majority voting; to solve the memory issue, you can set tstride and fstride to 16 (instead of 10). Another factor might be the batch size; you can try searching over that.
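A hedged sketch of that suggestion, assuming a 10 ms frame shift (so 30 s is roughly 3000 frames). Note this sketch uses ImageNet pretraining only, since the AudioSet-pretrained checkpoint in this repo may require stride 10; the module name is whatever your local copy uses.

```python
from ast_models import ASTModel  # adjust to your local copy of the model file

# Whole 30 s clips: stride 16 gives non-overlapping 16x16 patches,
# so fewer tokens and lower memory than stride 10.
ast_30s = ASTModel(label_dim=10, input_tdim=3000, input_fdim=128,
                   fstride=16, tstride=16,
                   imagenet_pretrain=True, audioset_pretrain=False)
```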

I don't know much about music; PANNs (https://arxiv.org/pdf/1912.10211.pdf) report 91.5% on GTZAN, so your CNN is actually quite strong.

kelvinqin commented 2 years ago

Dear Gong Yuan, Today I have been fighting with my model the whole day --- I read your README section on the pretrained models, where you say "Please note that we use 16kHz audios for training and test (for all AudioSet, SpeechCommands, and ESC-50), so if you want to use the pretrained model, please prepare your data in 16kHz."

Then I realized that maybe all my previous experiments were wrong, because my signal is at 22050 Hz :-(

So I started to downsample my GTZAN data to 16 kHz and tried to repeat my experiment, but so far I have not matched the previous performance. (What I got using the 16 kHz data is 88%, worse than yesterday's "wrong" 22 kHz experiment, which gave 91%.)
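A minimal sketch of that downsampling step with torchaudio (file paths are placeholders):

```python
import torchaudio

waveform, sr = torchaudio.load('gtzan_clip.wav')   # GTZAN ships at 22050 Hz
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
waveform_16k = resampler(waveform)
torchaudio.save('gtzan_clip_16k.wav', waveform_16k, 16000)
```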

Still fine-tuning the LR for 16 kHz now.

BTW, yesterday you mentioned that I can use:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=args.lr_patience, verbose=True)

to find the optimal LR; could you please elaborate a little on how?

Thanks! Have a nice day, Kelvin

YuanGongND commented 2 years ago

> Then I realized that maybe all my previous experiments were wrong, because my signal is at 22050 Hz :-(
>
> So I started to downsample my GTZAN data to 16 kHz and tried to repeat my experiment, but so far I have not matched the previous performance. (What I got using the 16 kHz data is 88%, worse than yesterday's "wrong" 22 kHz experiment, which gave 91%.)
>
> Still fine-tuning the LR for 16 kHz now.

Without AudioSet pretraining, a higher sampling rate is better for music; our AudioSet-pretrained model is trained on 16 kHz audio, so I guess it transfers better when the sampling rate is consistent, but I haven't done any experiments to verify this.

When you use 10s audio input, have you tried to use our AudioSet norm stats?

> BTW, yesterday you mentioned that I can use:
>
> scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=args.lr_patience, verbose=True)
>
> to find the optimal LR; could you please elaborate a little on how?

Just use this scheduler and change scheduler.step() to scheduler.step(acc). Please see the PyTorch documentation.
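Spelled out a little more, the pattern looks like this (train_one_epoch and evaluate are placeholders for the user's own training and validation code):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=2, verbose=True)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # placeholder
    acc = evaluate(model)               # placeholder: validation accuracy
    scheduler.step(acc)                 # halve the LR when accuracy stops improving
```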

kelvinqin commented 2 years ago

Hi Gong Yuan, Thanks so much for your guidance. With half a day's effort, I successfully brought my 16 kHz build to the same level of performance as the 22 kHz build.

The key factors are two: 1> using a smaller LR: 2.5e-06 for the 16 kHz build vs. 1e-05 for the 22 kHz build

2> turning off mixup: in the 22 kHz build, both mixup and spec_augmentation were turned off; in the 16 kHz build, mixup was turned off but spec_augmentation was turned on (I may do another experiment with spec_aug turned off as well, but that has lower priority in my plan). As a summary, the 16 kHz build finally achieved 90.1% accuracy.

In all my experiments I am using your ImageNet + AudioSet pretrained model.

I did apply AudioSet's norm stats in one of my experiments; my finding is that the performance is worse than with my own statistics. Looking at the values, I found that my norm_mean and norm_std are significantly different from AudioSet's:
1> AudioSet's norm stats: norm_mean = -4.2677393, norm_std = 4.5689974

2> my norm stats: norm_mean = -3.7041707, norm_std = 3.356901

Question about mixup: in all my experiments (22 kHz or 16 kHz) I seldom saw it help. I also noticed that you only turn on mixup in the Speech Commands experiment; can you share your experience/point of view on whether mixup is useful, how to make it useful, and how you decide whether to use it? Thanks!

My guess is that mixup should be used very carefully, especially regarding how frequently it is applied --- in your code, mixup is called with 50% probability. Maybe I should try a smaller percentage because GTZAN only has 2000 utterances, say applying mixup to 20% of the data?

Thanks for your hints on how to search for the optimal LR using ReduceLROnPlateau; I will do some experiments with it soon. Looking back at my endless runs over the past two days, one lesson learned is that setting a correct LR is crucial, and I have been searching for the optimal LR manually, which I guess is far less efficient :-(.

Thanks so much, and have a nice day, Kelvin

1244547821 commented 1 year ago

> (quotes Kelvin's comment above in full)

A simple question: how do you calculate the mean and std?