YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Wonderful work! questions about feature size #51

Closed: lijuncheng16 closed this issue 2 years ago

lijuncheng16 commented 2 years ago

Hi there: Thank you for open-sourcing this implementation! It is very inspiring to see timm work in the audio setting.

Q: I tried the pipeline with a smaller feature size, e.g. 64x400, which yields 39x5 patches, and AST got stuck at 0.01 mAP. Upsampling to your feature size of 128x1024 brought it up to 0.10 mAP. I guess your intuition is to "take advantage of" the 384x384 positional embedding (originally 576 patches), so 1212 patches is roughly 2x the 576 patches. I'm still curious whether there is a way to do this with a smaller feature dimension.
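
(For reference, my patch counts above follow from the 16x16 patch split with stride 10 in both dimensions described in the paper; a quick sanity check:)

```python
def n_patches(dim, patch=16, stride=10):
    """Number of sliding-window patch positions along one axis."""
    return (dim - patch) // stride + 1

# your default AudioSet input: 128 mel bins x 1024 frames
print(n_patches(128), n_patches(1024))  # 12, 101 -> 12 * 101 = 1212 patches

# my smaller input: 64 mel bins x 400 frames
print(n_patches(64), n_patches(400))    # 5, 39   -> 5 * 39  = 195 patches
```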

lijuncheng16 commented 2 years ago

Some benchmarks on a 4x V100 cluster: it takes about 48 hours to finish 5 epochs on the full AudioSet and reach roughly 0.44 mAP without ensembling.

YuanGongND commented 2 years ago

Hi, my feeling is that there's something wrong with your implementation. While I think using 64 frequency bins would lead to slightly worse performance, I don't believe it would fail completely.

A few comments:

  1. When you say 64x400, do you mean AudioSet? If it is a new dataset, do you have a baseline model and a rough idea of what number to expect? If it is AudioSet, I don't think you should trim the time dimension, so something like 64x1000 is more reasonable.
  2. It is not likely due to the positional embedding, because we also tested on Speech Commands, which is 128x100, and it works quite well.
  3. Are you using the AudioSet pretrained model? If so, note that the model is pretrained with 128 frequency bins and only transfers if your task also uses 128 frequency bins.
  4. I would suggest trying the ESC-50 recipe with our pipeline but changing the frequency dimension to 64; that should be a quick experiment (see the sketch after this list).
  5. We use 4x Titan X GPUs; training 5 epochs on the full AudioSet takes about a week.
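
For point 4, a minimal sketch of what I mean (assuming the `ASTModel` arguments as documented in this repo's README; the exact values for ESC-50 are set in its recipe script):

```python
import torch
from src.models import ASTModel

# ESC-50-style model, but with 64 frequency bins instead of the default 128.
# audioset_pretrain must be False here, since the AudioSet checkpoint
# expects 128 frequency bins (see point 3 above).
model = ASTModel(label_dim=50, fstride=10, tstride=10,
                 input_fdim=64, input_tdim=512,
                 imagenet_pretrain=True, audioset_pretrain=False)

x = torch.rand(2, 512, 64)   # (batch, time frames, frequency bins)
logits = model(x)            # -> shape (2, 50)
```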

-Yuan

YuanGongND commented 2 years ago

And yes, normalization is important. Normalizing the input to N(0, 1), N(0, 0.5), or N(0, 0.25) leads to similar performance when you train the model from scratch with ImageNet pretraining, but if you want to use our AudioSet pretrained model, please stick to our original normalization, just for consistency.
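
For reference, the dataset-level normalization in our dataloader is just a global mean/std rescaling; a sketch (the mean/std values below are placeholders, the actual AudioSet statistics are in our recipe):

```python
import torch

# placeholder dataset statistics -- compute these over your own training set;
# the AudioSet values we use are given in this repo's dataloader/recipe
norm_mean, norm_std = -4.27, 4.57

fbank = torch.randn(1024, 128)                # a (time, freq) log-mel spectrogram
fbank = (fbank - norm_mean) / (2 * norm_std)  # maps input to roughly N(0, 0.5)
```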

Further, our training pipeline includes a number of small techniques (e.g., normalization), so if you want to reproduce exactly the same results as ours, please use our training recipe, which is fully open in this repo.

lijuncheng16 commented 2 years ago

Thank you for your prompt response!

YuanGongND commented 2 years ago

Yes, input normalization might be the trick you need, and ImageNet pretraining is crucial.

lijuncheng16 commented 2 years ago

Update: normalization is indeed the "trick". I now have a model that gets to at least 0.244 mAP. There is a catch: my original features were normalized per mel channel (in my case, 64 mels), whereas your normalization is over the entire dataset, agnostic of mel channels. Therefore, your features' standard deviation is theoretically 1/sqrt(n) of mine, where n is the number of mels. Everything should be zero mean, so no effect there. Indeed, the current working model uses my features divided by 8. What a surprising number, but it makes sense. (P.S. Dividing my features by 2 didn't work, which prompted some extra thinking!)
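
To make the scale difference concrete, a rough sketch with made-up features (`n_mels = 64` as in my setup):

```python
import torch

n_mels = 64
# fake log-mel features whose per-channel scales differ a lot
x = torch.randn(10000, n_mels) * torch.linspace(0.1, 5.0, n_mels)

# what I did: normalize each mel bin separately -> zero mean, unit std per channel
per_channel = (x - x.mean(0)) / x.std(0)

# what this repo does: one mean/std over the whole dataset, agnostic of channels
global_norm = (x - x.mean()) / x.std()

# the fix that worked for me: shrink my per-channel features by sqrt(n_mels)
rescaled = per_channel / n_mels ** 0.5   # i.e., divided by 8 for 64 mels
```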
