Some benchmarks: on a cluster of 4 V100 GPUs, it takes about 48 hours to finish 5 epochs on the full AudioSet and reach ~0.44 mAP without ensembling.
Hi, my feeling is that there's something wrong with your implementation. While I think using 64 frequency bins would lead to slightly worse performance, I don't believe it would fail completely.
A few comments:
-Yuan
Thank you for your prompt response!
Yes, input normalization might be the trick you need, and ImageNet pretraining is crucial.
Update: normalization indeed is the "trick". Now I have a model that gets to at least 0.244 mAP. There's a catch: my original feature was normalized per mel channel (in my case there are 64 mels), whereas your normalization is over the entire dataset, agnostic of mel channels. Therefore, your feature's standard deviation is theoretically 1/sqrt(n) of my feature's, where n is the number of mels. Everything should be zero-mean, so no effect there. Indeed, the current working model is trained on my feature divided by 8 = sqrt(64). A surprising number at first, but it makes sense. (P.S. My feature divided by 2 didn't work, which forced some extra thinking!)
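For concreteness, here is a toy numpy sketch (not from either codebase; shapes and statistics are made up) contrasting the two normalization schemes discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy log-mel batch with per-channel scale differences:
# (num_clips, num_mels, num_frames)
scales = rng.uniform(0.5, 8.0, size=(1, 64, 1))
feats = rng.normal(size=(16, 64, 400)) * scales - 4.0

# Scheme A: per-mel-channel normalization (one mean/std per mel bin,
# computed over clips and frames).
mean_c = feats.mean(axis=(0, 2), keepdims=True)  # shape (1, 64, 1)
std_c = feats.std(axis=(0, 2), keepdims=True)
per_channel = (feats - mean_c) / std_c           # each channel ~N(0, 1)

# Scheme B: dataset-level normalization (a single mean/std over
# everything), agnostic of mel channels.
global_norm = (feats - feats.mean()) / feats.std()

# The two schemes distribute variance across channels differently, so a
# model trained under one needs rescaled inputs under the other.
print(per_channel.std(axis=(0, 2))[:4])  # all ~1.0
print(global_norm.std(axis=(0, 2))[:4])  # varies per channel
```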
And yes, normalization is important. Normalizing the input to N(0, 1), N(0, 0.5), or N(0, 0.25) leads to similar performance when you train the model from scratch with ImageNet pretraining, but if you want to use our AudioSet pretrained model, please stick to our original normalization, just for consistency.
Further, our training pipeline includes a number of small tricks (e.g., normalization), so if you want to reproduce exactly the same results as ours, please use our training recipe, which is fully open in this repo.
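For reference, a minimal sketch of the kind of dataset-level normalization the recipe uses, to my understanding; the exact front-end flags and statistics should be verified against the repo's dataloader.py:

```python
import torchaudio

waveform, sr = torchaudio.load("clip.wav")  # hypothetical 16 kHz mono clip
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sr,
    num_mel_bins=128,
    frame_shift=10.0,
    htk_compat=True,
    use_energy=False,
    window_type="hanning",
    dither=0.0,
)  # -> (num_frames, 128) log-mel filterbank

# Dataset-level stats; the values below are the AudioSet numbers I believe
# this repo ships in its dataloader -- double-check before relying on them.
norm_mean, norm_std = -4.2677393, 4.5689974
fbank = (fbank - norm_mean) / (2 * norm_std)  # target roughly N(0, 0.5)
```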
Hi there: Thank you for open-sourcing this implementation! It is very inspiring to see timm work in the audio setting.
Q: I tried the pipeline with a smaller feature size, e.g. 64x400, which ends up as 39x5 patches, and AST gets stuck at 0.01 mAP. Upsampling to your feature size of 128x1024 brought it up to 0.10 mAP. I guess your intuition is to "take advantage of" the 384x384 positional embeddings (originally n_patches = 576), so 1212 patches is roughly 2x the 576 patches. I am still curious whether there is a way to do this with a smaller feature dimension.
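On the smaller-feature question above: with 16x16 patches at stride 10 (the pretrained AST setting), the patch-grid arithmetic explains the 39x5 count, and the usual alternative to upsampling the spectrogram is to interpolate the positional-embedding grid to the new shape. A hedged sketch, similar in spirit to (but not copied from) ast_models.py; the 24x24 grid and embed dim 768 correspond to DeiT-384, and `pos_embed` is a random stand-in:

```python
import torch
import torch.nn.functional as F

def patch_grid(input_fdim, input_tdim, patch=16, fstride=10, tstride=10):
    """Patch-grid shape for AST-style overlapping patch embedding."""
    f_dim = (input_fdim - patch) // fstride + 1
    t_dim = (input_tdim - patch) // tstride + 1
    return f_dim, t_dim

print(patch_grid(128, 1024))  # (12, 101) -> 1212 patches
print(patch_grid(64, 400))    # (5, 39)   ->  195 patches

# Instead of upsampling the 64x400 spectrogram, bilinearly interpolate the
# ImageNet positional embeddings (a 24x24 grid for a 384x384 DeiT) down to
# the smaller patch grid.
pos_embed = torch.randn(1, 576, 768)             # (1, 24*24, embed_dim)
f_dim, t_dim = patch_grid(64, 400)
grid = pos_embed.transpose(1, 2).reshape(1, 768, 24, 24)
grid = F.interpolate(grid, size=(f_dim, t_dim),
                     mode="bilinear", align_corners=False)
new_pos_embed = grid.flatten(2).transpose(1, 2)  # (1, 5*39, 768)
print(new_pos_embed.shape)                       # torch.Size([1, 195, 768])
```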