YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Is normalization right? #5

Closed pgzhang closed 2 years ago

pgzhang commented 2 years ago

Hello, I'm new to audio classification, and I want to know whether this normalization is right: fbank = (fbank - self.norm_mean) / (self.norm_std * 2). Should it instead be fbank = (fbank - self.norm_mean) / (self.norm_std ** 2)?

YuanGongND commented 2 years ago

Hi there,

It is a great question.

The more standard method is fbank = (fbank - self.norm_mean) / self.norm_std (not self.norm_std * 2). However, we find that normalizing the input with a smaller std (i.e., dividing by a larger number) can slightly improve performance for the ImageNet-pretrained model, so we normalize the input with (self.norm_std * 2). This is potentially because the distribution of audio spectrograms is not the same as that of ImageNet images, and a smaller std helps transfer learning performance. But it is a minor point: if you train your model from scratch without ImageNet pretraining, I think you can just use standard normalization; if you train with ImageNet pretraining, standard normalization should still be OK, with only a very minor performance decrease. We have done an informal test and found the model performance is not sensitive to the normalization parameters.

However, if you want to use our AudioSet-pretrained model for any downstream task, please keep our normalization (i.e., using 2x self.norm_std), because the AudioSet model was pretrained with the normalized input, and any change to the input scale would lead to a big performance drop.
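To make the two options concrete, here is a minimal sketch of the difference. The mean/std values below are illustrative placeholders, not the repo's actual AudioSet statistics (compute your own dataset-level stats, or check the AST dataloader for the pretrained model's values):

```python
import numpy as np

# Illustrative dataset-level statistics (placeholders, not the repo's
# actual AudioSet values -- compute these over your own training set).
norm_mean = -4.27
norm_std = 4.57

rng = np.random.default_rng(0)
# Fake log-Mel filterbank features drawn to match those stats.
fbank = rng.normal(norm_mean, norm_std, size=(1024, 128))

# Standard z-score normalization: the result has roughly unit std.
fbank_standard = (fbank - norm_mean) / norm_std

# AST's variant: divide by 2x the std, so the normalized input has
# std ~0.5, i.e., roughly N(0, 0.25) as noted later in this thread.
fbank_ast = (fbank - norm_mean) / (norm_std * 2)
```

Note that dividing by std ** 2 (the variance) would be something else entirely; the code in the repo intentionally divides by twice the standard deviation.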

-Yuan

pgzhang commented 2 years ago

I see, thanks for your explanation!

lijuncheng16 commented 2 years ago

OMG! An N(0, 0.25) distribution. I didn't realize this assumption was there.