YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Does the input necessarily need to be normalized according to a certain mean and variance? #57

Closed: Basums closed this issue 2 years ago

Basums commented 2 years ago

Dear Yuan Gong, it is my honor to write to you about your SSAST paper. Your work is a very valuable contribution, but I have a question about input normalization. For this kind of task, is it enough to normalize the input features (e.g., Fbank) using dataset-level statistics only (for example, the AudioSet mean and variance)? When I add per-utterance CMVN instead, the gradients explode, which puzzles me. Does the input necessarily need to be normalized with a specific mean and variance? LayerNorm seems to converge faster, but the fine-tuning result is worse. Could you answer this question for me? I would appreciate it and hope you will write back. Good luck with your studies.

YuanGongND commented 2 years ago

One important reason to do input normalization is that we use an ImageNet-pretrained checkpoint, which was trained on a dataset with 0 mean and 0.5/1 std (note these are dataset-level mean/std; we should not normalize each sample to zero mean and unit std). Doing some kind of normalization should in general help model training, but we haven't explored it in detail.
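For concreteness, here is a minimal sketch of the two normalization styles being discussed, assuming a Kaldi-style log-mel filterbank front end similar to the one in this repo's data loader. The AudioSet mean/std constants below are approximate placeholders, not values taken from this thread.

```python
import torchaudio

# Hypothetical dataset-level statistics (placeholders; the AST repo ships
# AudioSet numbers close to mean -4.27 and std 4.57).
DATASET_MEAN = -4.27
DATASET_STD = 4.57

def load_fbank(wav_path, num_mel_bins=128):
    waveform, sr = torchaudio.load(wav_path)
    # Kaldi-style log-mel filterbank, similar to the AST data loader.
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sr, num_mel_bins=num_mel_bins,
        htk_compat=True, use_energy=False, window_type='hanning', dither=0.0)
    return fbank  # shape: (num_frames, num_mel_bins)

def normalize_dataset_level(fbank):
    # Normalize with fixed dataset statistics (the approach recommended above).
    return (fbank - DATASET_MEAN) / (2 * DATASET_STD)

def normalize_per_sample(fbank):
    # Per-utterance CMVN: zero mean / unit variance per sample.
    # This is not what the ImageNet-pretrained AST expects.
    return (fbank - fbank.mean()) / (fbank.std() + 1e-8)
```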

Basums commented 2 years ago

Have you tried using the same ImageNet pretraining to improve SSAST's performance? What was the effect?

YuanGongND commented 2 years ago

ImageNet pretraining can only be applied to the 16*16 patch-based AST, while SSAST can use patches of any size and shape. Trivially combining ImageNet pretraining and SSAST did not lead to an immediate performance improvement in my experiments.
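A short sketch of why the checkpoint only transfers to the 16*16 case, assuming a standard Conv2d patch embedding (the variable names here are hypothetical): the embedding weight shape is tied to the patch size, so ImageNet ViT weights (16*16 patches) have no counterpart for other patch shapes.

```python
import torch.nn as nn

embed_dim = 768

# 16x16 square patches -> weight shape (768, 1, 16, 16). An ImageNet ViT
# checkpoint (16x16 RGB patches, weight (768, 3, 16, 16)) can be adapted to
# this, e.g. by averaging the three input channels.
patch_embed_16 = nn.Conv2d(1, embed_dim, kernel_size=(16, 16), stride=(16, 16))

# 128x2 frame-style patches (one SSAST option) -> weight shape (768, 1, 128, 2).
# There is no corresponding ImageNet tensor to copy from, so image
# pretraining cannot be reused directly.
patch_embed_128x2 = nn.Conv2d(1, embed_dim, kernel_size=(128, 2), stride=(128, 2))

print(patch_embed_16.weight.shape)     # torch.Size([768, 1, 16, 16])
print(patch_embed_128x2.weight.shape)  # torch.Size([768, 1, 128, 2])
```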