One important reason to do input normalization is that we use an ImageNet-pretrained checkpoint, which was trained on data normalized to 0 mean and 0.5/1 std (note this is the dataset-level mean/std; we should not normalize each sample to zero mean and unit variance). Doing some kind of normalization should generally help model training, but we haven't explored it in detail.
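To make the distinction concrete, here is a minimal NumPy sketch of the two schemes. The constants are placeholders standing in for precomputed dataset-level fbank statistics (e.g., over AudioSet), not verified values from the repo:

```python
import numpy as np

# Placeholder dataset-level fbank statistics (computed once over the
# whole training set, NOT per utterance).
DATASET_MEAN = -4.27
DATASET_STD = 4.57

def normalize_dataset(fbank: np.ndarray) -> np.ndarray:
    """Dataset-level normalization: the same fixed mean/std is applied
    to every sample. Dividing by 2*std targets roughly 0.5 std, matching
    the statistics the ImageNet-pretrained checkpoint expects."""
    return (fbank - DATASET_MEAN) / (2 * DATASET_STD)

def normalize_cmvn(fbank: np.ndarray) -> np.ndarray:
    """Per-utterance CMVN: zero mean and unit variance computed from the
    sample itself -- the scheme the answer above advises against when
    using the pretrained checkpoint."""
    return (fbank - fbank.mean()) / (fbank.std() + 1e-8)
```

Note that dataset-level normalization is a fixed affine map, so it preserves the relative scale of utterances, whereas CMVN rescales every utterance independently and so discards loudness differences across samples.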
Have you tried using the same ImageNet-pretrained model to improve performance for SSAST? What is the effect?
ImageNet pretraining can only be applied to the 16x16 patch-based AST, while SSAST can use patches of any size and shape. Trivially combining ImageNet pretraining with SSAST did not lead to an immediate performance improvement in my experiments.
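To illustrate what "patches of any size and shape" means, here is a minimal NumPy sketch of splitting a (time, frequency) spectrogram into rectangular patches; this is for illustration only, not the repo's implementation (which uses a convolutional patch-embedding layer):

```python
import numpy as np

def split_patches(spec: np.ndarray, pt: int, pf: int) -> np.ndarray:
    """Split a (time, freq) spectrogram into flattened pt x pf patches.

    SSAST can use square patches (e.g. 16x16) or frame-like rectangular
    patches (e.g. 2 frames x 128 mel bins); ImageNet-pretrained AST is
    tied to the 16x16 shape of the image checkpoint.
    """
    T, F = spec.shape
    T, F = T - T % pt, F - F % pf          # drop remainder frames/bins
    spec = spec[:T, :F]
    # (T//pt, pt, F//pf, pf) -> (T//pt, F//pf, pt, pf) -> (N, pt*pf)
    patches = spec.reshape(T // pt, pt, F // pf, pf).transpose(0, 2, 1, 3)
    return patches.reshape(-1, pt * pf)
```

For example, a 1024-frame, 128-bin spectrogram yields 512 patches of dimension 256 under both a 16x16 split and a 2x128 frame-like split; only the patch shape, and hence what the pretrained patch embedding expects, differs.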
Dear Yuan Gong, it is my honor to message you about your SSAST paper. Your work is a very valuable contribution, but I have a question about normalizing the input features (e.g., fbank) for this kind of task. Is dataset-level normalization alone sufficient (for example, using only AudioSet's mean and variance)? When I additionally apply CMVN, gradient explosion occurs. Does the input necessarily need to be normalized with a fixed mean and variance? LayerNorm seems to converge faster but fine-tunes worse. Could you answer this question for me? I would appreciate it and hope you'll write back. Good luck with your studies.