YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Normalizing the train and test data #48

Closed: ranjith1604 closed this issue 2 years ago

ranjith1604 commented 2 years ago

You have mentioned that if we want to use your pre-trained model, we need to take care of the input normalization. In your code, I observed that you have manually added the mean and std for each of the datasets you used. How are we supposed to calculate the mean and std of our own dataset? Do we calculate them after computing the fbank for each audio signal, or are they calculated from the raw audio? It would be great if you could provide some clarity on this.

kremHabashy commented 2 years ago

You should be able to calculate this info for your dataset using get_norm_stats.py.
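For reference, the statistics are computed on the log-mel fbank features (not the raw waveform) and averaged over the training set. Below is a minimal sketch of that idea, not the repo's exact script: `wav_paths` is a hypothetical list of training audio files (the real get_norm_stats.py iterates the training dataloader), and the fbank parameters are assumed to match the repo's dataloader settings.

```python
import torch
import torchaudio

def estimate_norm_stats(wav_paths, num_mel_bins=128):
    """Rough estimate of dataset-level mean/std of log-mel fbank features."""
    means, stds = [], []
    for path in wav_paths:
        waveform, sr = torchaudio.load(path)
        waveform = waveform - waveform.mean()
        # Kaldi-compatible log-mel filterbank, 25 ms window / 10 ms hop.
        fbank = torchaudio.compliance.kaldi.fbank(
            waveform, htk_compat=True, sample_frequency=sr,
            use_energy=False, window_type='hanning',
            num_mel_bins=num_mel_bins, dither=0.0, frame_shift=10)
        means.append(fbank.mean())
        stds.append(fbank.std())
    return torch.tensor(means).mean().item(), torch.tensor(stds).mean().item()
```

The resulting pair would then take the place of the per-dataset mean and std that are hard-coded for the datasets shipped with the repo.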

ranjith1604 commented 2 years ago

Thank you for the help. Also, a follow-up question: when using the pretrained model, is input_tdim the dimension of the time axis after forming the spectrogram?

YuanGongND commented 2 years ago

Yes, it is the number of frames; for 1-second audio there should be around 100 frames (we use a hop of 10 ms). If you use AudioSet pretraining, it is important to generate the spectrogram with the same settings as us, i.e., this.
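For concreteness, here is a hedged sketch of generating a spectrogram with those settings (128 mel bins, 10 ms hop, Kaldi-compatible fbank via torchaudio); `sample.wav` is a placeholder path, and the exact parameters should be confirmed against the linked dataloader code.

```python
import torchaudio

# Load a mono clip (placeholder path); 16 kHz audio is assumed.
waveform, sr = torchaudio.load('sample.wav')
waveform = waveform - waveform.mean()

# Kaldi-compatible fbank with the default 25 ms window and a 10 ms hop,
# assumed to match the settings used for the AudioSet-pretrained models.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

# fbank.shape[0] is the number of frames: roughly 100 per second of audio.
# This is the value input_tdim should match (after padding or truncation).
print(fbank.shape)  # e.g. approximately (100, 128) for a 1-second clip
```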