ranjith1604 closed 2 years ago
You should be able to calculate this info for your dataset using get_norm_stats.py.
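As a rough illustration of what such a statistics pass could look like, here is a minimal sketch. It is not the contents of get_norm_stats.py; it assumes the statistics are a single scalar mean and standard deviation taken over all filterbank values (frames x mel bins) of the dataset. The function name `dataset_mean_std` and the toy data are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple


def dataset_mean_std(
    features: Iterable[Sequence[Sequence[float]]],
) -> Tuple[float, float]:
    """Return the scalar mean and population std over every value in every matrix.

    Each item in `features` is assumed to be one clip's filterbank matrix,
    given as a sequence of frames, each frame a sequence of mel-bin values.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for matrix in features:
        for frame in matrix:
            for value in frame:
                count += 1
                total += value
                total_sq += value * value
    if count == 0:
        raise ValueError("no feature values were provided")
    mean = total / count
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Toy stand-in for real filterbank matrices (hypothetical values).
toy_features = [
    [[1.0, 2.0], [3.0, 4.0]],  # clip 1: 2 frames x 2 bins
    [[5.0, 6.0]],              # clip 2: 1 frame x 2 bins
]
mean, std = dataset_mean_std(toy_features)
print(mean, std)
```

Accumulating the sum and the sum of squares keeps memory use constant, so the whole dataset never has to be held in memory at once.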
Thank you for the help. I also have a follow-up question: when using the pretrained model, input_tdim is the dimension of the time axis after forming the spectrogram, right?
Yes, it is the number of frames. For 1-second audio, there should be around 100 frames (we use a hop of 10 ms). If you use AudioSet pretraining, it is important to generate the spectrogram with the same settings as ours, i.e., this.
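To make the "around 100 frames per second" figure concrete, here is a small sketch of the frame-count arithmetic. Only the 10 ms hop comes from the reply above; the 16 kHz sample rate, the 25 ms window, and Kaldi-style framing without edge padding are assumptions, and the helper name `num_frames` is hypothetical.

```python
def num_frames(
    num_samples: int,
    sample_rate: int = 16000,       # assumed sample rate
    frame_length_ms: float = 25.0,  # assumed window length
    frame_shift_ms: float = 10.0,   # hop of 10 ms, as stated in the thread
) -> int:
    """Count the frames that fit fully inside the signal (no edge padding)."""
    window = int(sample_rate * frame_length_ms / 1000)
    shift = int(sample_rate * frame_shift_ms / 1000)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


print(num_frames(16000))   # 1-second clip  -> 98
print(num_frames(160000))  # 10-second clip -> 998
```

Under these assumptions, input_tdim for a clip would be close to 100 times its duration in seconds, slightly less because the last partial window is dropped.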
You have mentioned that if we want to use your pre-trained model, we need to take care of the input normalization. In your code, I observed that you have manually added the mean and std for each of the datasets you used. How are we supposed to calculate the mean and std of our own dataset? Do we calculate them after computing the fbank for each audio signal, or are they calculated from the raw audio? It would be great if you could provide some clarity on this.