Closed daisukelab closed 3 years ago
Hi there,
Yes, I think it is a typo in the paper. The difference between the two windows is small, so you can use either one if you train your model from scratch, but it might be better to keep using the Hann window with the AudioSet pretrained models, as we trained them with the Hann window.
Please refer to the torchaudio documentation; I see that htk_compat
is also set differently from its default value.
-Yuan
Hi Yuan, thank you for your very quick comments.
OK, I will keep using Hann.
And I found that torchaudio.compliance.kaldi.fbank()
makes a big difference on datasets closer to AudioSet;
performance jumped on those datasets.
Interestingly, performance on non-ESC datasets hardly changed at all
(compared to mel-spectrograms computed locally with torchaudio.transforms.MelSpectrogram).
Thank you for your support. I appreciate it, and I hope to publish results based on this soon...
It's good to know. Thanks!
I would also suggest taking care of the normalization: for a new dataset, use that dataset's own mean/std for normalization rather than reusing the AudioSet mean/std. You can check the other issue in the repo about that.
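One way to do this per-dataset normalization: compute a scalar mean/std over the training split's fbank features once, then normalize every example with those statistics. A generic sketch with random stand-in features (the array shapes and values are assumptions, not taken from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a training set of log-mel features: 100 clips of (frames, mel bins)
train_feats = [rng.normal(loc=-4.0, scale=4.0, size=(98, 128)) for _ in range(100)]

# Dataset-level statistics: one scalar mean/std over all frames and bins,
# computed on the training split only (not reused from AudioSet).
all_vals = np.concatenate([f.ravel() for f in train_feats])
norm_mean = all_vals.mean()
norm_std = all_vals.std()

def normalize(fbank, mean=norm_mean, std=norm_std):
    """Zero-mean, unit-variance normalization with dataset statistics."""
    return (fbank - mean) / std

z = normalize(train_feats[0])
print(float(z.mean()), float(z.std()))  # roughly 0 and 1 for a single clip
```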
Good luck with your publication.
-Yuan
Hi Yuan, yes, that's already implemented. My system handles normalization by default, so it is quite compatible with your method. :) Thanks again!
Hi, thank you for sharing the reproducible code.
Let me ask a few questions about the details of computing the fbanks.
According to the paper, a Hamming window is used, but the following code uses Hann. Is the Hann window the one actually used? https://github.com/YuanGongND/ast/blob/master/src/dataloader.py#L129
Also, are all the other fbank parameters left at their defaults?
Thanks in advance!