YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Question about wav2fbank detail #13

Closed daisukelab closed 2 years ago

daisukelab commented 2 years ago

Hi, thank you for sharing the reproducible code.

I have a couple of questions about the details of computing the fbanks.

  1. According to the paper, a Hamming window is used, but the following code uses Hann. Is Hann the one actually used? https://github.com/YuanGongND/ast/blob/master/src/dataloader.py#L129

  2. Are all other parameters for computing the fbanks left at their defaults?

Thanks in advance!

YuanGongND commented 2 years ago

Hi there,

  1. Yes, I think it is a typo in the paper. I think the difference between the two windows is small, so you can use either one if you train your model from scratch. For the AudioSet-pretrained models, though, it might be better to keep using the Hann window, since we trained those models with it.

  2. Please refer to the torchaudio documentation. Note that htk_compat is also set differently from its default value.
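To make the window comparison in item 1 concrete, here is a minimal sketch in plain Python that evaluates the textbook symmetric Hann and Hamming formulas and reports how far apart they are. The frame length of 400 samples (25 ms at 16 kHz) is an assumption for illustration, not something stated in this thread.

```python
import math

def hann(n_points):
    """Symmetric Hann window: 0.5 - 0.5 * cos(2*pi*n / (N - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def hamming(n_points):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

N = 400  # assumed frame length: 25 ms at 16 kHz
w_hann = hann(N)
w_hamming = hamming(N)

# The two windows differ by 0.04 + 0.04 * cos(theta), which is largest at the edges.
max_gap = max(abs(a - b) for a, b in zip(w_hann, w_hamming))

print(f"edge values: hann={w_hann[0]:.2f}, hamming={w_hamming[0]:.2f}")
# edge values: hann=0.00, hamming=0.08
print(f"largest pointwise gap: {max_gap:.2f}")
# largest pointwise gap: 0.08
```

The windows only disagree noticeably near the frame edges, where Hann tapers to zero and Hamming stays at 0.08. A model trained with one window can still be sensitive to that mismatch, which is why matching the pretraining window is the safer choice.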

-Yuan

daisukelab commented 2 years ago

Hi Yuan, thank you for your very quick comments.

OK, I will keep using Hann. I also found that torchaudio.compliance.kaldi.fbank() makes a big difference on datasets that are closer to AudioSet: performance jumped for those datasets. Interestingly, performance on non-ESC datasets barely changed. (This is in comparison with mel-spectrograms that I converted locally using torchaudio.transforms.MelSpectrogram.)

Thank you for your support. I appreciate it, and I hope to publish using these results soon.

YuanGongND commented 2 years ago

It's good to know. Thanks!

I would also suggest taking care of the normalization: for new datasets, I suggest using the dataset's own mean / std for normalization rather than re-using the AudioSet mean / std. You can check the other issue in the repo about that.
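As a rough illustration of that suggestion, the sketch below computes a single mean and std over all feature values in a toy training set and uses them to standardize each clip. The helper names and the numbers are made up for the example, and the repo's own normalization may scale things differently.

```python
import math

def dataset_stats(clips):
    """Mean and std over every value of every frame of every clip."""
    values = [v for clip in clips for frame in clip for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalize(clip, mean, std):
    """Standardize one clip with the dataset-level statistics."""
    return [[(v - mean) / std for v in frame] for frame in clip]

# Toy "dataset": two clips, each a list of frames of log-mel values (made-up numbers).
train_clips = [
    [[-6.0, -4.0], [-5.0, -3.0]],
    [[-2.0, -4.0], [-7.0, -1.0]],
]

mean, std = dataset_stats(train_clips)
normalized = [normalize(clip, mean, std) for clip in train_clips]

print(f"mean={mean:.2f}, std={std:.2f}")
# mean=-4.00, std=1.87
```

The statistics should come from the training split of the new dataset only, and the same two numbers are then applied at evaluation time.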

Good luck with your publication.

-Yuan

daisukelab commented 2 years ago

Hi Yuan, yes, that's already implemented. My system handles normalization by default, so it is quite compatible with your method. :) Thanks again!