YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

About code (lines 100-106) from dataloader.py #88

Open poult-lab opened 1 year ago

poult-lab commented 1 year ago

Dear Mr. Gong, thanks a lot for your pioneering work in the field of audio processing, and for your warmhearted replies every time. I have a question about the use of the mixup method in AST. I saw line 102 of dataloader.py: waveform = waveform - waveform.mean(). My question is: why does the waveform need to have its mean subtracted? Is that subtraction part of the original mixup method, or is there another reason behind it?

YuanGongND commented 1 year ago

Hi there,

Thanks for reaching out.

I think the waveform mean subtraction is not related to mixup. Subtracting the mean of the waveform is a quite commonly used method to remove the DC offset. Doing it before mixup is just to be safe. I haven't conducted experiments on the impact of waveform mean subtraction, but I guess the impact is minor, as we do another normalization on the spectrogram afterwards. My guess is that if your training and test use a consistent dataloader, removing waveform = waveform - waveform.mean() would be fine. But since it is quite standard, I'd prefer to keep it there.
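For reference, a minimal sketch of what the mean subtraction does, using a toy sine wave with an artificial DC offset (hypothetical data, not from the repo):

```python
import numpy as np

# A toy 440 Hz sine wave at 16 kHz with an artificial DC offset of 0.1.
t = np.linspace(0, 1, 16000, endpoint=False)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.1

# DC offset removal, as in dataloader.py:
waveform = waveform - waveform.mean()

# The mean is now (numerically) zero; the signal itself is unchanged
# apart from that constant vertical shift.
print(abs(waveform.mean()) < 1e-9)  # → True
```

The subtraction only shifts the signal vertically; the spectrum above 0 Hz is untouched, which is why the impact on the downstream spectrogram is minor.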

-Yuan

poult-lab commented 1 year ago

Thank you so much, sir.

poult-lab commented 1 year ago

Dear Mr. Gong, sorry to bother you again. I saw that you use the normalization fbank = (fbank - self.norm_mean) / (self.norm_std * 2), so the resulting mean and std are 0 and 0.5, respectively. As far as I know, standard z-normalization is fbank = (fbank - self.norm_mean) / (self.norm_std), giving mean 0 and std 1. My question is: did you choose the former based on experiments, or is there another reason behind it?

YuanGongND commented 1 year ago

I think I answered this in a previous issue (see here).

You are exactly correct that fbank = (fbank - self.norm_mean) / (self.norm_std) is the standard method. But in my preliminary experiments, I found that restricting the input to a smaller variance leads to a minor performance improvement when ImageNet pretraining is used. My guess at the time was that the audio spectrogram's distribution is different from that of RGB images. However, in my follow-up experiments, I found the impact is very small. So you can use either one.
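Concretely, the two variants differ only in the spread of the normalized input. A minimal sketch with toy values standing in for real filterbank features (the actual stats come from scanning the training set):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a batch of log-mel filterbank features (hypothetical values).
fbank = rng.normal(loc=-4.0, scale=4.0, size=(1024, 128))

norm_mean = fbank.mean()
norm_std = fbank.std()

# Standard z-normalization: result has mean 0, std 1.
standard = (fbank - norm_mean) / norm_std

# The AST dataloader variant: dividing by 2*std halves the spread,
# giving mean 0, std 0.5.
ast_style = (fbank - norm_mean) / (norm_std * 2)

print(round(standard.std(), 3), round(ast_style.std(), 3))  # → 1.0 0.5
```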

Again, I want to emphasize that, although it doesn't matter which one you use, it is important to keep it consistent between training and inference. Specifically, if you want to use our pretrained models, please stick to our dataloader without any change.

Finally, I recommend first running our original code to see if you can reproduce our claimed results; if yes, then you can play with the model under various settings.

-Yuan

boschhd commented 1 year ago

Dear Yuan,

thank you for the great code repository and for maintaining it! I have a small follow-up question regarding the waveform normalization (waveform = waveform - waveform.mean()) in dataloader.py and its absence in predict.py.

The dataloader is used both for generating the normalization stats and for training, and it normalizes the waveform before the transition into the frequency domain. The predict code, however, loads the audio itself and has no waveform normalization. Do you think that makes a difference?

YuanGongND commented 1 year ago

@boschhd

hi Harald,

This is not intentional: https://github.com/YuanGongND/ast/blob/master/egs/audioset/inference.py was not authored by me but contributed by the community; I would have added waveform = waveform - waveform.mean() if I were the author.

That said, this is a minor thing that just removes the DC constant; if you check waveform.mean(), it is usually a small value.

The thing that really makes a big difference is the spectrogram normalization at https://github.com/YuanGongND/ast/blob/9e3bd9942210680b833b08c39d09f2284ddc4d1d/src/dataloader.py#L202.

Without DC removal, the code probably still runs well; without the fbank normalization, the inference is almost sure to fail.
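To make the point concrete, here is a sketch of applying the training-set spectrogram statistics at inference time. The AudioSet mean/std constants below are the ones hard-coded in the repo's recipes; treat them as an assumption and verify against your checkout before relying on them. The toy fbank array is hypothetical stand-in data:

```python
import numpy as np

# Dataset-level stats computed over the training set (AudioSet values
# from the AST recipes; verify against the repo before relying on them).
NORM_MEAN = -4.2677393
NORM_STD = 4.5689974

def normalize_fbank(fbank: np.ndarray) -> np.ndarray:
    """Apply the same spectrogram normalization as the training dataloader."""
    return (fbank - NORM_MEAN) / (NORM_STD * 2)

# Toy log-mel features standing in for a real spectrogram.
rng = np.random.default_rng(0)
fbank = rng.normal(loc=NORM_MEAN, scale=NORM_STD, size=(1024, 128))

normalized = normalize_fbank(fbank)
raw_scale = float(np.abs(fbank).mean())        # roughly |NORM_MEAN|
norm_scale = float(np.abs(normalized).mean())  # roughly 0.4

# Feeding `fbank` instead of `normalized` hands the model inputs on a
# completely different scale, which is why skipping this step at
# inference is almost sure to fail.
print(raw_scale > 10 * norm_scale)  # → True
```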

Finally, I recommend using https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb for inference instead of inference.py. The notebook is authored by me and provides more functionality (e.g., attention maps).

-Yuan