YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

About random noise which author put on the speech command. #79

Closed poult-lab closed 1 year ago

poult-lab commented 1 year ago

Hello Mister Gong, thank you for your wonderful work, it really helps me a lot. I have a few questions about the random noise you injected into the Speech Commands dataset. We can see that the random noise is useful for Speech Commands; may I know the deeper reason? Why is it so useful for short audio clips?

Also, regarding the code fbank = torch.roll(fbank, np.random.randint(-10, 10), 0) at line 209 of dataloader.py: why do you use this code for the random noise? Where did you get the inspiration for it?

Looking forward to your reply.

YuanGongND commented 1 year ago

Hi there,

By random noise, we actually meant this https://github.com/YuanGongND/ast/blob/d7d8b4b8e06cdaeb6c843cdb38794c1c7692234c/src/dataloader.py#L208, i.e., adding a small value to the original input.

fbank = torch.roll(fbank, np.random.randint(-10, 10), 0) is the random time-shift, i.e., it shifts the input along the time dimension. There is a large literature on why adding a small amount of noise during training is helpful. In short, we don't want the model to memorize input/prediction pairs; instead, we hope it generalizes better. Keeping the model from seeing repeated samples can alleviate overfitting. A small time shift or a small amount of noise does not change the semantic information of the audio, so a good model is expected to still make correct predictions on these augmented samples. Note that both augmentations are applied only in the training stage, not in the evaluation stage.
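For concreteness, here is a minimal sketch of the two training-time augmentations described above, written against a (time, frequency) fbank tensor. The noise scale and the (-10, 10) shift range mirror the dataloader.py lines discussed here, but treat the exact expressions as an illustration rather than the canonical implementation.

```python
import torch
import numpy as np

def augment_fbank(fbank: torch.Tensor) -> torch.Tensor:
    # fbank: (time, frequency) log-mel filterbank; apply in the training stage only.
    # 1) Random noise: add a small random value to every bin so the model
    #    never sees exactly the same input/prediction pair twice.
    fbank = fbank + torch.rand(fbank.shape[0], fbank.shape[1]) * np.random.rand() / 10
    # 2) Random time-shift: roll the spectrogram along the time axis (dim 0)
    #    by up to +/-10 frames; the semantics of the audio do not change.
    fbank = torch.roll(fbank, np.random.randint(-10, 10), 0)
    return fbank
```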

Having said that, I don't think these two lines of code change the performance much, or at all, since the mixup and SpecAugment augmentations we use are stronger. Random noise or time-shift might not be necessary.

-Yuan

poult-lab commented 1 year ago

Thanks bro, I have learned a lot from your comments.

poult-lab commented 1 year ago

Actually, I only used the time-shift code fbank = torch.roll(fbank, np.random.randint(-10, 10), 0), and my accuracy increased a lot (roughly 7%). But I don't know the reason, even though it is just a random time-shift.

YuanGongND commented 1 year ago

Which dataset did you use? Is 7% a relative improvement? I am surprised by it.

poult-lab commented 1 year ago

Yeah, actually I made that dataset myself, but it is based on ESC-50.

poult-lab commented 1 year ago

The 7% depends on the conditions.

YuanGongND commented 1 year ago

I haven't done the experiment but my guess is there won't be a 7% improvement on the original ESC-50.

I also guess the benefit of random time-shift/noise decreases with the increase of data volume.

If you see a big improvement, you could also try increasing the shift range to see if there's a further improvement; (-10, 10) is very small compared with the 512 frames of the ESC-50 audio (5 seconds).
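As one hypothetical way to widen the range (not something in the repo), the shift could be drawn relative to the clip length instead of a fixed +/-10 frames:

```python
import torch
import numpy as np

def random_time_shift(fbank: torch.Tensor, max_frac: float = 0.1) -> torch.Tensor:
    # Hypothetical helper: shift by up to +/- max_frac of the total frames,
    # e.g. about 51 frames for a 512-frame ESC-50 clip, instead of a fixed +/-10.
    max_shift = max(1, int(fbank.shape[0] * max_frac))
    return torch.roll(fbank, np.random.randint(-max_shift, max_shift + 1), 0)
```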

-Yuan

poult-lab commented 1 year ago

Hello Mister Gong, thank you for your kind reply. But I don't understand the sentence "the benefit of random time-shift/noise decreases with the increase of data volume". Do you mean we can decrease the data volume through random time-shift? What does "data volume" denote?

YuanGongND commented 1 year ago

No, sorry for the confusion.

By data volume, I mean the size of your training set (the number of samples in your dataset). Usually, when you have a smaller dataset, data augmentation is more helpful for alleviating overfitting; that's why I said the benefit of data augmentation decreases as your data volume increases.

Anyway, this is not directly related to AST and I am not an expert in ML theory. It's just my guess that neither random noise nor time-shift would give a 7% improvement on the original ESC-50.

poult-lab commented 1 year ago

Thank you so much, Mister Gong. I will close this issue.