Closed poult-lab closed 1 year ago
Hi there,
By random noise, we actually meant this https://github.com/YuanGongND/ast/blob/d7d8b4b8e06cdaeb6c843cdb38794c1c7692234c/src/dataloader.py#L208, i.e., adding a small value to the original input.
fbank = torch.roll(fbank, np.random.randint(-10, 10), 0)
is the random time-shift, i.e., it shifts the input along the time dimension. There is a lot of literature on why adding small noise during training is helpful. In short, we don't want the model to memorize input/prediction pairs; instead, we hope it generalizes better. Keeping the model from seeing repeated samples can alleviate overfitting. A small time-shift or a small amount of noise does not change the semantic information of the audio, so a good model is expected to still make correct predictions on these augmented samples. Note that both augmentations are applied only in the training stage, not in the evaluation stage.
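For concreteness, here is a minimal NumPy sketch of the two training-only augmentations described above. The actual dataloader works on torch tensors and its exact noise scale differs; `shift_range` and `noise_scale` here are illustrative values, not the repo's settings.

```python
import numpy as np

def augment_fbank(fbank, shift_range=10, noise_scale=0.1, training=True):
    """Random time-shift plus small additive noise, applied only in training."""
    if not training:
        return fbank  # evaluation: leave the input untouched
    # Roll along the time axis (axis 0), like torch.roll(fbank, shift, 0)
    shift = np.random.randint(-shift_range, shift_range)
    fbank = np.roll(fbank, shift, axis=0)
    # Add small uniform noise, analogous to the dataloader's noise line
    fbank = fbank + np.random.rand(*fbank.shape) * noise_scale
    return fbank

x = np.zeros((512, 128))       # e.g. 512 frames, 128 mel bins
y = augment_fbank(x)
print(y.shape)                 # same shape as the input
```

Because the shift wraps around and the noise is small, the augmented sample keeps the same label as the original.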
Having said that, I don't think these two lines of code actually change the performance much, or at all - the mixup and SpecAugment augmentations we used are stronger, so random noise or time-shift might not be necessary.
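As a rough sketch of why those two augmentations are stronger: mixup blends two training examples (inputs and labels) with a weight drawn from a Beta distribution, and SpecAugment zeroes out random time and frequency bands. The NumPy version below uses illustrative parameters, not the repo's exact implementation (which uses torchaudio's masking transforms):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=10.0):
    """Blend two examples and their labels with a Beta-distributed weight."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def spec_mask(fbank, max_time=48, max_freq=24):
    """SpecAugment-style masking: zero one random time band and one freq band."""
    fbank = fbank.copy()
    T, F = fbank.shape
    t0 = np.random.randint(0, T - max_time)
    fbank[t0:t0 + np.random.randint(1, max_time)] = 0.0
    f0 = np.random.randint(0, F - max_freq)
    fbank[:, f0:f0 + np.random.randint(1, max_freq)] = 0.0
    return fbank
```

Both transforms perturb the input far more than a ±10-frame roll or a small additive noise, which is why the weaker augmentations contribute little on top of them.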
-Yuan
Thanks bro, I have learned a lot from your comments.
Actually, I only used the time-shift code 'fbank = torch.roll(fbank, np.random.randint(-10, 10), 0)', and my accuracy increased a lot (roughly 7%). But I don't know the reason, given that it is just a random time-shift.
Which dataset did you use? Is 7% a relative improvement? I am surprised by it.
Yeah, actually I made that dataset myself, but it is based on ESC-50. The 7% depends on the conditions.
I haven't done the experiment but my guess is there won't be a 7% improvement on the original ESC-50.
I also guess the benefit of random time-shift/noise decreases with the increase of data volume.
If you see a big improvement, you could also increase the shift range to see if there's a further improvement; (-10, 10) is very small compared with the 512 frames of ESC-50 audio clips (5 seconds).
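A quick sketch of such a sweep, assuming 512 frames per clip; the wider ranges below are hypothetical values to try, not recommendations from the repo:

```python
import numpy as np

def time_shift(fbank, shift_range):
    """Roll the fbank along the time axis by a random number of frames."""
    shift = np.random.randint(-shift_range, shift_range)
    return np.roll(fbank, shift, axis=0)

# The default (-10, 10) covers under 2% of a 512-frame ESC-50 clip;
# wider ranges shift a larger fraction of the input.
fbank = np.random.randn(512, 128)
for shift_range in (10, 32, 64, 128):
    shifted = time_shift(fbank, shift_range)
    print(f"±{shift_range} frames covers {shift_range / 512:.1%} of the clip")
```

Since np.roll wraps values around rather than discarding them, the label of the clip is unchanged no matter how large the shift is.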
-Yuan
Hello Mister Gong, thank you for your kind reply, but I don't understand this sentence: "the benefit of random time-shift/noise decreases with the increase of data volume". Do you mean we can decrease the data volume through random time-shift? What does "data volume" denote?
No, sorry for the confusion.
By data volume, I mean the size of your training set (the number of samples in your dataset). Usually, when you have a smaller dataset, data augmentation is more helpful for alleviating overfitting; that's why I said the benefit of data augmentation decreases as your data volume increases.
Anyway, this is not directly related to AST and I am not an expert on ML theory. That's just my guess - I don't think either random noise or time-shift will give a 7% improvement on the original ESC-50.
Thank you so much Mister Gong, I will close this issue.
Hello Mister Gong, thank you for your wonderful work, it really helps me a lot. I have a few questions about the random noise that you injected into the Speech Commands dataset. We can see that the random noise is useful for Speech Commands; may I know the deeper reason? Why is it so useful for short audio?
And also, about the code
fbank = torch.roll(fbank, np.random.randint(-10, 10), 0)
from line 209 in dataloader.py: why did Mister Gong use this code for the random noise? Where did Mister Gong get the inspiration for it? Looking forward to your reply.