Closed kremHabashy closed 2 years ago
Hi there,
For the first question, we require a fixed input length for each task, but the length doesn't have to be 10s (the one for AudioSet). We use 5s for ESC-50 and 1s for Speech Commands, and they are both fine (please check our recipes). You just need to specify `input_tdim` as the number of frames (audio length in seconds × 100) when you instantiate the AST model. In general, you can use the mean/max audio length of your task for `input_tdim`. For your example, the appropriate number is probably 100 or 200. When you set `input_tdim=200`, the script cuts/pads your audios to 200 frames instead of 1000 frames. Further, even if the model is pretrained on AudioSet with `input_tdim=1000`, it can be transferred to your task with `input_tdim=200`; the trick is positional embedding adaptation (handled automatically by our code, just set `imagenet_pretrain=True` and `audioset_pretrain=True`). In our ESC-50 recipe, we find AudioSet pretraining leads to an ~6% improvement even when `input_tdim` is not consistent between pretraining and fine-tuning.
Setting an appropriate `input_tdim` not only makes learning simpler but also significantly improves computational efficiency (AST is O(n^2), where n is the length of the input).
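The cut/pad step described above can be sketched as follows. This is a minimal NumPy illustration, not the repo's actual dataloader code; the 100 frames-per-second figure assumes the default 10 ms frame shift:

```python
import numpy as np

def cut_or_pad(fbank: np.ndarray, target_tdim: int) -> np.ndarray:
    """Cut or zero-pad an fbank of shape (n_frames, n_mels) to target_tdim frames."""
    n_frames, n_mels = fbank.shape
    if n_frames >= target_tdim:
        return fbank[:target_tdim]  # cut extra frames from the end
    pad = np.zeros((target_tdim - n_frames, n_mels), dtype=fbank.dtype)
    return np.concatenate([fbank, pad], axis=0)  # pad below with zeros ("silence")

# e.g. a 1.5 s clip at 100 frames/s -> 150 frames, padded up to input_tdim=200
fbank = np.random.randn(150, 128).astype(np.float32)
out = cut_or_pad(fbank, 200)
print(out.shape)  # (200, 128)
```

The same function also handles clips longer than `input_tdim` by truncating them, which is the behaviour the reply above refers to as "cut".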
```python
ASTModel(label_dim=527,
         fstride=10, tstride=10,
         input_fdim=128, input_tdim=1024,
         imagenet_pretrain=True, audioset_pretrain=False,
         model_size='base384')
```
For the second question, can you explain more about "for the patches that have an overlap of fbank and spectrogram, is there no effect on learning there?"? I think AST also works quite well without overlap; overlapping can improve performance, but not dramatically.
Also, if a specific category is relatively shorter in length to another, does the model include that audio file length in its representation of that class?
Yes, and we should avoid the model making predictions based on such nuisance factors. But that is not a problem specific to AST; it is a matter of dataset preparation. I think other DNN models would be impacted as well.
-Yuan
Hi Yuan,
Thank you for the quick reply. No further issues with the first question; the insight on `input_tdim` is greatly appreciated. For the second question, my apologies, I did not mean a mix between fbank and spectrogram, but rather a mix of fbank and padded 0's (a patch containing both fbank information and blank space). When these patches are fed into the model for training, is there no impact on learning from that end?
The third question has also been addressed.
Thank you
Also, could you comment on why the padded regions in the two images above are different colours despite using the same padding, and on the implications this may have on learning?
Thanks again
I think using a mixture of padding and valid spectrogram is fine, as padding can be viewed as silence. The different colour is due to matplotlib's colormap scaling; if you look at the values of the fbank matrices, they should be the same.
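The colour discrepancy can be reproduced without any audio: by default, matplotlib's `imshow` normalises each image to its own min/max, so the same padding value 0.0 lands at a different point of the colormap depending on the rest of the matrix. A minimal sketch with hypothetical values, mimicking the default per-image normalisation in pure NumPy:

```python
import numpy as np

def normalize_like_imshow(x: np.ndarray) -> np.ndarray:
    """Default matplotlib behaviour: rescale each image to [0, 1] by its own min/max."""
    return (x - x.min()) / (x.max() - x.min())

# Two toy "fbanks" with identical zero padding (last rows) but different value ranges.
a = np.concatenate([np.full((5, 4), -10.0), np.zeros((3, 4))])                     # values in [-10, 0]
b = np.concatenate([np.full((5, 4), -2.0), np.zeros((3, 4)), np.full((2, 4), 2.0)])  # values in [-2, 2]

# The padded zeros are numerically identical in both matrices ...
print(a[5, 0], b[5, 0])  # 0.0 0.0

# ... but after per-image normalisation they map to different colormap positions.
na, nb = normalize_like_imshow(a), normalize_like_imshow(b)
print(na[5, 0], nb[5, 0])  # 1.0 0.5
```

Passing explicit `vmin`/`vmax` to `imshow` (e.g. `plt.imshow(fbank, vmin=-10, vmax=10)`) makes the colour of 0 consistent across images; either way, the underlying fbank values are unchanged, so the model sees identical inputs.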
Hi Yuan, thank you for your answers, this can be closed now.
Hi Yuan,
Thank you for this great work! I am currently fine-tuning the models you produced for a project I am working on and really appreciate the opportunity you have created for me. I had a question regarding the spectrograms (or fbanks) produced by the wav2vec function.
Currently, I am trying to prepare a dataset to match the requirements of the model but have stumbled upon something that grabbed my attention: you mention in the paper that the model accepts variable inputs. Taking a closer look, I found that this is due to the padding added below the fbank; this is done to fix the input dimensions for the model. However, when I applied this to my own data, I saw that the padding was a different colour depending on the image when I converted them. Here are two examples (images attached). Although I am aware that the values of the solid coloured areas are zeros, I worry that this is indicative of the same colour being attributed to different values in different spectrograms, and how that would impact the model's understanding of colour.
My second question is regarding the use of padding specifically. In the ViT paper as well as AST, images are fed through as a collection of patches for learning. Any patches that are fully blank naturally would not add much information to the model. However, for the patches that have an overlap of fbank and spectrogram, is there no effect on learning there? Also, if a specific category is relatively shorter in length than another, does the model include that audio file length in its representation of that class?
Any insight on the above would be deeply appreciated. Thanks again