Open Yuanbo2020 opened 2 years ago
Hi Yuanbo,
Instead of padding the audio to 1024 frames, it might be worthwhile to instantiate the AST model with t_dim=100 for your 1-second audio.
If you have some training data, I think you can try to fine-tune the AST model.
When you say
"Unfortunately, the AST predicts badly, and there are obvious misclassifications, such as recognizing multiple birdsongs as music, etc."
do you mean that you take the class with the largest logit as the prediction? The AST model is trained with BCE loss, so the output logits are not normalized across classes: each class is scored independently, and several classes can be active for the same clip.
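To illustrate the point about BCE-trained outputs, here is a minimal sketch of reading the logits as independent per-class scores instead of taking a single argmax. The label names, logit values, and the 0.5 threshold are all assumptions chosen for illustration; they are not real AST output or a recommended setting.

```python
import math


def sigmoid(x: float) -> float:
    """Map one logit to an independent score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tags_above_threshold(logits, labels, threshold=0.5):
    """Return (label, score) pairs whose sigmoid score reaches the threshold,
    sorted from highest to lowest score."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    kept = [(label, p) for label, p in scored if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for three classes (illustrative values only).
labels = ["Music", "Bird vocalization", "Speech"]
logits = [1.2, 0.8, -3.0]

# Both "Music" and "Bird vocalization" pass the threshold here, whereas an
# argmax over the logits would report "Music" alone.
print(tags_above_threshold(logits, labels))
```

With multi-label scores like these, a clip containing birdsong can rank "Music" first and still carry a high birdsong score, so it may help to inspect the top few classes before concluding that the prediction is wrong.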
-Yuan
Hi Yuan,
Thank you so much for replying.
I am a little confused. Could you please tell me which trained AST model accepts t_dim=100?
When loading the parameters of the trained AST, the shape of pos_embed is torch.Size([1, 1214, 768]), but if t_dim is set to 100, the corresponding parameter is torch.Size([1, 110, 768]), which obviously does not match.
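The two sizes above can be reproduced with a little arithmetic. The sketch below assumes 128 mel bins, 16x16 patches taken with stride 10 and no padding, and two extra tokens prepended to the patch sequence; these values are assumptions that happen to match the shapes quoted in this thread, not something read from the checkpoint.

```python
def num_patches(dim: int, patch: int = 16, stride: int = 10) -> int:
    """Number of patches along one axis for an unpadded strided split."""
    return (dim - patch) // stride + 1


def pos_embed_len(f_dim: int, t_dim: int, extra_tokens: int = 2) -> int:
    """Sequence length of the positional embedding: patch grid plus extra tokens."""
    return num_patches(f_dim) * num_patches(t_dim) + extra_tokens


# Assumed input sizes: 128 frequency bins, 1024 vs. 100 time frames.
print(pos_embed_len(128, 1024))  # 12 * 101 + 2 = 1214
print(pos_embed_len(128, 100))   # 12 * 9 + 2 = 110
```

Under these assumptions the frequency axis always yields 12 patches, so the mismatch comes entirely from the time axis shrinking from 101 patches to 9.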
Thanks again!
Yuanbo
The trick is that we trim or interpolate the positional embedding.
To use the AudioSet-pretrained model, you just need to specify t_dim when you initialize the AST model. It is not recommended to call torch.load yourself; otherwise you will need to handle the positional embedding trimming yourself.
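For intuition only, trimming can be sketched as selecting a subset of token positions. The example below is a hypothetical illustration, not the repository's implementation: it assumes the patch tokens are stored frequency-major (all time patches of the first frequency row, then the next row), that two extra tokens come first, and that a center crop along the time axis is used. The function name `center_trim_indices` is made up for this sketch.

```python
def center_trim_indices(f_patches: int, t_old: int, t_new: int, n_prefix: int = 2):
    """Return the token indices to keep when center-cropping the time axis
    of a (f_patches x t_old) patch grid down to t_new time patches.

    Assumes n_prefix extra tokens precede the patch tokens and that the
    patch tokens are laid out frequency-major (hypothetical layout).
    """
    if t_new > t_old:
        raise ValueError("trimming cannot enlarge the grid; interpolate instead")
    start = (t_old - t_new) // 2          # first time patch to keep in each row
    keep = list(range(n_prefix))          # always keep the extra tokens
    for f in range(f_patches):
        row_start = n_prefix + f * t_old  # index of this row's first time patch
        keep.extend(range(row_start + start, row_start + start + t_new))
    return keep


# 12 x 101 patch grid (10-second model) trimmed to 12 x 9 (1-second input).
idx = center_trim_indices(f_patches=12, t_old=101, t_new=9)
print(len(idx))  # 110

# With a PyTorch tensor pos_embed of shape [1, 1214, 768], indexing as
# pos_embed[:, idx, :] should give a tensor of shape [1, 110, 768].
```

Letting the model constructor handle this, as suggested above, avoids having to get the token layout right by hand.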
In our ESC-50 recipe, we show an example of fine-tuning an AST model pretrained on 10-second audio with 5-second audio clips.
-Yuan
Hi there,
I want to use the pre-trained AST you provided for audio tagging on one-second audio clips. I followed your feature extraction method and padded the input to 1024 frames as in your inference script: https://github.com/YuanGongND/ast/blob/70c675ef682ba392e514962defa456a8e909d0da/egs/audioset/inference.py#L76 Unfortunately, the AST predicts badly, and there are obvious misclassifications, such as recognizing multiple birdsongs as music, etc.
At the same time, I used the pretrained CNN-based PANNs, which I guess you are familiar with, to predict these short audio clips, and the PANNs results turned out to be much more accurate than those predicted by AST.
Do you have any suggestions for using AST to predict audio events in one-second clips?
The audio clips I want to predict are here: https://urban-soundscapes.s3.eu-central-1.wasabisys.com/soundscapes/index.html If you are interested, I am happy to share the results I predicted with AST and PANNs respectively, and I hope to discuss them further.
Best, Yuanbo