YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Poor performance of AST on audio clips with different lengths #60

Open Yuanbo2020 opened 2 years ago

Yuanbo2020 commented 2 years ago

Hi there,

I want to use the pre-trained AST you provided for audio tagging on one-second audio clips. I followed your feature extraction method and padded the input to 1024 frames as done in https://github.com/YuanGongND/ast/blob/70c675ef682ba392e514962defa456a8e909d0da/egs/audioset/inference.py#L76. Unfortunately, AST performs poorly and makes obvious misclassifications, such as recognizing several birdsong clips as music.
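For reference, this is roughly what my feature extraction and padding look like, following the linked inference.py (a sketch of my pipeline, using the AudioSet normalization statistics from the repo):

```python
import torch
import torchaudio

def make_features(wav_path, mel_bins=128, target_length=1024):
    # Kaldi-style log-mel filterbank, matching the AST AudioSet recipe.
    waveform, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=mel_bins, dither=0.0,
        frame_shift=10)

    # Zero-pad (or trim) the time axis to target_length frames;
    # a 1-second clip gives ~100 frames, so ~900+ frames of padding.
    n_frames = fbank.shape[0]
    p = target_length - n_frames
    if p > 0:
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, p))
    elif p < 0:
        fbank = fbank[:target_length, :]

    # Normalize with the AudioSet dataset statistics from the repo.
    fbank = (fbank - (-4.2677393)) / (4.5689974 * 2)
    return fbank
```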

At the same time, I used the pretrained CNN-based PANNs, which I guess you are familiar with, to predict on the same short clips, and the results from PANNs are much more accurate than those from AST.

Do you have any suggestions for getting AST to predict audio events in one-second clips?

The audio clips I want to predict on are here: https://urban-soundscapes.s3.eu-central-1.wasabisys.com/soundscapes/index.html If you are interested, I would be happy to share the predictions from AST and PANNs, and I hope to discuss them further.

Best, Yuanbo

YuanGongND commented 2 years ago

Hi Yuanbo,

  1. Instead of padding the audio to 1024 frames, it might be worthwhile to instantiate the AST model with t_dim=100 for your 1-second audio (see the sketch after this list).

  2. If you have some training data, I think you can try to fine-tune the AST model.

  3. When you say

Unfortunately, AST performs poorly and makes obvious misclassifications, such as recognizing several birdsong clips as music.

Do you mean you take the class with the largest logit as the prediction? The AST model is trained with BCE loss, so the output logits are not normalized across classes; per-class scores should go through a sigmoid instead.
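To make points 1 and 3 concrete, here is a minimal sketch (the constructor argument for the time dimension is named input_tdim in src/models/ast_models.py; the feature tensor is random just for illustration):

```python
import torch
from src.models.ast_models import ASTModel

# Instantiate AST for 1 s inputs (~100 frames at a 10 ms frame shift).
# With audioset_pretrain=True the AudioSet checkpoint is loaded and the
# positional embedding is trimmed/interpolated internally.
model = ASTModel(label_dim=527, input_fdim=128, input_tdim=100,
                 imagenet_pretrain=True, audioset_pretrain=True)
model.eval()

features = torch.randn(1, 100, 128)  # (batch, time frames, mel bins)
with torch.no_grad():
    logits = model(features)         # shape: (1, 527)

# BCE-trained model: apply a per-class sigmoid rather than taking
# the argmax over raw logits.
probs = torch.sigmoid(logits)
top5 = torch.topk(probs[0], 5)
print(top5.values, top5.indices)
```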

-Yuan

Yuanbo2020 commented 2 years ago

Hi Yuan,

Thank you so much for replying.

I am a little confused; could you please tell me which trained AST model can accept t_dim=100?

When I load the parameters of the trained AST, the shape of pos_embed is torch.Size([1, 1214, 768]), but if t_dim is set to 100, the corresponding parameter has shape torch.Size([1, 110, 768]), which obviously does not match.
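For context, both shapes follow from the patch-count arithmetic (assuming the default 16x16 patches with stride 10 on 128 mel bins, plus the two special tokens):

```python
# Patch-count arithmetic behind the pos_embed shapes (default config:
# 16x16 patches, stride 10 on both axes, 128 mel bins, +2 for the
# [CLS] and distillation tokens).
def num_tokens(input_tdim, input_fdim=128, patch=16, stride=10):
    f = (input_fdim - patch) // stride + 1
    t = (input_tdim - patch) // stride + 1
    return f * t + 2

print(num_tokens(1024))  # 1214 -> pos_embed torch.Size([1, 1214, 768])
print(num_tokens(100))   # 110  -> pos_embed torch.Size([1, 110, 768])
```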

Thanks again!

Yuanbo

YuanGongND commented 2 years ago

The trick is that we trim or interpolate the positional embedding:

https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/models/ast_models.py#L141-L147

To use the AudioSet-pretrained model, you just need to specify t_dim when you initialize the AST model. It is not recommended to call torch.load yourself; otherwise, you will have to handle the positional embedding trimming on your own.
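Roughly, those linked lines do the following (a sketch, assuming the 12x101 patch grid of the 1024-frame pretrained model; the real code also handles the frequency axis and reattaches the cls/dist tokens):

```python
import torch

# The pretrained positional embedding (1212 patch tokens for a
# 12 x 101 freq-by-time grid) is reshaped into that grid, then either
# center-trimmed or bilinearly interpolated along the time axis.
pretrained = torch.randn(1, 1212, 768)               # cls/dist tokens excluded
grid = pretrained.transpose(1, 2).reshape(1, 768, 12, 101)

new_t = 9                                            # patch count for t_dim=100
if new_t <= 101:
    start = 101 // 2 - new_t // 2                    # trim around the center
    grid = grid[:, :, :, start:start + new_t]
else:
    grid = torch.nn.functional.interpolate(
        grid, size=(12, new_t), mode='bilinear')

new_pos_embed = grid.reshape(1, 768, 12 * new_t).transpose(1, 2)
print(new_pos_embed.shape)                           # torch.Size([1, 108, 768])
```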

In our ESC-50 recipe, we show an example of fine-tuning an AST model pretrained on 10-second audio with 5-second audio.

-Yuan