YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Audio length 1s #87

Closed 9B8DY6 closed 1 year ago

9B8DY6 commented 1 year ago

Is it okay to extract audio features from clips that are only 1~3 s long? The fbank shape is (139, 128), which means n_frames = 139.

YuanGongND commented 1 year ago

Yes, the AST model certainly supports 1 s audio: our SpeechCommands recipe runs on 1 s clips and achieves state-of-the-art performance.

If you use our training pipeline, you just need to change https://github.com/YuanGongND/ast/blob/97e57e7852809c6bc825c87a59f07635138cc43d/egs/speechcommands/run_sc.sh#L34 to your desired n_frames (139 in your case). For optimal performance, you might want to set https://github.com/YuanGongND/ast/blob/97e57e7852809c6bc825c87a59f07635138cc43d/egs/speechcommands/run_sc.sh#L21 to True and do other tuning; please check the readme file for details.
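
To sanity-check an n_frames value against a clip length, here is a minimal sketch of Kaldi-style frame counting. It assumes 16 kHz audio, a 25 ms window, a 10 ms frame shift, and no edge padding; these values and the helper name `kaldi_num_frames` are assumptions for illustration, so check them against your own fbank settings.

```python
def kaldi_num_frames(num_samples, sample_rate=16000,
                     frame_length_ms=25.0, frame_shift_ms=10.0):
    """Frame count for Kaldi-style framing without edge padding.

    Assumed defaults: 16 kHz audio, 25 ms window, 10 ms shift.
    """
    window = int(sample_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)    # samples between frame starts
    if num_samples < window:
        return 0  # clip is shorter than a single analysis window
    return 1 + (num_samples - window) // shift


# Print the frame count for a few clip lengths.
for seconds in (1.0, 3.0, 10.0):
    n = kaldi_num_frames(int(seconds * 16000))
    print(f"{seconds} s -> {n} frames")
```

Under these assumptions, 139 frames corresponds to a clip of roughly 1.4 s.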

-Yuan

9B8DY6 commented 1 year ago

@YuanGongND If the audio is much shorter than 10 s, e.g. 1~3 s, do I have to pretrain AST from scratch? I just want to use the pretrained AST model to extract audio tokens.

YuanGongND commented 1 year ago

In my experience, AudioSet pretraining does not hurt performance in almost all cases, so you can certainly try setting audiosetpretrain=True and imagenetpretrain=True, like we did for the ESC-50 recipe. You can use AudioSet pretraining whether your target audio is shorter or longer than 10 s; we adapt the positional embedding to fit the length internally in the model (see https://github.com/YuanGongND/ast/blob/5f50e009591748169172342303055bf88c282b8d/src/models/ast_models.py#L143-L147). A fine-tuning stage is crucial for AST to reach optimal performance.
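
As a rough illustration of how a positional embedding can be fitted to a different input length, the sketch below computes the patch grid and decides between cropping and interpolating along the time axis. It assumes 16x16 patches with stride 10 and a pretrained grid of 12x101 patches (128 mel bins, 1024 frames). These numbers and the helper names are assumptions for illustration, not a copy of the linked lines.

```python
def patch_grid(n_mels, n_frames, patch=16, fstride=10, tstride=10):
    """Number of patches along frequency and time for a strided patch split."""
    f_dim = (n_mels - patch) // fstride + 1
    t_dim = (n_frames - patch) // tstride + 1
    return f_dim, t_dim


def adapt_time_positions(t_dim, pretrained_t_dim=101):
    """Plan how to fit pretrained time positions to t_dim patches.

    Returns ('crop', start, end) when the target is shorter (keep the
    centre slice), otherwise ('interpolate', pretrained_t_dim, t_dim).
    """
    if t_dim < pretrained_t_dim:
        start = pretrained_t_dim // 2 - t_dim // 2
        return ("crop", start, start + t_dim)
    return ("interpolate", pretrained_t_dim, t_dim)


f_dim, t_dim = patch_grid(n_mels=128, n_frames=139)
print(f_dim, t_dim, adapt_time_positions(t_dim))
```

The idea is that a shorter input reuses a centred slice of the pretrained time positions, while a longer input stretches them by interpolation.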

If you want to freeze the AST model and extract features, it might be better to pad your input to 10 s. I would suggest using this inference script: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb, but instead of taking the last layer output (the prediction logits), take the penultimate layer output as the feature. It should be a relatively easy modification, and you don't need to worry about your input length: the script loads the audio and pads it to 10 s.
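
The padding step can be sketched as follows. This is a dependency-free illustration on plain lists, assuming a target of 1024 frames for 10 s input and zero-valued padding; the helper name `pad_or_truncate` is hypothetical, and in practice you would do the same on tensors. Feature extraction itself is not shown.

```python
def pad_or_truncate(fbank, target_frames=1024, pad_value=0.0):
    """Pad with constant rows (or cut) so the result has target_frames rows.

    fbank is a list of frames, each frame a list of mel-bin values.
    target_frames=1024 and pad_value=0.0 are assumed defaults.
    """
    if not fbank:
        raise ValueError("fbank must contain at least one frame")
    n_mels = len(fbank[0])
    # Copy at most target_frames rows, then append padding rows if short.
    padded = [list(row) for row in fbank[:target_frames]]
    padded.extend([pad_value] * n_mels for _ in range(target_frames - len(padded)))
    return padded


short_clip = [[1.0] * 128 for _ in range(139)]  # dummy (139, 128) fbank
fixed = pad_or_truncate(short_clip)
print(len(fixed), len(fixed[0]))
```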

-Yuan

9B8DY6 commented 1 year ago

How about your pretrained model? Does it also work well on short audio?

YuanGongND commented 1 year ago

By audiosetpretrain=True, I meant our pretrained model.

For end-to-end fine-tuning, it works well for shorter audio; please see our ESC-50 recipe.

For freezing the model and extracting features, I think padding to a longer audio length is the best choice; please check the colab script.