Closed 9B8DY6 closed 1 year ago
Yes, the AST model certainly supports 1s audio: our SpeechCommands recipe runs on 1s clips and achieves state-of-the-art performance.
If you use our training pipeline, you just need to change https://github.com/YuanGongND/ast/blob/97e57e7852809c6bc825c87a59f07635138cc43d/egs/speechcommands/run_sc.sh#L34 to your desired n_frames, in your case, 139. For optimal performance, you might want to set https://github.com/YuanGongND/ast/blob/97e57e7852809c6bc825c87a59f07635138cc43d/egs/speechcommands/run_sc.sh#L21 to True and do other tuning; please check the readme file for details.
-Yuan
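For reference, n_frames can be estimated from the clip length. This is a minimal sketch, assuming 16 kHz audio and Kaldi-style framing with a 25 ms window and a 10 ms shift; the function name and its defaults are illustrative and not taken from the recipe.

```python
def estimate_n_frames(num_samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Estimate the number of fbank frames for a clip (assumed framing, see above)."""
    window = sample_rate * window_ms // 1000  # samples per analysis window
    shift = sample_rate * shift_ms // 1000    # samples between frame starts
    if num_samples < window:
        return 0  # too short to fit a single full window
    # Only frames that fit entirely inside the signal are counted.
    return 1 + (num_samples - window) // shift


one_second = estimate_n_frames(16000)
question_clip = estimate_n_frames(22480)
```

Under these assumptions a 1s clip gives 98 frames, and 139 frames corresponds to roughly 1.4s of audio.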
@YuanGongND If the audio length is much shorter than 10s, like 1s~3s, do I have to pretrain AST from scratch? I just want to use the pretrained AST model to extract audio tokens.
In my experience, AudioSet pretraining does not hurt the performance in almost all cases, so you can certainly try setting audiosetpretrain=True and imagenetpretrain=True, like we did for the ESC-50 recipe. You can use AudioSet pretraining whether your target audio is shorter or longer than 10s; we adapt the positional embedding to fit the length internally in the model: https://github.com/YuanGongND/ast/blob/5f50e009591748169172342303055bf88c282b8d/src/models/ast_models.py#L143-L147. A fine-tuning stage is crucial for AST to achieve the optimal solution.
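To illustrate what adapting a positional embedding to a different length can look like, here is a minimal NumPy sketch. It assumes the embedding is stored as a (dim, freq_patches, time_patches) grid, center-crops the time axis for shorter inputs, and linearly interpolates it for longer ones. The function name and the grid sizes are hypothetical, and the linked lines in ast_models.py may differ in detail.

```python
import numpy as np


def adapt_time_axis(pos_embed, new_t):
    """Resize the time axis of a (dim, f, t) positional-embedding grid to new_t."""
    dim, f, t = pos_embed.shape
    if new_t <= t:
        # Shorter input: keep the central new_t time positions.
        start = (t - new_t) // 2
        return pos_embed[:, :, start:start + new_t]
    # Longer input: linearly interpolate each (dim, f) row along time.
    old_x = np.linspace(0.0, 1.0, t)
    new_x = np.linspace(0.0, 1.0, new_t)
    flat = pos_embed.reshape(-1, t)
    out = np.stack([np.interp(new_x, old_x, row) for row in flat])
    return out.reshape(dim, f, new_t)


# Illustrative grid: 2 channels, 3 frequency patches, 10 time patches.
grid = np.arange(2 * 3 * 10, dtype=float).reshape(2, 3, 10)
shorter = adapt_time_axis(grid, 4)   # center crop
longer = adapt_time_axis(grid, 20)   # interpolation
```

Either way the frequency axis is left untouched, so only the time resolution of the embedding changes.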
If you want to freeze the AST model and get the features, it might be better to pad your input to 10s. I would suggest using this inference script: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb, but instead of getting the last layer output (prediction logits), get the penultimate layer output as the feature. It should be a relatively easy modification, and you don't need to worry about your input: the script loads the audio and pads it to 10s.
-Yuan
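The padding step described above can be sketched as follows. This is a minimal NumPy illustration, assuming a (n_frames, n_mels) fbank, zero padding at the end, and a 10s target of 1024 frames; the helper name is hypothetical and the Colab script may handle this differently.

```python
import numpy as np


def pad_or_truncate(fbank, target_length=1024):
    """Zero-pad (or cut) a (n_frames, n_mels) fbank to target_length frames."""
    n_frames = fbank.shape[0]
    if n_frames >= target_length:
        return fbank[:target_length]
    pad_rows = target_length - n_frames
    # Pad only at the end of the time axis; the mel axis is unchanged.
    return np.pad(fbank, ((0, pad_rows), (0, 0)), mode="constant")


# Stand-in for a real fbank with the shape mentioned in this thread.
short_fbank = np.ones((139, 128), dtype=np.float32)
padded = pad_or_truncate(short_fbank)
```

The padded array has shape (1024, 128), with the original 139 frames first and zeros after them.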
How about your pretrained model? Does it also work well on short audio?
By audiosetpretrain=True, I meant our pretrained model.
For end-to-end fine-tuning, it works well for shorter audio; please see our ESC-50 recipe.
For freezing the model and extracting features, I think padding to a longer length is the best choice; please check the Colab script.
Is it okay to extract audio features from audio whose length is 1~3s? Its fbank shape is (139, 128), which means that n_frames = 139.