YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

demo for testing the single audiofile with the trained model #18

Closed joewale closed 2 years ago

joewale commented 2 years ago

Hi, Yuan. Is there code or a demo to test a single audio file with the trained model?

YuanGongND commented 2 years ago

Hi there,

Sorry but I am quite busy these days and won't be able to do that soon. But I think it should be quite straightforward to implement. It is of course welcome if you could create one and make a pull request.

-Yuan

joewale commented 2 years ago

I see. I am writing the demo for a single audio file and will make a pull request later.

Also, Yuan: during the training and validation stages, are only target_length feature frames of each audio file input to the network, no matter how long the audio file is?

YuanGongND commented 2 years ago

Hi there,

@JeffC0628 just submitted a pull request; maybe you can take a look?

I am not sure I understand your question. In the training and validation stages, we do not input the target_length to the network (see here); instead, we initialize the network with a fixed target_length and cut/pad every audio clip to that length.
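The cut/pad step described above can be sketched as follows. This is a minimal NumPy illustration, not the repo's actual implementation (which pads the torch spectrogram tensor); the default `target_length=1024` frames is the AudioSet setting, and `pad_or_trim` is a hypothetical helper name.

```python
import numpy as np

def pad_or_trim(fbank, target_length=1024):
    """Force a (n_frames, n_mels) filterbank matrix to exactly
    target_length frames: zero-pad short clips at the end,
    cut long clips after target_length frames."""
    n_frames, n_mels = fbank.shape
    if n_frames < target_length:
        pad = np.zeros((target_length - n_frames, n_mels), dtype=fbank.dtype)
        return np.concatenate([fbank, pad], axis=0)
    return fbank[:target_length]
```

So a short clip and a long clip both reach the network with the same time dimension, matching the fixed target_length the model was initialized with.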

-Yuan

joewale commented 2 years ago

got it.

Hi, Yuan. I find the prediction latency for an audio file is large on CPU: it takes 14 seconds for a 3-minute audio file.

YuanGongND commented 2 years ago

Hi there,

AST is not designed to work with 3-minute audio files. You need to split the audio into smaller chunks (e.g., 10 s with some overlap); inference on those should be reasonably fast on CPU, though of course it is faster on GPUs.
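The chunking suggested above can be sketched like this. A minimal example, assuming 16 kHz audio and a hypothetical helper `chunk_indices`; the 10 s chunk length and 1 s overlap are just illustrative values, not settings from the repo.

```python
def chunk_indices(num_samples, sr=16000, chunk_s=10.0, overlap_s=1.0):
    """Return (start, end) sample indices covering the whole signal
    with fixed-length chunks that overlap by overlap_s seconds."""
    chunk = int(chunk_s * sr)
    hop = int((chunk_s - overlap_s) * sr)
    if num_samples <= chunk:
        return [(0, num_samples)]
    starts = range(0, num_samples - chunk + hop, hop)
    return [(s, min(s + chunk, num_samples)) for s in starts]
```

Each (start, end) slice of the waveform would then be converted to a spectrogram and run through the model independently; the per-chunk predictions can be averaged (or otherwise pooled) for a clip-level result.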

-Yuan

joewale commented 2 years ago

I see, thanks.