Closed JeffC0628 closed 2 years ago
Hi Jeff,
Thanks. Which part of the output makes you think it is not correct?
There might be other issues, but I noticed that you didn't normalize your fbank, which will likely lead to wrong results. If you use the AudioSet pretrained model, please apply:
fbank = (fbank - (-4.2677393)) / (4.5689974 * 2)
before returning fbank.
Also, it is highly encouraged to set audioset_pretrain=True
when initializing the AST model rather than manually loading the state_dict. It should be fine in your case, but if your target length is not 1024, the first method will automatically adjust the positional embedding for you.
Please let me know if that helps.
-Yuan
That's the point I missed; it's better now, thanks.
Hi, why do you switch the torchaudio backend for inference? #torchaudio.set_audio_backend("soundfile")
@tsw123tsw
I guess it might be related to some system package. But it is really not necessary; our official inference sample does not switch the backend. https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb
-Yuan
hi, yuan: I have written a pretty simple script to verify the tags of a single wave file, but the result does not seem right. Could you help point out the mistake?
and the output:
Speech: 0.1906
Music: 0.0481
Inside, small room: 0.0245
Musical instrument: 0.0100
Silence: 0.0088
Sound effect: 0.0074
Outside, rural or natural: 0.0064
Animal: 0.0058
Outside, urban or manmade: 0.0045
Inside, large room or hall: 0.0041
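For reference, a top-k printout like the one above can be produced from the model's raw logits roughly as follows; a minimal sketch with a hypothetical three-entry label list (the real script reads the 527 AudioSet class names from class_labels_indices.csv, and AudioSet tagging is multi-label, so a sigmoid rather than a softmax is applied to the logits):

```python
import torch

# Hypothetical labels; the real list has 527 AudioSet classes.
labels = ['Speech', 'Music', 'Silence']

def top_tags(logits, labels, k=10):
    # Multi-label tagging: independent sigmoid per class.
    probs = torch.sigmoid(logits)
    top = torch.topk(probs, k=min(k, len(labels)))
    return [(labels[i], float(p)) for p, i in zip(top.values, top.indices)]

# Illustrative logits, e.g. model(fbank.unsqueeze(0)).squeeze(0)
logits = torch.tensor([2.0, -1.0, -3.0])
for name, p in top_tags(logits, labels):
    print(f'{name}: {p:.4f}')
```

If the printed probabilities look uniformly low across unrelated classes, as in the output above, the usual culprit is a preprocessing mismatch such as the missing fbank normalization discussed earlier in this thread.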