YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Where can I download the ImageNet pretrained model? #3

Closed joewale closed 3 years ago

joewale commented 3 years ago

Hi YuanGongND, can you share the URL of the ImageNet pretrained model?

YuanGongND commented 3 years ago

Hi there,

To use the ImageNet pretrained model for audio/speech tasks, you just need to set imagenet_pretrain=True when you initialize the AST model. The timm package will automatically download it for you, and my code will adapt it to the audio/speech task (see Section 2.2 of the paper). You don't need to use the URL explicitly.
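A minimal sketch of that initialization follows. Only imagenet_pretrain=True comes from this reply. The class name ASTModel, the import path `models`, and the other keyword names and values (label_dim, audioset_pretrain, model_size) are assumptions about the repo's code, so check them against the source before relying on them.

```python
# Keyword arguments for the AST constructor.
# Only "imagenet_pretrain" is stated in the thread; the remaining names and
# values are assumptions and should be checked against the repository.
AST_KWARGS = {
    "label_dim": 527,            # assumed: number of output classes
    "imagenet_pretrain": True,   # from the thread: timm downloads the DeiT weights
    "audioset_pretrain": False,  # assumed flag name
    "model_size": "base384",     # assumed parameter name for the base384 size
}


def build_ast(**overrides):
    """Build an AST model, letting the caller override any default kwarg."""
    # Imported lazily: this needs torch, timm and the AST repo's src/ directory
    # on sys.path. The import path is an assumption.
    from models import ASTModel

    kwargs = {**AST_KWARGS, **overrides}
    return ASTModel(**kwargs)
```

Calling build_ast(label_dim=10), for example, would keep the ImageNet initialization and change only the number of output classes, assuming that keyword name matches the repo.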

If you simply want to know the URL of the ImageNet pretrained model, for the base DeiT model we used in the paper, the URL is https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth .

-Yuan

YuanGongND commented 3 years ago

Btw, all AudioSet pretrained models are actually (ImageNet + AudioSet) pretrained models.

joewale commented 3 years ago

Hi YuanGongND, thanks for your quick reply. My machine has no network access, so I want to download the ImageNet pretrained model for the base384 model size manually.

YuanGongND commented 3 years ago

I see. You can download the model using the link I provided and put it at $TORCH_HOME/hub/checkpoints/deit_base_distilled_patch16_384-d0272ac0.pth. Then, when you set imagenet_pretrain=True while initializing the AST model, the timm package should skip the download and load the model from the local file.
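The placement step can be sketched with the standard library alone. The helper names below are hypothetical. The fallback used when TORCH_HOME is unset ($XDG_CACHE_HOME/torch, else ~/.cache/torch) is an assumption meant to mirror the usual torch.hub default. On an offline machine, run only checkpoint_path() to see where the file belongs, and copy in a file downloaded elsewhere.

```python
import os
import urllib.request
from pathlib import Path

# URL of the DeiT base checkpoint quoted earlier in this thread.
DEIT_URL = (
    "https://dl.fbaipublicfiles.com/deit/"
    "deit_base_distilled_patch16_384-d0272ac0.pth"
)


def torch_checkpoint_dir(env=None):
    """Return <TORCH_HOME>/hub/checkpoints for the given environment mapping."""
    env = os.environ if env is None else env
    torch_home = env.get("TORCH_HOME")
    if not torch_home:
        # Assumed fallback: $XDG_CACHE_HOME/torch, else ~/.cache/torch.
        cache_home = env.get("XDG_CACHE_HOME") or os.path.join("~", ".cache")
        torch_home = os.path.join(cache_home, "torch")
    return Path(torch_home).expanduser() / "hub" / "checkpoints"


def checkpoint_path(url=DEIT_URL, env=None):
    """Return the local path where the checkpoint for `url` should be placed."""
    filename = url.rsplit("/", 1)[-1]
    return torch_checkpoint_dir(env) / filename


def fetch_checkpoint(url=DEIT_URL, env=None):
    """Download the checkpoint unless it is already in place (needs network)."""
    target = checkpoint_path(url, env)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))
    return target


if __name__ == "__main__":
    print(checkpoint_path())
```

Running the script prints the expected location without touching the network. fetch_checkpoint() is only useful on a machine that can reach the URL.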

joewale commented 3 years ago

Got it, thanks! I ran the code with my dataset, and the log printed while loading the pretrained model is shown below. Is it right? [image]

YuanGongND commented 3 years ago

It is correct, and you are using the AudioSet pretrained model (which is actually an AudioSet + ImageNet pretrained model). I do recommend using this model for all tasks EXCEPT when your dataset is AudioSet itself.

The reason you see 'AST Model Summary' twice is that the code internally initializes a model without any pretraining and then loads the AudioSet pretrained model, so this is the expected behavior.

joewale commented 3 years ago

Got it, thanks a lot. Is there code or a demo for testing a single audio file with the trained model?

YuanGongND commented 3 years ago

There's no such demo yet, but I will add one when I have some time.

joewale commented 3 years ago

OK, great! I will give it a try. I'm looking forward to your demo. Thanks a lot.