YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

ImageNet classifier is not terminated in Audioset pretrained models. #34

saghiralfasly closed this issue 2 years ago

saghiralfasly commented 2 years ago

Hi Yuan Gong, thank you for sharing your work. It is clear and easy to run. I am wondering about the ImageNet classifier weights: they still exist in the AudioSet pretrained models. Were they also trained? Here is the last displayed part of the pretrained "audioset_10_10_0.4593.pth":

```
module.v.head.weight       torch.Size([1000, 768])
module.v.head.bias         torch.Size([1000])
module.v.head_dist.weight  torch.Size([1000, 768])
module.v.head_dist.bias    torch.Size([1000])
module.mlp_head.0.weight   torch.Size([768])
module.mlp_head.0.bias     torch.Size([768])
module.mlp_head.1.weight   torch.Size([527, 768])
module.mlp_head.1.bias     torch.Size([527])
```

They can be skipped with:

```python
self.v.head = nn.Identity()
self.v.head_dist = nn.Identity()
```

Now I want to use the pretrained AudioSet model for another task, but I am worried that eliminating this part will affect performance. Although I think these heads are not connected to the final 527-class AudioSet classifier.

Thank you again

YuanGongND commented 2 years ago

Hi there,

You are very correct on this point - we use the same solution for our later models. You can safely eliminate anything in self.v.head (but not self.mlp_head).
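As a sanity check that the replacement is safe, here is a minimal, hypothetical sketch (a toy stand-in for the AST structure, not the real model code): the AudioSet prediction flows only through `mlp_head`, so swapping `v.head` and `v.head_dist` for `nn.Identity()` shrinks the parameter count without changing the output.

```python
import torch
import torch.nn as nn

# Toy stand-in for the AST layout (illustration only): `v` mimics the DeiT
# backbone with its two ImageNet heads, `mlp_head` is the 527-class AudioSet
# classifier seen in the checkpoint listing above.
class TinyBackbone(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.head = nn.Linear(dim, 1000)       # ImageNet classifier head
        self.head_dist = nn.Linear(dim, 1000)  # DeiT distillation head

class TinyAST(nn.Module):
    def __init__(self, dim=768, n_classes=527):
        super().__init__()
        self.v = TinyBackbone(dim)
        self.mlp_head = nn.Sequential(nn.LayerNorm(dim),
                                      nn.Linear(dim, n_classes))

    def forward(self, x):
        # Only mlp_head is on the AudioSet prediction path;
        # v.head / v.head_dist are dead ends here.
        return self.mlp_head(x)

model = TinyAST()
n_before = sum(p.numel() for p in model.parameters())

x = torch.randn(2, 768)
out_before = model(x)

# Drop the unused ImageNet heads.
model.v.head = nn.Identity()
model.v.head_dist = nn.Identity()

n_after = sum(p.numel() for p in model.parameters())
out_after = model(x)

removed = n_before - n_after  # 2 heads x (768*1000 weights + 1000 biases)
unchanged = torch.equal(out_before, out_after)
```

Here `removed` is 1,538,000 parameters and `unchanged` is `True`: the heads contribute nothing to the AudioSet output.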

Finetuning the AudioSet pretrained model for new tasks is entirely possible, and we recommend trying it for any audio/speech task. Please read the README introduction for how to do this.
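If you replace the heads with `nn.Identity()`, the checkpoint's `v.head.*` entries no longer match any parameters. One common pattern (a hedged sketch, not the repository's own loading code; the toy dict below stands in for the real `torch.load(...)` result, with the DataParallel `module.` prefix already stripped) is to filter those keys out of the state dict before loading:

```python
import torch

# Toy stand-in for the checkpoint state dict; key names follow the
# listing above ("module." prefix removed). In practice this would come
# from torch.load("audioset_10_10_0.4593.pth", map_location="cpu").
sd = {
    "v.head.weight": torch.zeros(1000, 768),
    "v.head.bias": torch.zeros(1000),
    "v.head_dist.weight": torch.zeros(1000, 768),
    "v.head_dist.bias": torch.zeros(1000),
    "mlp_head.1.weight": torch.zeros(527, 768),
    "mlp_head.1.bias": torch.zeros(527),
}

# Keep everything except the two ImageNet heads.
filtered = {k: v for k, v in sd.items()
            if not k.startswith(("v.head.", "v.head_dist."))}

kept = sorted(filtered)  # only the mlp_head entries remain
```

The filtered dict can then be loaded with `model.load_state_dict(filtered, strict=False)`, keeping `mlp_head` intact as advised above.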

-Yuan

YuanGongND commented 2 years ago

Eliminating the ImageNet classifier can make the model slightly smaller, but performance-wise it is the same.