YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Question Regarding Activation on MLP Head #61

Closed arshinmar closed 2 years ago

arshinmar commented 2 years ago

Hello,

Thanks for creating such an interesting paper! This adaptation of an off-the-shelf Vision Transformer to a spectrogram transformer is truly fascinating, backed by great results!

I was wondering why a plain linear layer (with no activation) was chosen for the MLP head of the AST model? It seems there is only a linear operation.

Since most multi-class classification models in CV end in a softmax activation, it would be interesting to know the rationale for this choice.

I also noticed that the paper states, "A linear layer with sigmoid activation maps the audio spectrogram representation to labels for classification." However, there is no sigmoid activation in the MLP head portion of the ast_models.py file.

It would be greatly appreciated if you could clarify the above questions.

Thanks, Arsh

YuanGongND commented 2 years ago

Hi Arsh,

Thanks for your kind words.

The activation is in the loss function.

- For multi-label classification (more than one label per audio clip), BCE loss with sigmoid is used. https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L65

- For multi-class classification, CE loss with softmax is used. https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L71
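To illustrate the point (a minimal sketch, not the actual AST training code): in PyTorch, `nn.BCEWithLogitsLoss` folds the sigmoid into the loss and `nn.CrossEntropyLoss` folds in the (log-)softmax, so the model head can stay purely linear:

```python
import torch
import torch.nn as nn

# Raw (linear) outputs from the model head -- no activation applied.
logits = torch.tensor([[1.2, -0.7, 0.3]])

# Multi-label case: BCEWithLogitsLoss applies the sigmoid internally,
# equivalent to Sigmoid + BCELoss but numerically more stable.
multi_label_targets = torch.tensor([[1.0, 0.0, 1.0]])
bce = nn.BCEWithLogitsLoss()(logits, multi_label_targets)
manual_bce = nn.BCELoss()(torch.sigmoid(logits), multi_label_targets)
assert torch.allclose(bce, manual_bce)

# Multi-class case: CrossEntropyLoss applies log-softmax internally,
# equivalent to LogSoftmax + NLLLoss.
multi_class_target = torch.tensor([0])
ce = nn.CrossEntropyLoss()(logits, multi_class_target)
manual_ce = nn.NLLLoss()(torch.log_softmax(logits, dim=1), multi_class_target)
assert torch.allclose(ce, manual_ce)
```

This is also why a sigmoid/softmax must be applied explicitly at inference time if probabilities (rather than logits) are needed.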

-Yuan