Closed arshinmar closed 2 years ago
Hi Arsh,
Thanks for your kind words.
The activation is built into the loss function.
-For multi-label classification (more than one label per audio clip), BCE loss with a sigmoid activation is used. https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L65
-For multi-class classification (exactly one label per clip), CE loss with a softmax activation is used. https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L71
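To illustrate why the model head needs no explicit activation, here is a minimal sketch (not code from the AST repo; the tensors are illustrative) showing that PyTorch's `BCEWithLogitsLoss` applies the sigmoid internally, and `CrossEntropyLoss` applies the (log-)softmax internally, so both operate directly on raw linear-layer logits:

```python
import torch
import torch.nn as nn

# Hypothetical raw logits, as produced by a linear classification head
# with no activation (shapes are illustrative only).
logits = torch.tensor([[1.2, -0.7, 0.3]])

# Multi-label case: BCEWithLogitsLoss = sigmoid + BCE in one call,
# so no sigmoid is needed in the model head itself.
multi_label_targets = torch.tensor([[1.0, 0.0, 1.0]])
bce = nn.BCEWithLogitsLoss()(logits, multi_label_targets)
manual_bce = nn.BCELoss()(torch.sigmoid(logits), multi_label_targets)
assert torch.allclose(bce, manual_bce)

# Multi-class case: CrossEntropyLoss = log-softmax + NLL in one call.
multi_class_target = torch.tensor([0])
ce = nn.CrossEntropyLoss()(logits, multi_class_target)
manual_ce = nn.NLLLoss()(torch.log_softmax(logits, dim=-1), multi_class_target)
assert torch.allclose(ce, manual_ce)
```

Fusing the activation into the loss is also numerically more stable than applying a separate sigmoid/softmax before the loss, which is the standard PyTorch idiom.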
-Yuan
Hello,
Thanks for creating such an interesting paper! This adaptation of an off-the-shelf Vision Transformer to a spectrogram transformer is truly fascinating, backed by great results!
I was wondering why a simple linear layer (with no activation) was chosen for the MLP head of the AST model. It seems there is only a linear operation.
Since most multi-class models in computer vision end with a softmax activation, it would be interesting to know the rationale for this choice.
I also noticed that the paper states that "A linear layer with sigmoid activation maps the audio spectrogram representation to labels for classification." However, there is no sigmoid activation in the MLP head portion of the ast_models.py file.
It would be greatly appreciated if you could clarify the above questions.
Thanks, Arsh