Closed arshinmar closed 2 years ago
Hi Arsh,
Thanks for your kind words.
The activation is built into the loss function.
-For multi-label classification (more than one label per audio clip), BCE loss with a sigmoid activation is used. https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L65
-For multi-class classification (exactly one label per clip), CE loss with a softmax activation is used. https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/traintest.py#L71
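To illustrate why the model head needs no explicit activation, here is a minimal sketch (not code from the AST repo; the tensors are illustrative) showing that PyTorch's `BCEWithLogitsLoss` applies the sigmoid internally, and `CrossEntropyLoss` applies the (log-)softmax internally, so both operate directly on raw linear-layer logits:

```python
import torch
import torch.nn as nn

# Hypothetical raw logits, as produced by a linear classification head
# with no activation (shapes are illustrative only).
logits = torch.tensor([[1.2, -0.7, 0.3]])

# Multi-label case: BCEWithLogitsLoss = sigmoid + BCE in one call,
# so no sigmoid is needed in the model head itself.
multi_label_targets = torch.tensor([[1.0, 0.0, 1.0]])
bce = nn.BCEWithLogitsLoss()(logits, multi_label_targets)
manual_bce = nn.BCELoss()(torch.sigmoid(logits), multi_label_targets)
assert torch.allclose(bce, manual_bce)

# Multi-class case: CrossEntropyLoss = log-softmax + NLL in one call.
multi_class_target = torch.tensor([0])
ce = nn.CrossEntropyLoss()(logits, multi_class_target)
manual_ce = nn.NLLLoss()(torch.log_softmax(logits, dim=-1), multi_class_target)
assert torch.allclose(ce, manual_ce)
```

Fusing the activation into the loss is also numerically more stable than applying a separate sigmoid/softmax before the loss, which is the standard PyTorch idiom.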
-Yuan
Hello,
Thanks for creating such an interesting paper! This adaptation of an off-the-shelf Vision Transformer to a spectrogram transformer is truly fascinating, backed by great results!
I was wondering why a simple linear layer (with no activation) was chosen for the MLP head of the AST model. It seems there is only a linear operation.
Since most multi-class models in computer vision end with a softmax activation, it would be interesting to know the rationale for this choice.
I also noticed that the paper states that "A linear layer with sigmoid activation maps the audio spectrogram representation to labels for classification." However, there is no sigmoid activation in the MLP head portion of the ast_models.py file.
It would be greatly appreciated if you could clarify the above questions.
Thanks, Arsh