YuanGongND / ssast

Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Model fails to converge on transfer to audio backtesting problem #19

Open yangma12 opened 1 year ago

yangma12 commented 1 year ago

Dear Yuan and authors, first of all, thank you for your paper. Recently, I transferred your pre-trained model to a regression task for personality computing. After attaching several fully connected layers to the end of your original model, the predicted values stay within a very small interval during training and never change meaningfully. Have you done any related regression experiments? What might cause this problem? Sorry to bother you with my question, and thank you very much for reading it.

yang

YuanGongND commented 1 year ago

hi there,

Do you mean you finetune our pretrained model for a regression task?

What do you mean by this?

After splicing several fully connected layers after your original model

-Yuan

yangma12 commented 1 year ago

Thank you for your reply! I mainly use this dataset for fine-tuning, extracting the audio from it (https://chalearnlap.cvc.uab.cat/dataset/24/description/). Each sample is a 15-second speech clip, and an MLP is attached after the model to map the model's final output to shape (batch_size, 5), where 5 corresponds to the regression values of the five personality traits for each audio clip.

yangma12 commented 1 year ago

In the experiment, I tried adjusting the learning rate and other parameters, removing the masking and mixup in the data preprocessing, setting `input_tdim` to 1530 to suit my audio length and `label_dim` to 512, and finally performing the regression prediction through the following code:

```python
nn.Sequential(
    nn.Linear(in_features=512, out_features=256),
    nn.ReLU(inplace=True),
    nn.Linear(in_features=256, out_features=128),
    nn.ReLU(inplace=True),
    nn.Linear(in_features=128, out_features=6),
    nn.Sigmoid(),
)
```

Forgive me, my deep learning knowledge is not very deep at the moment; I'm not sure where the problem might be.

YuanGongND commented 1 year ago

There are a few things:

  1. First, this seems to be a multi-modal, speech-dominated dataset, so you might want to try an audio-visual model or a speech-based model (e.g., HuBERT). In my experience, for pure speech tasks, pure speech models are better; see Table 5 of the SSAST paper. For audio-visual models, we have CAV-MAE as a general audio-visual model, but again, you might need a model focusing on the face.

  2. For this:

```python
nn.Sequential(
    nn.Linear(in_features=512, out_features=256),
    nn.ReLU(inplace=True),
    nn.Linear(in_features=256, out_features=128),
    nn.ReLU(inplace=True),
    nn.Linear(in_features=128, out_features=6),
    nn.Sigmoid(),
)
```

Is Sigmoid common for regression? Setting `label_dim` to 512 (for classification) and then adding a few dense layers seems redundant. You can just change the last MLP layer to a regression head.

https://github.com/YuanGongND/ssast/blob/a1a3eecb94731e226308a6812f2fbf268d789caf/src/models/ast_models.py#L166-L167
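A minimal sketch of what "change the last MLP layer to a regression head" could look like, assuming the encoder embedding dimension is 768 (the name `RegressionHead` and the dimensions are illustrative, not from the repo):

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Hypothetical drop-in replacement for the classification MLP head:
    LayerNorm followed by a single linear layer, with no Sigmoid, so the
    outputs are unbounded regression values."""
    def __init__(self, embed_dim=768, n_targets=5):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, n_targets)

    def forward(self, x):
        return self.head(self.norm(x))

head = RegressionHead()
out = head(torch.randn(4, 768))   # batch of 4 pooled encoder outputs
print(out.shape)                  # torch.Size([4, 5])
```

Here `n_targets=5` matches the five personality-trait values, and the head would typically be trained with an MSE-style loss.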

But I know very little about your task; you will need to tune the parameters yourself. For some networks, we use a larger learning rate for the MLP layer because it is randomly initialized while the other parameters are pretrained.
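The per-layer learning-rate idea above can be expressed with PyTorch optimizer parameter groups; this is a sketch where `encoder` and `head` are stand-in modules, not the actual SSAST model:

```python
import torch
import torch.nn as nn

# Stand-ins: in practice these would be the pretrained backbone
# and the newly initialized regression head.
encoder = nn.Linear(768, 768)
head = nn.Linear(768, 5)

# Smaller learning rate for pretrained weights, larger (here 10x)
# for the randomly initialized head.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-4},
])

lrs = [g["lr"] for g in optimizer.param_groups]
print(lrs)  # [1e-05, 0.0001]
```

The 10x ratio is only an example; the right ratio is task-dependent and has to be tuned.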

I mainly answer questions that are related to what we presented in the paper, and it is hard for me to answer questions regarding new task / usage of the model.

-Yuan

YuanGongND commented 1 year ago

Another minor point: you said there are 5 regression values, but `nn.Linear(in_features=128, out_features=6)` outputs 6.