TideDancer / interspeech21_emotion


Why not initializing with pretrained wav2vec2 weights #9

Closed khalidhuseynov closed 2 years ago

khalidhuseynov commented 2 years ago

First of all thanks for the paper and opening up the code!

The code seems to reinitialize the wav2vec2 weights here and train the model from scratch. So I was wondering about the authors' opinion on why not initialize the model with pretrained wav2vec2 weights and fine-tune from there. I believe it could generalize better to other datasets and converge faster. I may try it myself, but I was wondering whether it had been tried/tested before.
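For reference, the general pattern being suggested is to copy pretrained weights into a model instead of keeping its random initialization. A minimal PyTorch sketch (the toy `TinyEncoder` module stands in for the actual HuggingFace Wav2Vec2 model and is purely illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for an encoder; the real model in this repo is a Wav2Vec2
# module from HuggingFace transformers. Names here are illustrative only.
class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 16)

pretrained = TinyEncoder()   # imagine this holds pretrained wav2vec2 weights
fresh = TinyEncoder()        # randomly re-initialized, i.e. "from scratch"

# Copy the pretrained weights into the new model instead of training from scratch
fresh.load_state_dict(pretrained.state_dict())
# after this, fresh and pretrained have identical parameters
```

With the real library this whole pattern collapses into a single `Wav2Vec2Model.from_pretrained(...)` call, which is what the follow-up comment below confirms.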

khalidhuseynov commented 2 years ago

I've checked the transformers library a bit more closely, and it seems the weights are overridden afterwards inside the from_pretrained function. So you can ignore my question above. I had a couple of follow-up questions, though.

Thanks in advance!

TideDancer commented 2 years ago

Thank you for your interest in this work.

  1. I didn't try wav2vec2-base-960h. It is fine-tuned on ASR data, so it could perhaps help on our ASR task; but since emotion classification is the main goal, I am not sure whether it might hurt (as it leans more toward the ASR side). Please let me know if you find anything interesting when using base-960h.
  2. I did test freezing the feature_extractor; no significant difference was observed whether it was frozen or not.

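The freezing experiment in point 2 can be sketched as follows. The module names (`feature_extractor`, `encoder`) are illustrative stand-ins for the wav2vec2 submodules; in recent transformers versions, Wav2Vec2 models also expose a `freeze_feature_encoder()` helper that does this directly:

```python
import torch.nn as nn

# Toy model with an illustrative "feature extractor" and "encoder" split,
# standing in for the corresponding wav2vec2 submodules.
model = nn.ModuleDict({
    "feature_extractor": nn.Conv1d(1, 8, kernel_size=3),
    "encoder": nn.Linear(8, 8),
})

# Freeze the feature extractor: its parameters are excluded from gradient updates
for p in model["feature_extractor"].parameters():
    p.requires_grad = False

# Only the encoder's parameters remain trainable
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

The optimizer should then be built over only the trainable parameters (e.g. `filter(lambda p: p.requires_grad, model.parameters())`).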
khalidhuseynov commented 2 years ago

Thanks for your response. I've tested wav2vec2-base-960h; it helped model generalizability and gave better performance on random speech samples (not from the IEMOCAP dataset). However, it's not very useful when the train/test sets are limited to IEMOCAP. I'll close this issue then.

owos commented 1 year ago

@khalidhuseynov and @TideDancer, I'm trying to do something different that would involve building a more sophisticated architecture. Do you have any ideas on whether adding more layers before and after the pooling layer for the classification task would improve performance? Also, do you think using alpha values between 0.1 and 0.01 might improve performance?

TideDancer commented 1 year ago

@owos, thanks for your interest.

  1. What kind of blocks would you add before and after the pooling layer? After pooling I use a simple FC, which can be improved. Before pooling, I think you can use fancier transformer structures, e.g. Branchformer, for the speech task.
  2. Feel free to adjust the alpha values if resources permit. I didn't test values between them, but it's worth a try.
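The two ideas being discussed can be sketched together: a deeper classification head around the pooling step, and an alpha-weighted combination of the CTC (ASR) loss and the cross-entropy (emotion) loss. All layer sizes and module names below are illustrative assumptions, not taken from the repo:

```python
import torch
import torch.nn as nn

# Hypothetical head: an extra block before pooling and a deeper FC stack after,
# replacing the single FC layer used in the original code.
class EmotionHead(nn.Module):
    def __init__(self, hidden=32, n_emotions=4):
        super().__init__()
        self.pre = nn.Linear(hidden, hidden)        # extra block before pooling
        self.post = nn.Sequential(                  # deeper head after pooling
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, x):            # x: (batch, time, hidden)
        x = torch.relu(self.pre(x))
        x = x.mean(dim=1)            # mean pooling over time
        return self.post(x)

head = EmotionHead()
logits = head(torch.randn(2, 50, 32))   # -> shape (2, 4)

# Alpha-weighted combined objective, sweeping values between 0.1 and 0.01
# (the loss values here are placeholders):
ctc_loss, ce_loss = torch.tensor(2.0), torch.tensor(1.0)
for alpha in (0.1, 0.05, 0.01):
    loss = alpha * ctc_loss + (1 - alpha) * ce_loss
```

In practice one would pick alpha by validation performance on the emotion task, since the CTC term only serves as an auxiliary objective here.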