IDRnD / ReDimNet

The official PyTorch implementation of the Interspeech 2024 paper "Reshape Dimensions Network for Speaker Recognition"
MIT License

Merging the code to wespeaker #3

Open wsstriving opened 2 months ago

wsstriving commented 2 months ago

Thank you for the excellent work! I would like to ask whether you would mind us adapting this code into the official WeSpeaker models. We will definitely include the original paper link, authorship, etc. I just want to check whether you are okay with WeSpeaker's open-source license.

Best regards, Shuai

vanIvan commented 2 months ago

@wsstriving Sure, we'd be happy to, as long as you provide a working way to train it. I'm sharing an archive with configs and training logs for all model sizes. The configs may differ a bit from the ones used in wespeaker, but the main hyperparameters have the same structure. Also, a few models were trained with AAMSoftmax instead of SphereFace2; they should reach better quality if retrained with SphereFace2. redimnet_vox2_configs.tar.gz

wsstriving commented 1 month ago

Hi, I would like to ask whether the configs you provided here are the ones used in the paper, because for some of them I cannot get the same number of parameters (as shown in Table 4).

Other questions:

- Some of them are using arc_margin loss and the others using sphereface 2, any specific considerations?
- Do you have any additional tricks when playing with large margin finetuning? I only tested B2 currently and I can get comparable results before LM, but the one after LM is not that good.

vanIvan commented 1 month ago

Hi, @wsstriving thank you for your attempt on model retraining!

Answering your questions:

> because I found that for some of them, I cannot get the same number of parameters (as shown in Table 4)

  1. There might be a small difference for B1+ model sizes, +/- 100-200k parameters, due to how the number of parameters is calculated. Also, could you please clarify what the size difference is for the models? Could it be the difference between the model with and without the classification head? For pretraining it should have an additional 192 x 5994 ≈ 1.15M parameters. In our paper we reported sizes without the classification head and the feature extractor (as they are frozen during training / inference).
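As a quick sanity check of the ≈1.15M figure: the classification head maps the 192-dim embedding to 5994 VoxCeleb2 speaker logits, which by itself accounts for a gap of that size. A minimal sketch (the bias-free `Linear` head is an assumption about the head's layout, not the exact implementation):

```python
import torch.nn as nn

# Classification head used only during (pre)training: 192-dim embedding -> 5994 VoxCeleb2 speakers.
# Paper sizes are reported without this head (and without the frozen feature extractor).
head = nn.Linear(192, 5994, bias=False)  # margin-based heads typically use a bias-free weight matrix

head_params = sum(p.numel() for p in head.parameters())
print(head_params)  # 192 * 5994 = 1_150_848 ≈ 1.15M

def backbone_size(model: nn.Module, head_module: nn.Module) -> int:
    """Parameter count excluding the classification head, matching the paper's convention."""
    return sum(p.numel() for p in model.parameters()) - sum(p.numel() for p in head_module.parameters())
```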

> Some of them are using arc_margin loss and the others using sphereface 2, any specific considerations?

  2. For all of the model sizes we found that SphereFace2 works better, and we advise switching to it by default (the b2 config should be the default for all models). Most of the models were trained with the same set of hyperparameters: weight decay / max LR / scheduler setup, copied from one of the resnet configs in the original wespeaker pipeline. This is true for all model sizes except perhaps the smallest B0 and the largest B6 ReDimNets, where we might have slightly changed the weight decay.
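For illustration only, the shared setup described above could be sketched as a wespeaker-style config fragment like the one below; the concrete numbers are placeholders, not the values from redimnet_vox2_configs.tar.gz:

```python
# Illustrative shape of the shared optimizer/loss setup discussed above (wespeaker-style).
# All numeric values are placeholders; see the released config archive for the real ones.
train_conf = {
    "loss": "sphereface2",          # recommended default for all model sizes (B0-B6)
    "optimizer": "SGD",
    "weight_decay": 1e-4,           # placeholder; B0/B6 may use a slightly different value
    "scheduler": "ExponentialDecrease",
    "max_lr": 0.1,                  # placeholder peak LR of the warmup-then-decay schedule
}
```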

> Do you have any additional tricks when playing with large margin finetuning? I only tested B2 currently and I can get comparable results before LM, but the one after LM is not that good.

  3. Actually, we don't; it should work without any changes. Could you share the configs you are using for LM and the metrics you are getting?
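For reference, large-margin fine-tuning usually amounts to continuing training from the converged checkpoint with a larger AAM margin and longer chunks. A minimal sketch, assuming `model`, `aam_head`, `train_loader`, and `optimizer` already exist (names are illustrative, and wespeaker normally changes the margin through its margin scheduler rather than by setting an attribute directly):

```python
def large_margin_finetune(model, aam_head, train_loader, optimizer, lm_margin=0.5, lm_epochs=2):
    """Sketch of large-margin fine-tuning (LM): resume from the converged checkpoint,
    raise the AAM margin (commonly 0.2 -> 0.5) and train briefly on longer chunks."""
    aam_head.margin = lm_margin  # hypothetical attribute; use the pipeline's margin scheduler in practice
    for _ in range(lm_epochs):
        for feats, labels in train_loader:   # typically longer (e.g. 6 s) chunks than in main training
            emb = model(feats)
            loss = aam_head(emb, labels)     # margin-based softmax loss with the increased margin
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```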
wsstriving commented 1 month ago

I've created a draft pull request for wespeaker (https://github.com/wenet-e2e/wespeaker/pull/346/files) that you can check. Basically, I've adapted your code to align with the wespeaker style and removed the preprocessing part (feature extraction) so it uses wespeaker's existing implementation. You can find the default configurations for the B0-B6 models in the model.py file, along with a comparison of model sizes.

Unfortunately, I don't have the resources to run all the experiments right now. However, I can share some preliminary results for the B2 model with the arc_margin loss. I initially had global_context_att set to False (different from your setup).

Before LM (no score norm): O: 0.744, E: 0.932, H: 1.761
After LM (no score norm): O: 0.712, E: 0.894, H: 1.621
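For context, `global_context_att` controls whether the attentive-statistics pooling attends over each frame together with utterance-level mean/std statistics. A self-contained sketch of the mechanism (not the exact wespeaker implementation):

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling; with global_context_att=True the attention also sees
    the utterance-level mean/std of the features (the setting discussed above)."""
    def __init__(self, in_dim: int, bottleneck: int = 128, global_context_att: bool = False):
        super().__init__()
        self.global_context_att = global_context_att
        att_in = in_dim * 3 if global_context_att else in_dim
        self.att = nn.Sequential(
            nn.Conv1d(att_in, bottleneck, 1), nn.Tanh(), nn.Conv1d(bottleneck, in_dim, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, feat, time)
        if self.global_context_att:
            mean = x.mean(dim=2, keepdim=True).expand_as(x)
            std = x.std(dim=2, keepdim=True).expand_as(x)
            att_in = torch.cat([x, mean, std], dim=1)     # frame features + global context
        else:
            att_in = x
        w = torch.softmax(self.att(att_in), dim=2)        # per-frame, per-channel attention weights
        mu = (x * w).sum(dim=2)
        sigma = ((x ** 2 * w).sum(dim=2) - mu ** 2).clamp(min=1e-7).sqrt()
        return torch.cat([mu, sigma], dim=1)              # (batch, 2 * feat)
```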

vanIvan commented 1 month ago

Thank you for sharing. There is some mismatch in the features setup:

I found our internal results for the ReDimNet-B2 LM model trained with AAM loss and global_context_att set to True:

After LM (no score norm): O: 0.675 E: 0.826 H: 1.457

There might be some improvement when setting global_context_att to True.

You should get the best results (matching ours) by using, for all models:

wsstriving commented 1 month ago

Hi, @vanIvan, we have merged the initial version into wespeaker (https://github.com/wenet-e2e/wespeaker/pull/346), but there is still some performance gap. It would be great if you could try the current implementation and give some suggestions! BTW, if you will be at Interspeech, I'm looking forward to talking with you face to face.

vanIvan commented 1 month ago

Hi, @wsstriving, thank you for the integration, we'll try to look at it soon. Yes, a few of my colleagues from our team and I are going to attend Interspeech and present ReDimNet there; it would be nice to meet, let's keep in touch!

vanIvan commented 3 weeks ago

Hello, @wsstriving! I have realized that the wespeaker pipeline has no separate weight decay for the projection head apart from the backbone network; currently a single weight_decay is used for the whole network. So I've added a separate weight_decay for the projection head in the forked wespeaker pipeline. Could you please check it and, if you have time, perhaps retrain the model to see whether it improves results (especially for the SF2 loss)? I also made the model more lightweight during training by increasing the hop length of the mel filterbanks in its config; it should now train faster, and one can fit a bigger batch on the same GPU setup.
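In PyTorch terms, the separate weight decay amounts to giving the projection head its own optimizer parameter group. A minimal sketch, where `model.projection` and the decay values are illustrative assumptions rather than the actual wespeaker names or settings:

```python
import torch

def build_optimizer(model: torch.nn.Module, backbone_wd: float = 1e-4, head_wd: float = 1e-3):
    """Sketch: separate weight decay for the projection head vs. the backbone via parameter groups.
    `model.projection` is an illustrative attribute name, not necessarily the wespeaker one."""
    head_params = list(model.projection.parameters())
    head_ids = {id(p) for p in head_params}
    backbone_params = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.SGD(
        [
            {"params": backbone_params, "weight_decay": backbone_wd},
            {"params": head_params, "weight_decay": head_wd},  # the head gets its own decay
        ],
        lr=0.1, momentum=0.9,
    )
```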