IDRnD / ReDimNet

The official PyTorch implementation of the Interspeech 2024 paper "Reshape Dimensions Network for Speaker Recognition"
MIT License

Merging the code to wespeaker #3

Open wsstriving opened 2 months ago

wsstriving commented 2 months ago

Thank you for the excellent work! I would like to ask whether you would mind us adapting this code into the official WeSpeaker models. We will definitely include the original paper link, authorship, etc. I just want to check whether you are okay with WeSpeaker's open-source license.

Best regards, Shuai

vanIvan commented 2 months ago

@wsstriving Sure, we'd be happy to, as long as you provide a working way to train it. I'm sharing an archive with configs and training logs for all model sizes. The configs may differ a bit from the ones used in wespeaker, but the main hyperparameters have the same structure. Also, a few models were trained with AAMSoftmax instead of SphereFace2; they should reach better quality if retrained with SphereFace2. redimnet_vox2_configs.tar.gz

wsstriving commented 1 month ago

Hi, I would like to ask whether the configs you provided here are the ones used in the paper, because for some of them I cannot get the same number of parameters (as shown in Table 4).

Other questions:

- Some of them are using arc_margin loss and the others using sphereface 2, any specific considerations?
- Do you have any additional tricks when playing with large margin finetuning? I only tested B2 currently and I can get comparable results before LM, but the one after LM is not that good.

vanIvan commented 1 month ago

Hi, @wsstriving thank you for your attempt on model retraining!

Answering your questions:

> because I found that for some of them, I cannot get the same number of parameters (as shown in Table 4)

  1. There might be a small difference for B1+ model sizes, +/- 100-200k parameters, due to how the number of parameters is calculated. Also, could you please clarify what the size difference is for the models? Could it be the difference between the model with and without the classification head? For pretraining it should have an additional 192 x 5994 ≈ 1.15M parameters. In our paper we reported sizes without the classification head and the feature extractor (as they are frozen during training / inference).
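As a quick sanity check of the ≈1.15M figure: the classification head maps the 192-dim embedding to 5994 VoxCeleb2 speaker logits, which by itself accounts for a gap of that size. A minimal sketch (the bias-free `Linear` head is an assumption about the head's layout, not the exact implementation):

```python
import torch.nn as nn

# Classification head used only during (pre)training: 192-dim embedding -> 5994 VoxCeleb2 speakers.
# Paper sizes are reported without this head (and without the frozen feature extractor).
head = nn.Linear(192, 5994, bias=False)  # margin-based heads typically use a bias-free weight matrix

head_params = sum(p.numel() for p in head.parameters())
print(head_params)  # 192 * 5994 = 1_150_848 ≈ 1.15M

def backbone_size(model: nn.Module, head_module: nn.Module) -> int:
    """Parameter count excluding the classification head, matching the paper's convention."""
    return sum(p.numel() for p in model.parameters()) - sum(p.numel() for p in head_module.parameters())
```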

> Some of them are using arc_margin loss and the others using sphereface 2, any specific considerations?

  2. For all of the model sizes we found that SphereFace2 works better, and we advise switching to it by default (the b2 config should be the default for all models). Most of the models were trained with the same set of hyperparameters: weight decay / max LR / scheduler setup, copied from one of the resnet configs in the original wespeaker pipeline. This is true for all model sizes except perhaps the smallest B0 and the largest B6 ReDimNets, where we might have slightly changed the weight decay.
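For illustration only, the shared setup described above could be sketched as a wespeaker-style config fragment like the one below; the concrete numbers are placeholders, not the values from redimnet_vox2_configs.tar.gz:

```python
# Illustrative shape of the shared optimizer/loss setup discussed above (wespeaker-style).
# All numeric values are placeholders; see the released config archive for the real ones.
train_conf = {
    "loss": "sphereface2",          # recommended default for all model sizes (B0-B6)
    "optimizer": "SGD",
    "weight_decay": 1e-4,           # placeholder; B0/B6 may use a slightly different value
    "scheduler": "ExponentialDecrease",
    "max_lr": 0.1,                  # placeholder peak LR of the warmup-then-decay schedule
}
```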

> Do you have any additional tricks when playing with large margin finetuning? I only tested B2 currently and I can get comparable results before LM, but the one after LM is not that good.

  3. Actually, we don't; it should work without any changes. Could you share the configs you are using for LM and the metrics you are getting?
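For reference, large-margin fine-tuning usually amounts to continuing training from the converged checkpoint with a larger AAM margin and longer chunks. A minimal sketch, assuming `model`, `aam_head`, `train_loader`, and `optimizer` already exist (names are illustrative, and wespeaker normally changes the margin through its margin scheduler rather than by setting an attribute directly):

```python
def large_margin_finetune(model, aam_head, train_loader, optimizer, lm_margin=0.5, lm_epochs=2):
    """Sketch of large-margin fine-tuning (LM): resume from the converged checkpoint,
    raise the AAM margin (commonly 0.2 -> 0.5) and train briefly on longer chunks."""
    aam_head.margin = lm_margin  # hypothetical attribute; use the pipeline's margin scheduler in practice
    for _ in range(lm_epochs):
        for feats, labels in train_loader:   # typically longer (e.g. 6 s) chunks than in main training
            emb = model(feats)
            loss = aam_head(emb, labels)     # margin-based softmax loss with the increased margin
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```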
wsstriving commented 1 month ago

I've created a draft pull request for wespeaker (https://github.com/wenet-e2e/wespeaker/pull/346/files) that you can check. Basically, I've adapted your code to align with the wespeaker style and removed the preprocessing part (feature extraction) so it uses wespeaker's existing implementation. You can find the default configurations for the B0-B6 models in the model.py file, along with a comparison of model sizes.

Unfortunately, I don't have the resources to run all the experiments right now. However, I can share some preliminary results for the B2 model with the arc_margin loss. I initially had global_context_att set to False (different from your setup).

Before LM (no score norm): O: 0.744, E: 0.932, H: 1.761
After LM (no score norm): O: 0.712, E: 0.894, H: 1.621
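For context, `global_context_att` controls whether the attentive-statistics pooling attends over each frame together with utterance-level mean/std statistics. A self-contained sketch of the mechanism (not the exact wespeaker implementation):

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling; with global_context_att=True the attention also sees
    the utterance-level mean/std of the features (the setting discussed above)."""
    def __init__(self, in_dim: int, bottleneck: int = 128, global_context_att: bool = False):
        super().__init__()
        self.global_context_att = global_context_att
        att_in = in_dim * 3 if global_context_att else in_dim
        self.att = nn.Sequential(
            nn.Conv1d(att_in, bottleneck, 1), nn.Tanh(), nn.Conv1d(bottleneck, in_dim, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, feat, time)
        if self.global_context_att:
            mean = x.mean(dim=2, keepdim=True).expand_as(x)
            std = x.std(dim=2, keepdim=True).expand_as(x)
            att_in = torch.cat([x, mean, std], dim=1)     # frame features + global context
        else:
            att_in = x
        w = torch.softmax(self.att(att_in), dim=2)        # per-frame, per-channel attention weights
        mu = (x * w).sum(dim=2)
        sigma = ((x ** 2 * w).sum(dim=2) - mu ** 2).clamp(min=1e-7).sqrt()
        return torch.cat([mu, sigma], dim=1)              # (batch, 2 * feat)
```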

vanIvan commented 1 month ago

Thank you for sharing. There is some mismatch in the features setup:

I found our internal results for the ReDimNet-B2 LM model trained with AAM loss and global_context_att set to True:

After LM (no score norm): O: 0.675 E: 0.826 H: 1.457

There might be some improvement when setting global_context_att to True.

You should get the best results (matching ours) by using, for all models:

wsstriving commented 1 month ago

Hi, @vanIvan, we have merged the initial version into wespeaker (https://github.com/wenet-e2e/wespeaker/pull/346), but there is still some performance gap. It would be great if you could try the current implementation and give some suggestions! BTW, if you will be at Interspeech, I'm looking forward to talking with you face to face.

vanIvan commented 1 month ago

Hi, @wsstriving, thank you for the integration, we'll try to look at it soon. Yes, a few of my colleagues from our team and I are going to attend Interspeech and present ReDimNet there; it would be nice to meet, let's keep in touch!

vanIvan commented 3 weeks ago

Hello, @wsstriving! I have realized that the wespeaker pipeline has no separate weight decay for the projection head apart from the backbone network; currently a single weight_decay is used for the whole network. So I've added a separate weight_decay for the projection head in the forked wespeaker pipeline. Could you please check it and, if you have time, perhaps retrain the model to see whether it improves results (especially for the SF2 loss)? I also made the model more lightweight during training by increasing the hop length of the mel filterbanks in its config; it should now train faster, and one can fit a bigger batch on the same GPU setup.
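In PyTorch terms, the separate weight decay amounts to giving the projection head its own optimizer parameter group. A minimal sketch, where `model.projection` and the decay values are illustrative assumptions rather than the actual wespeaker names or settings:

```python
import torch

def build_optimizer(model: torch.nn.Module, backbone_wd: float = 1e-4, head_wd: float = 1e-3):
    """Sketch: separate weight decay for the projection head vs. the backbone via parameter groups.
    `model.projection` is an illustrative attribute name, not necessarily the wespeaker one."""
    head_params = list(model.projection.parameters())
    head_ids = {id(p) for p in head_params}
    backbone_params = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.SGD(
        [
            {"params": backbone_params, "weight_decay": backbone_wd},
            {"params": head_params, "weight_decay": head_wd},  # the head gets its own decay
        ],
        lr=0.1, momentum=0.9,
    )
```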