clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition
MIT License

Any recommendations on implementing TDNN using voxceleb_trainer? #87

Closed: ShaneRun closed this issue 3 years ago

ShaneRun commented 3 years ago

@joonson Thank you so much for your contribution to this open-source work; it has really helped me a lot. As is well known, TDNNs are commonly implemented in Kaldi. If I want to implement one in PyTorch based on this trainer, do you think it is doable? If so, could you give me some recommendations on how to work it out? Many thanks in advance.

joonson commented 3 years ago

TDNNs can be represented as 1-d convolutions with dilation. Here is my implementation of x-vectors, which I have not tested, but you can try it out: XV.py.zip
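As a minimal sketch of this idea (illustrative only; the layer sizes and context settings below follow the standard x-vector recipe and are not necessarily what XV.py uses), a TDNN frame-level stack in PyTorch can be written as:

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One TDNN layer: a dilated 1-d convolution over frames, with ReLU and batch norm."""
    def __init__(self, in_dim, out_dim, kernel_size, dilation):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, dilation=dilation)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):  # x: (batch, in_dim, frames)
        return self.bn(torch.relu(self.conv(x)))

# Frame-level x-vector stack; the 40-dim filterbank input and the 512/1500
# channel sizes are assumptions for illustration.
frame_layers = nn.Sequential(
    TDNNLayer(40, 512, kernel_size=5, dilation=1),    # context [-2, +2]
    TDNNLayer(512, 512, kernel_size=3, dilation=2),   # context {-2, 0, +2}
    TDNNLayer(512, 512, kernel_size=3, dilation=3),   # context {-3, 0, +3}
    TDNNLayer(512, 512, kernel_size=1, dilation=1),
    TDNNLayer(512, 1500, kernel_size=1, dilation=1),
)

x = torch.randn(4, 40, 200)  # (batch, n_mels, frames)
h = frame_layers(x)          # -> (4, 1500, 186): each convolution trims its context
# Statistics pooling (mean + std over frames) followed by linear layers
# would then produce the fixed-size x-vector embedding.
```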

You could also try a more complex TDNN architecture at #86

ShaneRun commented 3 years ago

@joonson Thank you for your implementation! I will test it out tomorrow and let you know the result.

Shane-pe commented 3 years ago

@joonson Is it possible to add i-vectors to your trainer? In that case the trainer would become more powerful and general, and we could call it all-in-one, since i-vectors, x-vectors (TDNN), and r-vectors (ResNet) would all be included. I believe this trainer could then be used even more widely as a generalized trainer for supervised learning. This is my suggestion for an all-in-one design; please consider it, thanks!

ShaneRun commented 3 years ago

@joonson This is the training command: python ./trainSpeakerNet.py --model XV --log_input True --encoder_type SAP --trainfunc angleproto --save_path exps/exp1 --nPerSpeaker 2 --batch_size 400

These are the test scores over 20 epochs using the XV.py you provided:
IT 1, TEER/TAcc 4.25, TLOSS 4.927236
IT 2, TEER/TAcc 7.30, TLOSS 4.539513
IT 3, TEER/TAcc 8.90, TLOSS 4.386307
IT 4, TEER/TAcc 10.19, TLOSS 4.274195
IT 5, TEER/TAcc 11.21, TLOSS 4.189059
IT 6, TEER/TAcc 12.07, TLOSS 4.128410
IT 7, TEER/TAcc 12.83, TLOSS 4.074492
IT 8, TEER/TAcc 13.54, TLOSS 4.023191
IT 9, TEER/TAcc 14.13, TLOSS 3.980177
IT 10, VEER 11.6066
IT 10, TEER/TAcc 14.62, TLOSS 3.947957
IT 11, TEER/TAcc 15.08, TLOSS 3.914889
IT 12, TEER/TAcc 15.39, TLOSS 3.891497
IT 13, TEER/TAcc 15.73, TLOSS 3.867836
IT 14, TEER/TAcc 16.02, TLOSS 3.849706
IT 15, TEER/TAcc 16.27, TLOSS 3.833727
IT 16, TEER/TAcc 16.56, TLOSS 3.809935
IT 17, TEER/TAcc 16.79, TLOSS 3.795477
IT 18, TEER/TAcc 17.06, TLOSS 3.779800
IT 19, TEER/TAcc 17.30, TLOSS 3.765016
IT 20, VEER 9.6288
IT 20, TEER/TAcc 17.55, TLOSS 3.748320

For comparison, I also attach results I obtained earlier with the same configuration (the printed information is slightly different because it was not the latest version of the trunk), using the training command python ./trainSpeakerNet.py --model ResNetSE34L --log_input True --encoder_type SAP --trainfunc softmaxproto --save_path exps/exp2 --nPerSpeaker 2 --batch_size 400
IT 1, LR 0.001000, TEER/TAcc 1.41, TLOSS 11.942019
IT 2, LR 0.001000, TEER/TAcc 7.55, TLOSS 9.519187
IT 3, LR 0.001000, TEER/TAcc 16.41, TLOSS 8.233636
IT 4, LR 0.001000, TEER/TAcc 25.15, TLOSS 7.317389
IT 5, LR 0.001000, TEER/TAcc 32.96, TLOSS 6.619958
IT 6, LR 0.001000, TEER/TAcc 39.75, TLOSS 6.066701
IT 7, LR 0.001000, TEER/TAcc 45.47, TLOSS 5.614951
IT 8, LR 0.001000, TEER/TAcc 50.47, TLOSS 5.230480
IT 9, LR 0.001000, TEER/TAcc 54.69, TLOSS 4.906192
IT 10, LR 0.001000, TEER/TAcc 58.26, TLOSS 4.636681, VEER 6.5695
IT 11, LR 0.000950, TEER/TAcc 61.40, TLOSS 4.403981
IT 12, LR 0.000950, TEER/TAcc 63.75, TLOSS 4.215954
IT 13, LR 0.000950, TEER/TAcc 65.83, TLOSS 4.055617
IT 14, LR 0.000950, TEER/TAcc 67.58, TLOSS 3.913849
IT 15, LR 0.000950, TEER/TAcc 69.32, TLOSS 3.782728
IT 16, LR 0.000950, TEER/TAcc 70.68, TLOSS 3.672339
IT 17, LR 0.000950, TEER/TAcc 72.06, TLOSS 3.564854
IT 18, LR 0.000950, TEER/TAcc 73.17, TLOSS 3.477508
IT 19, LR 0.000950, TEER/TAcc 74.26, TLOSS 3.384924
IT 20, LR 0.000950, TEER/TAcc 75.30, TLOSS 3.300793, VEER 5.6151

Compared with the ResNetSE34L model, the XV model shows the following characteristics: (1) about 3x longer training time per epoch; (2) significantly lower TAcc; (3) noticeably worse (higher) VEER. Is this expected based on your experience with the XV model you provided? Thank you so much!
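(For anyone investigating a gap like this, a rough sanity check is to compare parameter counts and forward-pass time of the two models before blaming the architecture itself. The sketch below is generic; the commented import lines, the MainModel factory with its nOut argument, and the 16 kHz waveform input are assumptions about how the trainer builds and feeds its models.)

```python
import time

import torch

def profile_model(model, example_input, reps=10):
    """Rough comparison of model size and average forward-pass time on one input batch."""
    n_params = sum(p.numel() for p in model.parameters())
    model.eval()
    with torch.no_grad():
        model(example_input)                       # warm-up pass
        start = time.time()
        for _ in range(reps):
            model(example_input)
        avg_seconds = (time.time() - start) / reps
    return n_params, avg_seconds

# Hypothetical usage (adapt the imports/arguments to how the trainer instantiates models):
# import importlib
# xv     = importlib.import_module('models.XV').MainModel(nOut=512)
# resnet = importlib.import_module('models.ResNetSE34L').MainModel(nOut=512)
# wav = torch.randn(32, 32000)                     # ~2 s of 16 kHz audio per utterance
# print(profile_model(xv, wav))
# print(profile_model(resnet, wav))
```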

forwiat commented 3 years ago

Hi @ShaneRun, I see your experiments based on ResNetSE34L, ResNetSE34V2, and XV. I am running ResNetSE34V2 and wonder how low the EER can get. Could you share your results and some of your experiment configurations? It would help me a lot. Looking forward to your feedback. Thank you so much!

ShaneRun commented 3 years ago

@forwiat I only ran ResNetSE34L, and I used the recommended settings of the trainer described in the README, such as: --model ResNetSE34L --n_mels 40 --log_input True --encoder_type SAP --trainfunc softmaxproto

ShaneRun commented 3 years ago

@forwiat Moreover, XV was just a test and I do not plan to move forward with it. The configuration was: python ./trainSpeakerNet.py --model XV --log_input True --encoder_type SAP --trainfunc angleproto --save_path exps/exp1 --nPerSpeaker 2 --batch_size 400