clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition

Discussions for training / VoxSRC #55

Closed joonson closed 3 years ago

joonson commented 4 years ago

* Changing `--n_mels` from 40 to 64 leads to a small increase in performance.

* Using `--log_input` also leads to a small increase in performance.

* Combining two loss functions (e.g. `angleproto` and `softmax`) sometimes has a positive effect. This should be defined as a new loss function that returns the sum of two losses in the `loss` directory.

* Zero padding of the input causes a significant adverse effect on performance. When there is a large variation in the length of input audio files (e.g. VoxSRC), I recommend `--eval_frames 0`, which uses whatever length of audio is available without padding or cropping.

For example, this configuration gives 1.98% EER using the standard train and test lists. I believe that many of you have trained better models using this trainer. I would appreciate it if you are able to share your knowledge!

Shane-pe commented 4 years ago

@joonson Thanks for sharing. May I ask a simple question: how do I change the settings for `--n_mels` and `--log_input`?

joonson commented 4 years ago

They have been added as arguments to the trainer in a recent update.
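
For reference, here is a minimal torchaudio sketch of what the two options control conceptually: `--n_mels` sets the number of mel filterbank bins and `--log_input` applies a log compression to the spectrogram. This is an illustration under those assumptions, not the trainer's exact preprocessing code.

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000)  # 1 s of dummy 16 kHz audio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,
    win_length=400,   # 25 ms window
    hop_length=160,   # 10 ms hop
    n_mels=64,        # roughly what --n_mels controls (e.g. 40 vs 64)
)(waveform)

# Roughly what --log_input enables: log-compress the mel energies.
log_mel = torch.log(mel + 1e-6)
```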

zeek-han commented 4 years ago

For augmentation, is there any good method other than the standard Kaldi recipe?

Thank you...

joonson commented 4 years ago

I have not tried any kind of augmentation. These are the best results I have without using augmentation.

zh794390558 commented 4 years ago

When using a large batch size, how long does it take to train for 500 epochs? The pipeline is slow for me.

joonson commented 4 years ago

@zh794390558 It should take around 2-3 days on a V100 / Titan RTX GPU for the attached configuration.

zeek-han commented 4 years ago

I use 2 Titan RTX GPUs, but it takes much more than 2-3 days. How many Titan RTX GPUs do you use?

Moreover, what is the best MinDCF you know of? I have only a little information about MinDCF, so please let me know more about it.

Thank you...

joonson commented 4 years ago

@zeek-han can you provide command line outputs for the first few epochs? In particular, what speed (Hz) do you see using which loss? I will provide answers for the rest soon.

zh794390558 commented 4 years ago

@joonson One epoch takes me almost 3 hours. How can I accelerate the pipeline using the DataLoader?

joonson commented 4 years ago

@zh794390558 One epoch should take around 5-10 minutes using a modern GPU. What is your CPU and GPU utilization when you run the script using the recommended settings above?

zh794390558 commented 4 years ago

I am using multi-GPU training, and the GPU utilization is almost zero. But using the voxceleb_unsupervised pipeline, the GPU utilization looks good.

joonson commented 4 years ago

@zh794390558 I am aware of many bugs with multi-GPU training in PyTorch which I cannot help resolve. The trainer can get good results using a single GPU -- I suggest you try without multi-GPU first. I have been able to get similar results even using single GPUs with less memory such as NVIDIA M40/P40/Titan RTX. You just need to reduce the batch size slightly.

Shane-pe commented 4 years ago

@joonson One epoch takes me almost 3 hours. How can I accelerate the pipeline using the DataLoader?

@zh794390558 Where is your training dataset stored? On an external hard disk? I copied my training dataset from an external hard disk to the system SSD, and the training time per epoch dropped from 2 hours to several minutes. This is my experience; I hope it helps you.
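
If the data has to stay on slower storage, the generic PyTorch DataLoader settings below often help hide I/O and decoding latency. This is a self-contained sketch with a dummy dataset, not the repo's own data loader; the useful knobs are `num_workers` and `pin_memory`.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyUtterances(Dataset):
    """Stand-in for the real VoxCeleb dataset: random 2 s 'waveforms' and speaker ids."""
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return torch.randn(32240), idx % 10

loader = DataLoader(
    DummyUtterances(),
    batch_size=200,
    shuffle=True,
    num_workers=8,    # parallel workers overlap disk reads / decoding with GPU compute
    pin_memory=True,  # page-locked memory speeds up host-to-GPU copies
    drop_last=True,
)

for waveforms, labels in loader:
    pass  # the training step would go here
```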

zh794390558 commented 4 years ago

@Shane-pe My data is on NFS; I will try this.

009deep commented 4 years ago

@zeek-han @Shane-pe For multi-GPU, are you using this PR? If so, I'll be interested to know about any problems you run into as far as GPU usage/performance goes. I've been using it since the beginning and have not run into any issues so far.

joonson commented 4 years ago

I have heard that multi-GPU training has a large negative impact on metric-learning-based methods (e.g. angleproto, ge2e) but not much effect on softmax-based losses. Does anyone have experience to verify or falsify this claim?

zeek-han commented 4 years ago

@joonson Thank you for the response. My speed with softmaxproto ranges from 190 Hz to 400 Hz. Thank you.

Moreover, could you provide the EER and MinDCF when using VoxCeleb1-E and VoxCeleb1-H as test sets with the good config? Thank you always.

joonson commented 4 years ago

Hi @zeek-han, here are the results. For the speed, I get around 600 Hz in training using 1 GPU. This is using the configuration attached above and `--eval_frames 350`.

| Model | EER (%) | MinDCF |
| --- | --- | --- |
| Vox1 | 1.93 | 0.152 |
| Vox1E | 2.17 | 0.151 |
| Vox1H | 3.99 | 0.252 |
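
For anyone unfamiliar with MinDCF, below is a minimal sketch of the normalised minimum detection cost, assuming the usual VoxSRC-style parameters (p_target = 0.05, c_miss = c_fa = 1); the repo's own evaluation code and the official scoring tool may differ in detail.

```python
import numpy as np

def min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalised detection cost over all score thresholds.
    labels: 1 for target (same-speaker) trials, 0 for impostor trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    labels = labels[np.argsort(scores)]

    n_target = labels.sum()
    n_impostor = len(labels) - n_target
    # Sweep the threshold just above each sorted score: trials at or below it are rejected.
    p_miss = np.cumsum(labels) / n_target                 # targets wrongly rejected
    p_fa = 1.0 - np.cumsum(1 - labels) / n_impostor       # impostors wrongly accepted
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

# Toy example: two target (1) and two impostor (0) trials with negative-distance scores.
print(min_dcf([-0.2, -0.9, -0.4, -1.3], [1, 0, 1, 0]))
```
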
zh794390558 commented 4 years ago

@Shane-pe What model and batch size are you using? I uploaded my data to an SSD, but one epoch still needs 1.3 hours.

Shane-pe commented 4 years ago

@zeek-han @Shane-pe For multi-GPU, are you using this PR? If so, I'll be interested to know about any problems you run into as far as GPU usage/performance goes. I've been using it since the beginning and have not run into any issues so far.

@009deep I used single GPU.

Shane-pe commented 4 years ago

@Shane-pe What model and batch size are you using? I uploaded my data to an SSD, but one epoch still needs 1.3 hours.

@zh794390558 Model: ResNetSE34L (Fast ResNet34), Batch size: 350

Shane-pe commented 4 years ago

@Shane-pe What model and batch size are you using? I uploaded my data to an SSD, but one epoch still needs 1.3 hours.

@zh794390558 Can you upload your scores.txt? This will be helpful for others to check your settings.

ShaneRun commented 4 years ago
* Changing `--n_mels` from 40 to 64 leads to a small increase in performance.

* Using `--log_input` also leads to a small increase in performance.

* Combining two loss functions (e.g. `angleproto` and `softmax`) sometimes has a positive effect. This should be defined as a new loss function that returns the sum of two losses in the `loss` directory.

* Zero padding of the input causes a significant adverse effect on performance. When there is a large variation in the length of input audio files (e.g. VoxSRC), I recommend `--eval_frames 0`, which uses whatever length of audio is available without padding or cropping.

For example, this configuration gives 1.98% EER using the standard train and test lists. I believe that many of you have trained better models using this trainer. I would appreciate it if you are able to share your knowledge!

For training Thin ResNet, do I only need to change the model type from ResNetSE34L to ResNet34SE? And do all other settings remain the same, for example --log_input True?

zh794390558 commented 4 years ago

@Shane-pe Sorry, I cannot provide the scores.txt file. But when I use a large batch size, it spends about 1600 s per batch and there is only one example in the queue:

batch process time with one thread: 1649.7700481414795 s, batch size: 1000 512.60 Hz Q:(0/100) D: 23.55 Hz

zeek-han commented 4 years ago

@joonson Thank you for your answers.

I have a question. When you introduced this issue, you said: "Combining two loss functions (e.g. angleproto and softmax) sometimes has a positive effect. This should be defined as a new loss function that returns the sum of two losses in the loss directory."

Why do you say "sometimes"? The word "sometimes" makes me think the result might be unstable when I run the code with your good config on 1 GPU (Tesla V100). That is, when I train once I might get the expected result, but nobody guarantees it. Do I understand you correctly?

Thank you always..

ShaneRun commented 4 years ago

Hi @zeek-han, here are the results. For the speed, I get around 600 Hz in training using 1 GPU. This is using the configuration attached above and --eval_frames 350. Vox1: 1.93 EER / 0.152 MinDCF, Vox1E: 2.17 / 0.151, Vox1H: 3.99 / 0.252

@joonson Hi Mr. Chung, I got negative scores for all test pairs, sometimes smaller than -1; please see the attachment for more details. What is the problem? Am I doing something wrong? I noticed there are issues on this (#15 and #49), but I found no good explanation in them.

Is it okay to use all negative scores as input to the validation toolkit of VoxSRC2020?

Or is this a bug in the trainer?

I am confused, looking forward to your reply, thanks.

scores_output_on_veri.txt

joonson commented 4 years ago

The trainer outputs negative numbers because they are negative Euclidean distances. It is okay to have these scores for validation and test set of VoxSRC 2020.

Shane-pe commented 4 years ago

The trainer outputs negative numbers because they are negative Euclidean distances. It is okay to have these scores for validation and test set of VoxSRC 2020.

Hi @joonson, I noticed the score boundary (which decides whether a pair is from the same speaker or not) seems to be around -1; am I correct? Can you describe in a bit more detail how the negative Euclidean distances are used and how to design the score boundary? Thanks in advance.

joonson commented 4 years ago

Well, the decision boundary is something that you decide, e.g. the threshold at EER. The distance function should not affect how you decide on the boundary -- the performance should be the same if you replace the negative distance with cosine similarity, though the scores will be different.
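
As a small illustration of that point, the sketch below (plain NumPy, not the repo's scoring code) shows that for L2-normalised embeddings the negative Euclidean distance is a monotonic function of the cosine similarity, so trial rankings, and therefore EER, are unchanged; only the raw score values differ.

```python
import numpy as np

def neg_euclidean(a, b):
    return -np.linalg.norm(a - b)

def cosine(a, b):
    return float(np.dot(a, b))  # vectors are already unit-norm below

# Two toy unit-norm vectors standing in for speaker embeddings.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(512), rng.standard_normal(512)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

# For unit-norm embeddings: -||x - y|| = -sqrt(2 - 2 * cos(x, y)),
# a strictly increasing function of the cosine, so pair rankings (and EER) match.
print(neg_euclidean(x, y), cosine(x, y))
```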

Shane-pe commented 3 years ago

@joonson I just changed the loss function from angleproto to softmaxproto, and after running for 500 epochs, the results are quite different:

  • angleproto IT 500, LR 0.000081, TEER/TAcc 79.08, TLOSS 0.806571, VEER 2.1633
  • softmaxproto IT 500, LR 0.000081, TEER/TAcc 97.63, TLOSS 0.885869, VEER 2.2269

My question is: can the Top-1 accuracy change so much (from 79 to 97) while the VEER remains similar?

joonson commented 3 years ago

@joonson I just changed the loss function from angleproto to softmaxproto, and after running for 500 epochs, the results are quite different:

  • angleproto IT 500, LR 0.000081, TEER/TAcc 79.08, TLOSS 0.806571, VEER 2.1633
  • softmaxproto IT 500, LR 0.000081, TEER/TAcc 97.63, TLOSS 0.885869, VEER 2.2269

My question is: can the Top-1 accuracy change so much (from 79 to 97) while the VEER remains similar?

This is because the accuracy represents different things in softmaxproto and angleproto: in the former it's the softmax classification accuracy, whereas in the latter it's the within-batch matching accuracy.
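
To make the "sum of two losses" idea above concrete, here is a minimal sketch of a combined softmax + angular prototypical objective. The class name, tensor shapes, and hyper-parameters are assumptions for illustration, not the repo's SoftmaxProto implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """Illustrative sum of a softmax classification loss and an angular
    prototypical loss (names and shapes are assumptions, not the repo's code)."""
    def __init__(self, emb_dim, n_speakers):
        super().__init__()
        self.classifier = nn.Linear(emb_dim, n_speakers)  # softmax term
        self.w = nn.Parameter(torch.tensor(10.0))         # learnable scale for cosine scores
        self.b = nn.Parameter(torch.tensor(-5.0))         # learnable bias
        self.ce = nn.CrossEntropyLoss()

    def forward(self, x, labels):
        # x: (speakers_per_batch, utts_per_speaker, emb_dim); labels: speaker ids
        # Softmax term: classify every utterance embedding against all training speakers.
        logits = self.classifier(x.reshape(-1, x.size(-1)))
        loss_softmax = self.ce(logits, labels.repeat_interleave(x.size(1)))

        # Angular prototypical term: query = first utterance of each speaker,
        # prototype = mean of that speaker's remaining utterances.
        query = F.normalize(x[:, 0, :], dim=-1)
        proto = F.normalize(x[:, 1:, :].mean(dim=1), dim=-1)
        sim = self.w * (query @ proto.t()) + self.b       # (batch, batch) cosine scores
        target = torch.arange(x.size(0))                  # each query matches its own prototype
        loss_proto = self.ce(sim, target)

        return loss_softmax + loss_proto

# Toy usage: 8 speakers in the batch, 2 utterances each, 512-dim embeddings.
emb = torch.randn(8, 2, 512, requires_grad=True)
spk = torch.arange(8)  # dummy speaker labels
loss = CombinedLoss(512, n_speakers=100)(emb, spk)
loss.backward()
```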

yy835055664 commented 3 years ago
  • Using --log_input also leads to a small increase in performance.

Hello, joonson. Thank you for your ideas. For the --log_input feature, what is the principle behind this method? How does it improve performance? I hope you can reply. Thank you.

ukemamaster commented 3 years ago

@zh794390558 Did you solve your problem of slow training? I am having the same problem: one epoch takes almost 3 hours (sometimes more than that) on 8 Tesla T4 GPUs using distributed training.

But my case is a little different, explained here in detail.

If you have solved your problem, could you please share your solution?