HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

Things to check when testing with different DB #55

Open hash2430 opened 5 years ago

hash2430 commented 5 years ago

This might be a silly question, so I will begin with an apology. I am new to speaker verification, and I am trying to apply this repo to VoxCeleb1. Data loading and the other pieces seem straightforward, but I have a question regarding the EER calculation:

```python
for thres in [0.01*i+0.5 for i in range(50)]:
```

The similarity threshold in this case ranges from 0.50 to 0.99. Does this range need calibration when I am using a different DB? In another repo (DeepSpeaker) that uses VoxCeleb, the range appears to be different:

```python
# Calculate evaluation metrics
thresholds = np.arange(0, 30, 0.01)
tpr, fpr, accuracy = calculate_roc(thresholds, distances, labels)
thresholds = np.arange(0, 30, 0.001)
val, far = calculate_val(thresholds, distances, labels, 1e-3)
```
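For what it is worth, one way to sidestep the threshold-range question entirely is to sweep over the observed scores themselves, e.g. with scikit-learn's ROC curve. A minimal sketch, assuming `scores` holds the similarity of each trial pair and `labels` the 0/1 ground truth (both hypothetical arrays, not variables from either repo):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """EER from raw trial scores; no hand-picked threshold grid needed.
    scores: similarity per trial (higher means more likely same speaker).
    labels: 1 for same-speaker trials, 0 for different-speaker trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr  # false-reject rate
    # EER is the operating point where false accepts equal false rejects
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```

Because the thresholds come from the scores themselves, the same code works whether the scoring range is 0.5-0.99 (cosine similarity) or 0-30 (a distance), as long as the sign convention is consistent.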

I got 16% EER on VoxCeleb1. Can anybody give me advice on the tuning points I have to adjust? Or has anyone obtained a different EER using VoxCeleb1?

BarCodeReader commented 5 years ago

I also have a quite high EER on VOX1, and my GE2E loss is also quite high, around 20. I think we obtain very good results on TIMIT just because the dataset is simple: 630 people repeating 10 sentences, which gives you 6300 utterances. But VOX is 1250 speakers, each with 15-30 unique sentences...
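As context for the loss magnitude: the GE2E softmax loss sums one cross-entropy term per utterance in an N-speakers x M-utterances batch, so its raw value scales with N*M and is not directly comparable across batch configurations. Below is a minimal sketch of that computation on pre-computed embeddings; it paraphrases the paper's softmax variant rather than this repo's exact code, and `w`/`b` are the learned scale and offset:

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings, w, b):
    """embeddings: (N, M, D) tensor, N speakers x M utterances per speaker.
    w, b: learned scalar scale (kept positive) and offset, per the paper."""
    N, M, _ = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=1)   # (N, D)
    # Exclusive centroid: for utterance (j, i), average speaker j's
    # other M-1 utterances, so an utterance is not compared to itself.
    excl = (embeddings.sum(dim=1, keepdim=True) - embeddings) / (M - 1)

    # Cosine similarity of every utterance to every speaker centroid.
    e = F.normalize(embeddings, dim=2)
    sim = torch.einsum('jid,kd->jik', e, centroids)          # (N, M, N)
    idx = torch.arange(N)
    # Replace own-speaker entries with the exclusive-centroid similarity.
    sim[idx, :, idx] = F.cosine_similarity(embeddings, excl, dim=2)
    sim = w * sim + b

    # One cross-entropy term per utterance, summed (not averaged),
    # which is why the raw loss grows with N * M.
    labels = idx.repeat_interleave(M)
    return F.cross_entropy(sim.reshape(N * M, N), labels, reduction='sum')
```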

Have you continued your experiment on the UIS-RNN using the 16% EER model?

hash2430 commented 5 years ago

Thanks for your kind answer. My purpose in training speaker verification is to use it as an objective evaluation for speaker mimicking (generating speech of a new person who is unseen during training), so UIS-RNN is outside my interest. That is for speaker diarization, right?

Plus, I obtained 16% in the following manner:

  1. Increase the number of epochs to 1800 => 18% EER
  2. Do not use centroids from the validation set at test time; instead, use only enrollment embeddings to calculate the centroids (see the sketch after this list) => 16% EER

I did not expect the second approach to give a better EER. I just thought it would make more sense, and I cannot explain why it gave better performance.
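A minimal sketch of what point 2 means in practice, with hypothetical names rather than this repo's API: speaker centroids are built from enrollment utterances only, and test utterances are scored against those centroids.

```python
import numpy as np

def score_against_enrollment_centroids(enroll_embs, test_trials):
    """enroll_embs: dict speaker_id -> (n_utts, dim) enrollment embeddings.
    test_trials: list of (true_speaker_id, (dim,) embedding) pairs.
    Centroids come from enrollment data only, never from validation or
    test utterances. Returns (scores, labels) arrays for EER computation."""
    centroids = {}
    for spk, embs in enroll_embs.items():
        c = embs.mean(axis=0)
        centroids[spk] = c / np.linalg.norm(c)   # unit-length centroid
    scores, labels = [], []
    for true_spk, emb in test_trials:
        emb = emb / np.linalg.norm(emb)
        for spk, c in centroids.items():
            scores.append(float(emb @ c))        # cosine similarity
            labels.append(int(spk == true_spk))
    return np.array(scores), np.array(labels)
```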

I might try using this repository trained on TIMIT to evaluate speaker verification on synthesized speech, since synthesized speech has a higher SNR than VoxCeleb1 and is more like TIMIT. Thanks :D

BarCodeReader commented 5 years ago

Oh, I see. I only used 300 epochs for training, and the loss decreased very slowly and almost stopped around 20... Actually, this is also my question about GE2E loss training: how do I know if I trained too much and the model is overfitting? On TIMIT the loss is very low and converges very fast; after 300 epochs you can already have a loss around 1.0.
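One common way to answer the overfitting question is to track EER on a held-out set of speakers and stop once it plateaus. A rough sketch, where `train_one_epoch` and `compute_val_eer` are hypothetical hooks rather than functions from this repo:

```python
import torch

def train_with_early_stopping(model, train_one_epoch, compute_val_eer,
                              max_epochs=1800, patience=20):
    # train_one_epoch(model): one pass over the training data.
    # compute_val_eer(model): EER on held-out validation speakers.
    # Both are hypothetical callables, not part of this repo's API.
    best_eer, bad_epochs = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        eer = compute_val_eer(model)
        if eer < best_eer - 1e-4:          # meaningful improvement
            best_eer, bad_epochs = eer, 0
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # validation EER has plateaued;
                break                      # more training likely overfits
    return best_eer
```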

Also, one reminder for you... if you also use this model to create d-vectors, you need to change the numbers in the yaml. You can refer to the question I asked in this repo; I think there are some mistakes there. But for training the LSTM, the yaml file settings are correct.