Janghyun1230 / Speaker_Verification

Tensorflow implementation of "Generalized End-to-End Loss for Speaker Verification"
MIT License

discrepancy in feature extraction process #9

Closed abishek1062 closed 5 years ago

abishek1062 commented 5 years ago

Why have you not extracted the 40-dimensional mel filterbank features for TDSV? You have, however, extracted these features for TISV.

I am referring to the two functions in data_preprocess.py, namely save_spectrogram_tdsv and save_spectrogram_tisv.

You have not used the librosa.filters.mel function in save_spectrogram_tdsv. Can you please elaborate on why you did so?
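For context, here is a minimal sketch of the difference being asked about, assuming 16 kHz audio and the 25 ms / 10 ms framing common in this line of work (n_fft=400 and hop_length=160 are assumptions, not necessarily the repository's exact values):

```python
import librosa
import numpy as np

def tisv_features(utter, sr=16000, n_fft=400, hop=160, n_mels=40):
    # TISV path: power spectrogram projected onto a 40-dim mel filterbank, then log-scaled
    S = np.abs(librosa.stft(utter, n_fft=n_fft, hop_length=hop)) ** 2
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (40, 201)
    return np.log10(np.dot(mel_basis, S) + 1e-6)  # (40, frames)

def tdsv_features(utter, n_fft=400, hop=160):
    # TDSV path, as I read the repo: the linear-frequency magnitude spectrogram,
    # with no mel warping applied
    return np.abs(librosa.stft(utter, n_fft=n_fft, hop_length=hop))  # (201, frames)
```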

abishek1062 commented 5 years ago

Can you please elaborate on why you used the same audio for enrollment and verification? The only difference between the enrollment and verification utterances is the noise you applied to them. Strictly speaking, when verifying two utterances, they should be two separately spoken utterances, not the same utterance compared to itself. Kindly take a look at lines 133 and 134 in model.py:

```python
S = sess.run(similarity_matrix,
             feed_dict={enroll: random_batch(shuffle=False, noise_filenum=1),
                        verif: random_batch(shuffle=False, noise_filenum=2)})
```

Almost the same parameters are passed to the function random_batch, which returns the utter_batch array evaluated from the same audio content. Thus, the same utterances are used for the enroll and verif keys in the feed_dict dictionary.
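To make the critique concrete, here is a hedged sketch of how enrollment and verification batches could instead be drawn from disjoint utterances of the same speakers. The utter_range parameter is hypothetical and not part of the repository's actual random_batch signature:

```python
# Hypothetical API: utter_range selects which utterances of each speaker to use,
# so enrollment and verification never see the same recording.
enroll_feed = random_batch(shuffle=False, utter_range=(0, 5))   # enrollment utterances
verif_feed = random_batch(shuffle=False, utter_range=(5, 10))   # held-out utterances
S = sess.run(similarity_matrix, feed_dict={enroll: enroll_feed, verif: verif_feed})
```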

Janghyun1230 commented 5 years ago

Hi @abishek1062, I think I can answer both questions at once. For TD-SV I could not obtain a proper dataset (one that contains many fixed-text utterances from each speaker). Thus, I added noise to the utterances to perform metric learning (this way I can obtain different utterances of each speaker with fixed text).

However, adding noise to the mel spectrogram makes no sense, because it is log-scale. Thus, we should add noise to the complex spectrogram of the given utterance! (See lines 49-58 of utils.py.)
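A minimal sketch of the idea, not the repository's exact code: mix the noise into the linear/complex spectrogram first, and take the log only after mixing, since additive noise in the waveform is additive in the complex STFT domain but not in the log domain. The framing parameters and the snr_scale factor are assumptions for illustration:

```python
import librosa
import numpy as np

def noisy_log_spectrogram(utter, noise, snr_scale=0.1, n_fft=400, hop=160):
    S_utter = librosa.stft(utter, n_fft=n_fft, hop_length=hop)  # complex spectrogram
    S_noise = librosa.stft(noise, n_fft=n_fft, hop_length=hop)
    frames = min(S_utter.shape[1], S_noise.shape[1])
    # Additive mixing in the complex domain (snr_scale is an assumed scaling factor)
    S_mix = S_utter[:, :frames] + snr_scale * S_noise[:, :frames]
    return np.log10(np.abs(S_mix) + 1e-6)  # log-scale only after mixing
```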