HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

Two questions on Embeddings #31

Closed mazzzystar closed 5 years ago

mazzzystar commented 5 years ago

About the align_embeddings

After calculating all the embeddings window by window, you then compute the average embedding with the align_embeddings function. But why do you hardcode the thresholds here, and what do they mean? https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/07f996ab9e06811e25b257c2098391e426f68b61/dvector_create.py#L60-L67

In the original paper, it says "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average".
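For comparison, here is a minimal sketch of the aggregation the paper describes, assuming the window-wise d-vectors are stacked into a NumPy array of shape (num_windows, emb_dim). The function name and argument are mine, not from this repo:

import numpy as np

def utterance_dvector(window_dvectors):
    # L2-normalize each window-wise d-vector, then take the
    # element-wise average over all windows, as the paper describes.
    norms = np.linalg.norm(window_dvectors, axis=1, keepdims=True)
    normalized = window_dvectors / np.clip(norms, 1e-8, None)
    return normalized.mean(axis=0)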

About the embedding frame.

In your implementation of data_preprocess.py, you only save the first N frames and the last N frames to the .npy file, as below, which may drop a lot of data if the current utterance is long: https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/07f996ab9e06811e25b257c2098391e426f68b61/data_preprocess.py#L30-L45
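If I read the linked lines correctly, what gets saved amounts to roughly the following (my paraphrase, not the exact code; S is the mel spectrogram and hp.data.tisv_frame the fixed training window length):

utterances_spec = []
# keep only the first tisv_frame frames of the utterance ...
utterances_spec.append(S[:, :hp.data.tisv_frame])
# ... and the last tisv_frame frames; everything in between is dropped
utterances_spec.append(S[:, -hp.data.tisv_frame:])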

Why not use all the frames to get more data into training? Like:

# take every non-overlapping tisv_frame-wide window of the spectrogram,
# not just the first and last one
utterances_spec = []
j = hp.data.tisv_frame
while j <= S.shape[1]:
    utterances_spec.append(S[:, j - hp.data.tisv_frame:j])
    j += hp.data.tisv_frame

====Update==== As for the second question, I tried it and think I see why: is it because most utterances are no longer than 2 * tisv_frame, so you just crop the head and the tail?

HarryVolek commented 5 years ago
  1. I just hard coded the values according to the following passage of https://arxiv.org/pdf/1810.04719.pdf: "As a brief review, in the baseline system [3], a text-independent speaker recognition network is used to extract embeddings from sliding windows of size 240ms and 50% overlap. A simple voice activity detector (VAD) with only two full-covariance Gaussians is used to remove non-speech parts, and partition the utterance into non-overlapping segments with max length of 400ms." Programmatically, the values needn't be hard coded (see the sketch after this list).

  2. If you believe many overlapping segments of the same utterance would be beneficial for your training, you could modify the preprocessing script in such a way (see the sketch below).
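For what it's worth, here is a rough sketch of both points: deriving the 240 ms / 50% overlap / 400 ms values from the STFT hop instead of hard coding them, and cutting overlapping sliding windows during preprocessing. The parameter names (sr, hop_samples) are my assumptions, not necessarily the names used in this repo's config:

# Sketch only: sr and hop_samples are illustrative values.
sr = 16000                    # sample rate in Hz
hop_samples = 160             # STFT hop length in samples (10 ms at 16 kHz)
hop_seconds = hop_samples / sr

# Derive the constants from the paper's description instead of hard coding them:
# 240 ms sliding windows with 50% overlap, segments of at most 400 ms.
window_frames = int(round(0.240 / hop_seconds))   # ~24 STFT frames per window
segment_frames = int(round(0.400 / hop_seconds))  # ~40 frames per segment
window_hop = window_frames // 2                   # 50% overlap between windows

# For point 2: cut overlapping sliding windows over a spectrogram S of shape
# (n_mels, n_frames) instead of keeping only the head and the tail.
def sliding_windows(S, win, hop):
    out = []
    start = 0
    while start + win <= S.shape[1]:
        out.append(S[:, start:start + win])
        start += hop
    return out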