HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

Two questions on Embeddings #31

Closed mazzzystar closed 5 years ago

mazzzystar commented 5 years ago

About the align_embeddings

After calculating all the embeddings window by window, you then compute the average embedding with the align_embeddings function. But why do you hardcode the thresholds here, and what do they mean? https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/07f996ab9e06811e25b257c2098391e426f68b61/dvector_create.py#L60-L67

In the original paper, it says "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average".
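For comparison, here is a minimal sketch of the aggregation the paper describes, assuming the window-wise d-vectors are stacked into a NumPy array of shape (num_windows, emb_dim). The function name and argument are mine, not from this repo:

import numpy as np

def utterance_dvector(window_dvectors):
    # L2-normalize each window-wise d-vector, then take the
    # element-wise average over all windows, as the paper describes.
    norms = np.linalg.norm(window_dvectors, axis=1, keepdims=True)
    normalized = window_dvectors / np.clip(norms, 1e-8, None)
    return normalized.mean(axis=0)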

About the embedding frame.

In your implementation of data_preprocess.py, you only save the first N frames and the last N frames to the .npy file, as below, which may drop a lot of data if the current utterance is long: https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/07f996ab9e06811e25b257c2098391e426f68b61/data_preprocess.py#L30-L45
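If I read the linked lines correctly, what gets saved amounts to roughly the following (my paraphrase, not the exact code; S is the mel spectrogram and hp.data.tisv_frame the fixed training window length):

utterances_spec = []
# keep only the first tisv_frame frames of the utterance ...
utterances_spec.append(S[:, :hp.data.tisv_frame])
# ... and the last tisv_frame frames; everything in between is dropped
utterances_spec.append(S[:, -hp.data.tisv_frame:])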

Why not use all the frames to get more data into training? Like:

# take every non-overlapping tisv_frame-wide window of the spectrogram,
# not just the first and last one
utterances_spec = []
j = hp.data.tisv_frame
while j <= S.shape[1]:
    utterances_spec.append(S[:, j - hp.data.tisv_frame:j])
    j += hp.data.tisv_frame

====Update==== As for the second question, I tried it and think I see why: is it because most utterances are no longer than 2 * tisv_frame, so you just crop the head and the tail?

HarryVolek commented 5 years ago
  1. I just hard coded the values according to the following passage of https://arxiv.org/pdf/1810.04719.pdf: "As a brief review, in the baseline system [3], a text-independent speaker recognition network is used to extract embeddings from sliding windows of size 240ms and 50% overlap. A simple voice activity detector (VAD) with only two full-covariance Gaussians is used to remove non-speech parts, and partition the utterance into non-overlapping segments with max length of 400ms." Programmatically, the values needn't be hard coded (see the sketch after this list).

  2. If you believe many overlapping segments of the same utterance would be beneficial for your training, you could modify the preprocessing script in such a way (see the sketch below).
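For what it's worth, here is a rough sketch of both points: deriving the 240 ms / 50% overlap / 400 ms values from the STFT hop instead of hard coding them, and cutting overlapping sliding windows during preprocessing. The parameter names (sr, hop_samples) are my assumptions, not necessarily the names used in this repo's config:

# Sketch only: sr and hop_samples are illustrative values.
sr = 16000                    # sample rate in Hz
hop_samples = 160             # STFT hop length in samples (10 ms at 16 kHz)
hop_seconds = hop_samples / sr

# Derive the constants from the paper's description instead of hard coding them:
# 240 ms sliding windows with 50% overlap, segments of at most 400 ms.
window_frames = int(round(0.240 / hop_seconds))   # ~24 STFT frames per window
segment_frames = int(round(0.400 / hop_seconds))  # ~40 frames per segment
window_hop = window_frames // 2                   # 50% overlap between windows

# For point 2: cut overlapping sliding windows over a spectrogram S of shape
# (n_mels, n_frames) instead of keeping only the head and the tail.
def sliding_windows(S, win, hop):
    out = []
    start = 0
    while start + win <= S.shape[1]:
        out.append(S[:, start:start + win])
        start += hop
    return out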