mazzzystar closed this issue 5 years ago
I just hard coded the values according to the following passage of https://arxiv.org/pdf/1810.04719.pdf:
As a brief review, in the baseline system [3], a text-independent speaker recognition network is used to extract embeddings from sliding windows of size 240ms and 50% overlap. A simple voice activity detector (VAD) with only two full-covariance Gaussians is used to remove non-speech parts, and partition the utterance into non-overlapping segments with max length of 400ms.
Programmatically, the values needn't be hard coded.
If you believe many overlapping segments of the same utterance would be beneficial for your training, you could modify the preprocessing script accordingly.
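To illustrate deriving the values programmatically instead of hard coding them, here is a minimal sketch (not the repo's actual code; the function and parameter names are mine) that computes window boundaries from the sample rate and the paper's 240ms / 50%-overlap settings:

```python
def sliding_window_bounds(num_samples, sr=16000, window_ms=240, overlap=0.5):
    """Derive (start, end) sample indices for sliding windows from the
    sample rate, instead of hard-coding sample counts.

    With sr=16000 and window_ms=240, each window is 3840 samples and the
    hop is 1920 samples (50% overlap). Utterances shorter than one
    window yield an empty list.
    """
    win = int(sr * window_ms / 1000)   # samples per window
    hop = int(win * (1 - overlap))     # hop size: half a window at 50% overlap
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]

# 1 second of 16 kHz audio -> windows starting at samples 0, 1920, 3840, ...
bounds = sliding_window_bounds(16000)
```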
About the `align_embeddings` function

After calculating all the embeddings window-by-window, you then compute the average embedding with the `align_embeddings` function. But why do you hardcode the thresholds here, and what do they mean? https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/07f996ab9e06811e25b257c2098391e426f68b61/dvector_create.py#L60-L67

In the original paper, it says "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average".
About the embedding frames

In your implementation of `data_preprocess.py`, you only save the first N frames and the last N frames of each utterance to the `.npy` file, as below, which may drop a lot of data when an utterance is long: https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/07f996ab9e06811e25b257c2098391e426f68b61/data_preprocess.py#L30-L45

Why not use all the frames to get more data into training? Like:
====Update====

As for the second question, I tried it and now I see why: is it because most utterances are no longer than 2 * `tisv_frame`, so you just crop the head and the tail?