Janghyun1230 / Speaker_Verification

Tensorflow implementation of "Generalized End-to-End Loss for Speaker Verification"
MIT License

Inference different from paper #2

Closed · fazlekarim closed this issue 5 years ago

fazlekarim commented 6 years ago

Hi,

Are you doing the following?

During inference time, for every utterance we apply a sliding window of fixed size (lb + ub)/2 = 160 frames with 50% overlap. We compute the d-vector for each window. The final utterance-wise d-vector is generated by L2-normalizing the window-wise d-vectors, then taking the element-wise average (as shown in Figure 4).
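
For concreteness, here is a minimal sketch of that inference procedure. `embed` is a hypothetical callable standing in for the trained network (it maps one window of frames to a d-vector); the windowing, normalization, and averaging follow the quoted paragraph:

```python
import numpy as np

def utterance_dvector(frames, embed, window=160):
    """Sliding-window d-vector inference as quoted above.

    frames: (T, F) array of acoustic frames; assumes T >= window.
    embed:  hypothetical callable mapping a (window, F) slice to a 1-D d-vector.
    """
    hop = window // 2  # 50% overlap
    dvecs = []
    for start in range(0, len(frames) - window + 1, hop):
        d = embed(frames[start:start + window])
        dvecs.append(d / np.linalg.norm(d))  # L2-normalize each window-wise d-vector
    return np.mean(dvecs, axis=0)            # element-wise average
```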

Janghyun1230 commented 6 years ago

This is for the text-independent case. The way I compute the d-vector is the same as in the paper, as you said. However, in my case I split the audio on silence, so that each segment in the dataset contains content (a word). But some segments are too short for the sliding window; in that case I take windows sliced from the front and the back of the segment.
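
One reading of that short-segment handling, as a sketch (same hypothetical `embed` as above; this assumes the segment still has at least `window` frames, so the front and back windows simply overlap in the middle):

```python
import numpy as np

def short_segment_dvector(frames, embed, window=160):
    """Front/back windowing for segments too short for a full slide.

    Take one window anchored at the front and one at the back,
    embed both, then normalize and average as in the usual pipeline.
    Assumes len(frames) >= window.
    """
    windows = [frames[:window], frames[-window:]]
    dvecs = [embed(w) for w in windows]
    dvecs = [d / np.linalg.norm(d) for d in dvecs]  # L2-normalize each window-wise d-vector
    return np.mean(dvecs, axis=0)                    # element-wise average
```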