During inference time, for every utterance we apply a sliding
window of fixed size (lb + ub)/2 = 160 frames with 50% overlap.
We compute the d-vector for each window. The final utterance-wise
d-vector is generated by L2 normalizing the window-wise d-vectors,
then taking the element-wise average (as shown in Figure 4). (Quoted from the paper "Generalized End-to-End Loss for Speaker Verification".)
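
The procedure in the quote can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It assumes lb = 140 and ub = 180 (values consistent with the stated window of 160 frames), a step of 80 frames for the 50% overlap, and a hypothetical `embed_fn` that stands in for the trained network and maps one window of frames to one d-vector. How to treat utterances shorter than one window, or trailing frames that do not fill a window, is not specified in the quote. Here a short utterance is embedded as a single window and leftover frames are dropped, both of which are assumptions.

```python
import numpy as np


def sliding_windows(num_frames, window=160, overlap=0.5):
    """Return (start, end) frame indices of fixed-size, overlapping windows."""
    step = max(1, int(window * (1.0 - overlap)))  # 80 frames for 50% overlap
    if num_frames < window:
        # Assumption: an utterance shorter than one window is used whole.
        return [(0, num_frames)]
    # Assumption: trailing frames that do not fill a full window are dropped.
    return [(s, s + window) for s in range(0, num_frames - window + 1, step)]


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def utterance_dvector(features, embed_fn, window=160, overlap=0.5):
    """Average the L2-normalized window-wise d-vectors of one utterance.

    features: array of shape (num_frames, feature_dim)
    embed_fn: hypothetical stand-in for the speaker encoder; maps an array of
              shape (frames, feature_dim) to a 1-D embedding
    """
    spans = sliding_windows(len(features), window, overlap)
    window_dvecs = np.stack([embed_fn(features[s:e]) for s, e in spans])
    # Normalize each window-wise d-vector first, then average element-wise.
    return l2_normalize(window_dvecs).mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(400, 40))        # 400 frames, 40 features per frame
    toy_embed = lambda w: w.mean(axis=0)      # toy embedding, NOT a real model
    print(sliding_windows(len(feats)))
    print(utterance_dvector(feats, toy_embed).shape)
```

The averaged vector is generally not unit length. That does not matter if utterances are compared by cosine similarity, which is unaffected by scale.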