In the report (Section 3.2, "Diarization pipeline"):
"To perform the diarization, each input recording is first split into speech segments according to the oracle VAD and the segments shorter than 0.1 s are discarded. From these segments, x-vectors are extracted every 0.25 s from overlapping sub-segments of 1.5 s (or less than 1.5 s for the last sub-segments or shorter segments). The x-vectors are centered, whitened and length normalized (Garcia-Romero and Espy-Wilson, 2011) (which is also done for the PLDA training data)."
However, in predict.py (https://github.com/BUTSpeechFIT/VBx/blob/57466e6e245d5cdfe2e88ee6503702ace3ffdd03/VBx/predict.py#L168) the threshold is different: segments shorter than 0.01 s are discarded, not 0.1 s.
Likewise, according to https://github.com/BUTSpeechFIT/VBx/blob/57466e6e245d5cdfe2e88ee6503702ace3ffdd03/VBx/predict.py#L89-L90, x-vectors are extracted every 0.24 s from overlapping sub-segments of 1.44 s, not every 0.25 s from sub-segments of 1.5 s.
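To make the difference concrete, here is a minimal sketch of the arithmetic. It is not the code from predict.py. It assumes a 10 ms frame shift and a 16 kHz sample rate, and it assumes the window length and jump are given in frames (144 and 24). The names `frames_to_seconds`, `min_samples` and `subsegment_windows` are hypothetical helpers introduced only for illustration.

```python
FRAME_SHIFT_S = 0.01   # assumption: 10 ms frame shift
SAMPLE_RATE = 16000    # assumption: 16 kHz audio


def frames_to_seconds(n_frames, frame_shift_s=FRAME_SHIFT_S):
    """Convert a frame count to seconds (rounded to avoid float noise)."""
    return round(n_frames * frame_shift_s, 6)


def min_samples(min_dur_s, sample_rate=SAMPLE_RATE):
    """Number of samples corresponding to a minimum segment duration."""
    return int(round(min_dur_s * sample_rate))


def subsegment_windows(n_frames, seg_len, seg_jump):
    """Return (start, end) frame indices of overlapping sub-segments.

    The last window is truncated at the end of the segment, so it may be
    shorter than seg_len, as may the only window of a short segment.
    """
    if n_frames <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + seg_len, n_frames)
        windows.append((start, end))
        if end == n_frames:
            break
        start += seg_jump
    return windows


# Window length and jump, paper vs. code (under the assumptions above).
print(frames_to_seconds(144), frames_to_seconds(24))   # code values, in seconds
print(frames_to_seconds(150), frames_to_seconds(25))   # paper values, in seconds

# Minimum segment length in samples for the two thresholds.
print(min_samples(0.1), min_samples(0.01))

# Sub-segments of a hypothetical 2 s (200-frame) speech segment.
print(subsegment_windows(200, seg_len=144, seg_jump=24))
```

Under these assumptions the two thresholds differ by a factor of ten in samples, while the windowing differs only by 6 frames in length and 1 frame in jump.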