We train speaker embeddings directly from raw audio files because we do not have enough disk space to store intermediate features. Unlike TTS, speaker verification needs a very large amount of data (more than 1000 hours) to perform well, and the space needed for precomputed features grows with the size of the dataset. If your server has enough space to store fbank features, I suggest you extract them in advance and use the other dataloader.
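To judge whether your server has enough space, a rough estimate of the fbank storage cost can help. The sketch below is only a back-of-the-envelope calculation, not part of this project: the function name `fbank_storage_bytes` is hypothetical, and the 10 ms frame shift, 80 mel bins, and float32 storage are assumptions that you should replace with your own feature configuration.

```python
def fbank_storage_bytes(hours, frame_shift_ms=10.0, num_mel_bins=80, bytes_per_value=4):
    """Estimate the disk space needed for uncompressed fbank features.

    All defaults are assumptions (10 ms frame shift, 80 mel bins, float32);
    adjust them to match your own feature extraction settings.
    """
    frames_per_second = 1000.0 / frame_shift_ms
    total_frames = hours * 3600 * frames_per_second
    return total_frames * num_mel_bins * bytes_per_value


if __name__ == "__main__":
    # Print the estimate for a few dataset sizes, in decimal gigabytes.
    for hours in (1000, 2000, 5000):
        gigabytes = fbank_storage_bytes(hours) / 1e9
        print(f"{hours} h of audio -> about {gigabytes:.1f} GB of fbank features")
```

Under these assumptions, 1000 hours of audio comes to roughly 115 GB of features, which you would need on top of the raw audio itself. File format overhead and compression will change the real number.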