caizexin / tf_multispeakerTTS_fc

the Tensorflow version of multi-speaker TTS training with feedback constraint
MIT License

Why not preprocess wav to fbank before training the speaker embedding? Extracting features on the fly makes training too slow #5

Open jiazj-jiazj opened 2 years ago

caizexin commented 2 years ago

We use raw audio files as the input for speaker-embedding training because we don't have enough disk space to store the intermediate features. Unlike TTS, speaker verification requires data at a massive scale (more than 1000 hours) for good performance, and more data needs more space. If your server has enough space to store fbank features, I suggest you preprocess them and use the other dataloader.
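If disk space is not a constraint, the precomputation suggested above can be sketched roughly as follows. This is a minimal NumPy-only illustration of log-mel filterbank extraction, not the repo's actual feature pipeline; the frame length, hop, and mel-band counts below are assumptions, and the real dataloader uses its own hyperparameters.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Build a triangular mel filterbank matrix of shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Band edges spaced evenly on the mel scale, then mapped back to FFT bins
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising slope of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling slope of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_fbank(wav, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Compute log-mel filterbank features from a mono waveform.

    Parameters here (25 ms frames, 10 ms hop, 40 mel bands) are common
    speaker-verification defaults, not values taken from this repo.
    """
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)               # taper each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # power spectrogram
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-6)                         # floor avoids log(0)

# Precompute once per utterance and save, so training only loads .npy files:
# np.save("utt001.npy", log_fbank(waveform))
```

As a rough storage estimate, 1000 hours at 100 frames per second with 40-dim float32 features is about 1000 × 3600 × 100 × 40 × 4 bytes ≈ 58 GB, which is why the trade-off depends on your server's free disk space.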