CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Data used for pretraining speaker encoder? #1032

Closed by cheulyop 2 years ago

cheulyop commented 2 years ago

A similar question was asked in #78 but it was closed without an answer.

So, on which data is the provided speaker encoder pretrained? I looked through the wiki and issues but couldn't find an answer. Was it pretrained on a combination of LibriSpeech and VoxCeleb 1 & 2, as mentioned in the thesis? @CorentinJ

image

In our case, we are taking the pretrained encoder (encoder.pt) and want to fine-tune its last linear layer and similarity scaling parameters on our own dataset.

Knowing on which data the encoder was pretrained would be of much help.
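
To make the fine-tuning setup concrete, here is a minimal PyTorch sketch of freezing everything except the last linear layer and the similarity scale and bias. It is an illustration only: `StandInSpeakerEncoder`, its attribute names (`lstm`, `linear`, `similarity_weight`, `similarity_bias`), its default sizes, the `freeze_all_but` helper and the learning rate are all assumptions made for this example, so the names should be checked against the actual encoder model in the repo before reuse.

```python
import torch
from torch import nn


class StandInSpeakerEncoder(nn.Module):
    """Hypothetical stand-in for the encoder: LSTM stack, final linear layer,
    and learnable similarity scale/bias. All names and sizes are assumptions."""

    def __init__(self, n_mels=40, hidden=256, emb=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=layers, batch_first=True)
        self.linear = nn.Linear(hidden, emb)
        self.similarity_weight = nn.Parameter(torch.tensor([10.0]))
        self.similarity_bias = nn.Parameter(torch.tensor([-5.0]))

    def forward(self, mels):
        # mels: (batch, frames, n_mels); use the last layer's final hidden state
        _, (hidden, _) = self.lstm(mels)
        embeds = torch.relu(self.linear(hidden[-1]))
        # L2-normalise the embeddings
        return embeds / (embeds.norm(dim=1, keepdim=True) + 1e-5)


def freeze_all_but(model, trainable_prefixes):
    """Set requires_grad=True only for parameters whose name starts with one
    of the given prefixes; return the names left trainable."""
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


model = StandInSpeakerEncoder()
# With a real checkpoint, the pretrained weights would be loaded before freezing.
trainable_names = freeze_all_but(
    model, ["linear.", "similarity_weight", "similarity_bias"]
)
# Pass only the unfrozen parameters to the optimizer (learning rate is arbitrary).
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

Passing only the unfrozen parameters to the optimizer keeps the LSTM weights at their pretrained values, so the fine-tuning dataset only reshapes the final projection and the similarity scaling.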

CorentinJ commented 2 years ago

The training data was the training set from LibriSpeech, VoxCeleb1 Dev A - D and VoxCeleb2, resulting in 3201 hours of data with 8371 different speakers.