Closed desis123 closed 1 week ago
The 100 hours of audio do not need to come from a single speaker. However, the model will clone a voice better if that speaker appears in your training data than if the speaker was not in the training dataset.
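If it helps, here is a rough sketch for checking how a pooled dataset splits across speakers and whether the combined total reaches your target. It assumes you can list each clip as a `(speaker_id, duration_in_seconds)` pair. The `clips` list, the speaker names, and the `hours_per_speaker` helper are all made up for illustration and are not part of any particular toolkit.

```python
from collections import defaultdict


def hours_per_speaker(clips):
    """Sum clip durations (in seconds) per speaker and return hours per speaker."""
    totals = defaultdict(float)
    for speaker_id, duration_s in clips:
        totals[speaker_id] += duration_s
    return {spk: secs / 3600.0 for spk, secs in totals.items()}


# Hypothetical manifest: (speaker_id, clip duration in seconds)
clips = [
    ("spk_a", 1800.0),
    ("spk_b", 5400.0),
    ("spk_a", 1800.0),
]

per_speaker = hours_per_speaker(clips)
total_hours = sum(per_speaker.values())  # pooled duration across all speakers

print(per_speaker)
print(total_hours)
```

The pooled total is what counts toward the 100 hours, while the per-speaker breakdown shows which voices the model has actually heard during training.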
I have found that a good model requires approximately 100 hours of audio data for training. My question is: do these 100 hours need to come from a single speaker, or can audio from multiple speakers be combined to reach the total duration?