artificalaudio opened this issue 1 year ago
I have the same question about finetuning from a base pretrained model to a custom voice!
interesting too
The training procedure should work at different frequencies, given that source and target audio remain synced. The CPU voices for the official app are trained at 22.5kHz, for example. Training time is indeed longer for higher sample rates, and I can't guarantee convergence for datasets over 22.5kHz.
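For illustration, here's a minimal sketch (not code from this repo) of loading a source/target pair and resampling both sides to one rate. The file paths and the 22050 Hz target are assumptions; the point is that resampling both clips identically preserves their time alignment:

```python
import torchaudio
import torchaudio.functional as F

def load_pair_at_rate(src_path: str, tgt_path: str, target_sr: int = 22050):
    """Load a source/target clip pair and resample both to target_sr."""
    src, src_sr = torchaudio.load(src_path)
    tgt, tgt_sr = torchaudio.load(tgt_path)
    # Resampling both sides to the same rate keeps the pair time-synced,
    # which is what the training procedure depends on.
    src = F.resample(src, orig_freq=src_sr, new_freq=target_sr)
    tgt = F.resample(tgt, orig_freq=tgt_sr, new_freq=target_sr)
    # Trim to the shorter clip so small length drift from resampling
    # doesn't break the pairing.
    n = min(src.shape[-1], tgt.shape[-1])
    return src[..., :n], tgt[..., :n]
```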
As for fine-tuning to speaker identities: in personal experiments I've found the procedure to be somewhat faster, around 1-2 days, but I can't give you a hard number.
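Conceptually, fine-tuning here is just resuming training from the base checkpoint, usually with a lower learning rate. A generic PyTorch sketch, where the checkpoint layout and learning rate are assumptions rather than this repo's exact setup:

```python
import torch

def prepare_finetune(model: torch.nn.Module, ckpt_path: str, lr: float = 1e-5):
    """Load base weights into `model` and return a low-LR optimizer."""
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        # Some checkpoints nest the weights under a key; adjust as needed.
        state = state["model"]
    model.load_state_dict(state)
    # A smaller learning rate than a from-scratch run nudges the base
    # weights toward the new speaker instead of retraining everything.
    return torch.optim.AdamW(model.parameters(), lr=lr)
```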
Hi,
Apologies if this is a silly question. If I train a model with a dataset made at a different sample rate, will this technique still work? E.g. the training data would be normal speech/singing at 40kHz, paired with time-synced responses from a 40kHz RVC model.
Without changing anything internal to the LLVC model, can I use a different sample rate (granted that I've made a dataset at 40kHz, for instance)?
(Would changing the SR in config actually do anything to the model?)
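For context, here's roughly what I'd imagine that change looking like — the config path and the "sample_rate" key are guesses on my part, not the actual LLVC config fields:

```python
import json

CONFIG_PATH = "experiments/my_run/config.json"  # assumed path

with open(CONFIG_PATH) as f:
    cfg = json.load(f)

# Changing this presumably only tells the pipeline what rate to expect;
# the dataset itself would still have to be prepared at 40kHz.
cfg["sample_rate"] = 40000

with open(CONFIG_PATH, "w") as f:
    json.dump(cfg, f, indent=2)
```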
I think the paper said 3 days on a decent GPU; I'm guessing training time would be longer for a higher sample rate.
I'm also intrigued by the paper's mention of fine-tuning to speaker identities. Is it always ~3 days of training, or, once you have a base pretrained model, does fine-tuning to a custom voice take less time?
Thank you