Closed: jmif closed this issue 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Bump, would love to hear thoughts on this, thanks 🙏
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
First of all thank you so much for the work you put into this project, you're doing amazing work.
We've got a pre-trained model deployed and working (FastSpeech2 + MB-MelGAN), and we're beginning to work on ways to improve these models over time. There are two things we'd like to achieve:
We're trying to understand how a fine-tuned voice-cloning process relates to, and differs from, teaching the model how to pronounce new words.
We've found quite a few examples of voice cloning via fine-tuning in this repo and are beginning our explorations there now; thanks so much for these. We've also been researching how to improve pronunciation of specific words, and we're having a hard time figuring that out.
From the issues I've read, it seems that pronunciation of specific words can be improved by fine-tuning on additional data samples, meaning we could fine-tune the available models with samples from a different voice while keeping the same output voice. Is this correct? If so, could you help us understand conceptually how this works, and what the training process would look like at a high level? If we eventually want to clone a voice via fine-tuning, would that change how we teach pronunciations?
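To make the question concrete, here's a rough sketch of how we currently imagine the fine-tuning data being assembled: each batch mixes a small share of new-pronunciation samples into mostly original-voice samples, so the model picks up the new words without drifting away from the target voice. This is just our mental model, not code from this repo, and all names (`build_finetune_batch`, the `new_fraction` parameter, the sample lists) are hypothetical:

```python
import random

def build_finetune_batch(original_samples, new_word_samples,
                         new_fraction=0.2, batch_size=8, seed=0):
    """Hypothetical sketch: mix a small fraction of new-pronunciation
    samples into each fine-tuning batch, keeping mostly original-voice
    samples so the output voice is preserved."""
    rng = random.Random(seed)
    n_new = max(1, int(batch_size * new_fraction))   # e.g. 1-2 new samples
    n_orig = batch_size - n_new                      # rest from original voice
    batch = (rng.sample(original_samples, n_orig)
             + rng.sample(new_word_samples, n_new))
    rng.shuffle(batch)
    return batch

# Toy usage with placeholder sample IDs:
orig = [f"orig_{i}" for i in range(100)]
new = [f"new_{i}" for i in range(10)]
batch = build_finetune_batch(orig, new)
print(len(batch))  # 8
```

Is this roughly the right picture, or does pronunciation fine-tuning work differently (e.g. at the phoneme/lexicon level rather than via mixed audio data)?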
Thank you!