New word pronunciation vs Voice Cloning

jmif commented 2 years ago

First of all thank you so much for the work you put into this project, you're doing amazing work.

We've got a pre trained model deployed and working (fastspeech2 + mbmelgan) and we're beginning to work on ways to improve these models over time. There are two things we'd like to achieve:

Custom voices / voice cloning
Improved pronunciation of specific words

We're trying to understand how a fine tuned voice clone process relates to / differs from teaching the model how to pronounce new words.

We've found quite a few examples of voice cloning via fine tuning in this repo and we're beginning our explorations here now. Thanks so much for these. We've also been researching how improving pronunciation of specific words works and we're having a hard time figuring that one out.

From the issues I've read it seems that improved pronunciation of words can be done via fine tuning by providing additional data samples, meaning that we can fine tune the available models with samples from a different voice while maintaining the same output voice. Is this correct? If so could you help us conceptually understand how this works / high level what the training process would look like? If we eventually want to clone our a voice via fine tuning, would that change how we teach pronunciations?

Thank you!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

jmif commented 2 years ago

Bump, would love to hear thoughts on this, thanks 🙏

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

TensorSpeech / TensorFlowTTS

New word pronunciation vs Voice Cloning #719