CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Request: Add support for GSTs (Global Style Tokens) (Tacotron Prosody from second ref file) #230

Closed: steven850 closed this issue 4 years ago

steven850 commented 4 years ago

So I'm not 100% sure whether this is already implemented in the version of Tacotron that's included with this build, but Tacotron now has support for GSTs (Global Style Tokens). I think it's already using these, because if I record a sample and speak in a monotone voice, the output speech is also monotone; if I make the voice happier and fluctuate the intonation, the output does the same, which makes me think that the Tacotron here is using the GSTs.

What I would like to do is use a second audio file as the reference for the GSTs instead of the original voice sample. So, for example, my voice sample is happy and upbeat, and I want the output to be sad: I want it to use the voice from sample 1 and the prosody from sample 2. Here is what I'm referring to, with really nice examples: https://ai.googleblog.com/2018/03/expressive-speech-synthesis-with.html

As you can see in the image below, it analyzes the input file and creates the GSTs for the output. Now, assuming this is already running in this version, is there a way I can have it use a second file for just the prosody? Can anyone help with this?

Now, if the version of Tacotron included here doesn't have GSTs enabled, is it possible to replace it with this version here? https://syang1993.github.io/gst-tacotron/
If anyone could help out with this I would GREATLY appreciate it.

ghost commented 4 years ago

The Tacotron 2 used here is based on Rayhane-mamah's implementation, which does not support GST.

ghost commented 4 years ago

Mozilla's TTS has a PR implementing this feature: https://github.com/mozilla/TTS/pull/451

ghost commented 4 years ago

This is a simple concept but difficult to implement in a user-friendly way (by which I mean being able to accomplish the desired result without having to edit code). Mozilla TTS asks the user to specify the prosody embedding directly, and it is then concatenated in the same manner as the speaker embedding. That kind of implementation is only suitable for researchers and users with a technical background.
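For illustration only (this is not code from RTVC or Mozilla TTS, and the names are placeholders), here is a minimal sketch of what that kind of conditioning looks like: a fixed-size prosody embedding is broadcast over time and concatenated onto the text-encoder states in the same way the speaker embedding is.

```python
import torch

def condition_encoder_outputs(encoder_outputs, speaker_embed, prosody_embed):
    """
    encoder_outputs: (batch, time, enc_dim)  text-encoder states
    speaker_embed:   (batch, spk_dim)        utterance-level speaker embedding
    prosody_embed:   (batch, pro_dim)        utterance-level prosody/GST embedding
    Returns a tensor of shape (batch, time, enc_dim + spk_dim + pro_dim).
    """
    t = encoder_outputs.size(1)
    # Broadcast the fixed-size embeddings across the time axis, then concatenate.
    spk = speaker_embed.unsqueeze(1).expand(-1, t, -1)
    pro = prosody_embed.unsqueeze(1).expand(-1, t, -1)
    return torch.cat([encoder_outputs, spk, pro], dim=-1)
```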

arXiv:1803.09047 mentions the use of a "prosody encoder" for automatic classification. However, I think the architecture we have for the speaker encoder could work for this, except one would put the wav files into separate folders representing each desired prosody feature instead of grouping them by speaker.
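As a rough sketch of that idea (the `prosody_dataset/` layout and labels below are hypothetical, not part of this repo): each subfolder becomes a prosody class and is treated exactly like a speaker by the existing GE2E training code.

```python
# Hypothetical dataset layout for reusing the speaker-encoder training pipeline:
#
#   prosody_dataset/
#     neutral/   utt_0001.wav, utt_0002.wav, ...
#     happy/     utt_0101.wav, ...
#     sad/       utt_0201.wav, ...
from pathlib import Path

def list_prosody_classes(root="prosody_dataset"):
    # Each subdirectory plays the role of a "speaker" in GE2E training.
    return {d.name: sorted(d.glob("*.wav"))
            for d in Path(root).iterdir() if d.is_dir()}
```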

Finetuning an existing synthesizer model in this way does not seem reasonable unless 1) the prosody dataset is sufficiently large, or 2) the baseline synthesizer has consistent prosody. I can see training a synth on LibriTTS and calling that "American accent", followed by finetuning on some subset of VCTK and defining that as "UK accent". Aside from accent, I do not see any reasonable use cases for finetuning which means the prosody encoder needs to be in use during initial training of models.

One idea I have is to have a generic "encoder" module that operates an encoder network consisting of a speaker encoder and one or more prosody encoders. These encoders could be trained independently. Something that might work well is to first train the prosody encoders, then concat the prosody embedding with the speaker embedding when evaluating GE2E loss during training of the speaker encoder. Doing this should make the speaker encoder's utterance embeddings more invariant with respect to prosody.
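A loose sketch of that training scheme, with illustrative names (`speaker_encoder` and `prosody_encoder` are placeholders, not modules from this repo): the pre-trained prosody encoder is frozen, and its embedding is appended to each utterance's speaker embedding before the GE2E loss is computed.

```python
import torch

def combined_embeds(speaker_encoder, prosody_encoder, utterances):
    """
    utterances: (speakers, utters_per_speaker, frames, mel_dim)
    Returns unit-norm joint embeddings shaped for the GE2E loss.
    """
    s, u = utterances.shape[:2]
    flat = utterances.flatten(0, 1)
    spk = speaker_encoder(flat)                       # trainable speaker encoder
    with torch.no_grad():
        pro = prosody_encoder(flat)                   # frozen, pre-trained prosody encoder
    joint = torch.cat([spk, pro], dim=-1)
    joint = joint / joint.norm(dim=-1, keepdim=True)  # GE2E expects unit-norm embeddings
    return joint.view(s, u, -1)                       # feed this to the GE2E loss
```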

In any event, having scoped out this issue, I have decided against working on it due to a general lack of interest, and also because no simple, user-friendly implementation exists. If you want the feature, you could try Mozilla TTS, as GST support is merged in their dev branch now.

steven850 commented 4 years ago

Hi blue-fish, I see that you are actively working on this repo, and I would love to help with testing and training. I can offer my 3990X to do some heavy lifting. How can I get in touch?

ghost commented 4 years ago

Hi @steven850, thank you for offering to contribute your time and hardware! Please download the LibriTTS train-clean-100 and train-clean-360 datasets (available at https://openslr.org/60 ) and we'll put your amazing CPU to work. I have in mind training a new vocoder model with a higher sampling rate (22,050 Hz instead of the current 16,000 Hz). I should have the logistics for communication figured out shortly.

steven850 commented 4 years ago

@blue-fish I grabbed the datasets. I also still have the datasets that Corentin trained on originally. What about a Discord server?

ghost commented 4 years ago

@steven850 I need to take some time to decide on the specific parameters for the vocoder. I have in mind making the 22,050 Hz parameters compatible with some other vocoders, like WaveGlow, MelGAN or Parallel WaveGAN. If you have any thoughts or ideas on this, let me know.
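For reference, the sketch below lists mel/STFT settings commonly used by those 22,050 Hz vocoders. The exact fmin/fmax values differ between the projects, so treat this as an illustrative starting point rather than a decided configuration.

```python
# Illustrative 22,050 Hz analysis parameters; not a final choice.
hparams_22k = dict(
    sample_rate=22050,
    n_fft=1024,
    win_length=1024,   # ~46 ms analysis window
    hop_length=256,    # ~11.6 ms hop
    num_mels=80,
    fmin=0,            # WaveGlow-style; Parallel WaveGAN defaults use a higher fmin
    fmax=8000,
)
```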

I invited you to a repo with a link to the Slack that I've set up for this. I've also been using that to communicate with mbdash on encoder training.

ghost commented 2 years ago

GST has been integrated with RTVC in https://github.com/babysor/MockingBird/pull/137