CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

how to improve voice quality? #41

Open · wotulong opened this issue 5 years ago

wotulong commented 5 years ago

Thanks for your great work. I have tried this project on my own computer (Win10, 1060 Ti 3 GB), and I think the similarity of the voice is good. Do you have any ideas on how to improve the quality and similarity of the voice? Is parallel WaveNet a good way to go? Thank you.

CorentinJ commented 5 years ago

My implementations of the synthesizer and the vocoder aren't that great, and I've also trained on LibriSpeech when LibriTTS would have been preferable. I still think fatchord's WaveRNN is very good, and I wouldn't swap it for another vocoder right now.

If someone were to seriously improve on the quality, I would recommend using both his synthesizer and vocoder instead of the ones I currently have, and training on LibriTTS.

While I would be willing to help this come to fruition in my repo, I unfortunately cannot afford to work full-time on it.

sdx0112 commented 5 years ago

Are you using the toolbox to clone your voice? I tried to clone from a 30-minute audio file of a single speaker by loading it into the toolbox, but the resulting voice is not very similar to the input.

CorentinJ commented 5 years ago

My voice works poorly with the model; other voices work nicely. I would not recommend using a 30-minute audio file. While technically it should work, the framework is meant to operate on short input speech. Try cutting just 5 seconds from that audio. If the speaker is not a native English speaker, there's a good chance the voice will be poorly cloned.
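
Something along these lines would do for the trimming, assuming librosa and soundfile are installed (the filenames and the offset are placeholders, not files from this repo):

    import librosa
    import soundfile as sf

    # Load the long recording at its native sample rate.
    wav, sr = librosa.load("speaker_30min.wav", sr=None)

    # Pick any clean, single-speaker 5-second region (offset chosen arbitrarily here).
    start_s, dur_s = 60, 5
    clip = wav[start_s * sr : (start_s + dur_s) * sr]

    # Save the short clip and load this file in the toolbox instead of the full recording.
    sf.write("speaker_5s.wav", clip, sr)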

sdx0112 commented 5 years ago

The speaker has a very thick voice, but the cloned result sounds like a normal person.

wotulong commented 5 years ago

The speaker has a very thick voice, but the cloned result sounds like a normal person.

Maybe you could try splitting the voice into five-second pieces, getting the embeddings of all those pieces, and synthesizing with the mean of the embeddings. I don't know if it helps, but maybe you could give it a try.
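
A rough sketch of what I mean, assuming the repo's encoder.inference API (load_model, preprocess_wav, embed_utterance) and placeholder paths:

    from pathlib import Path
    import numpy as np
    from encoder import inference as encoder

    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))  # path may differ per repo version
    wav = encoder.preprocess_wav(Path("long_recording.wav"))        # placeholder input file

    sr = 16000                     # the encoder operates on 16 kHz audio
    piece = 5 * sr                 # five-second pieces, as suggested
    pieces = [wav[i:i + piece] for i in range(0, max(len(wav) - piece, 0) + 1, piece)]

    # Embed every piece, then average and re-normalize the result.
    embeds = np.array([encoder.embed_utterance(p) for p in pieces])
    mean_embed = embeds.mean(axis=0)
    mean_embed /= np.linalg.norm(mean_embed)
    # mean_embed can then be fed to the synthesizer in place of a single utterance embedding.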

CorentinJ commented 5 years ago

The speaker has a very thick voice, but the cloned result sounds like a normal person.

Yes, the synthesizer is trained to always output a voice. Even if you input noise or a piece of music as the reference, you will get an "average human" voice as output. That's normal. It also means that you're giving the model a reference audio it has not learned to generalize with. Either the speaker encoder fails to produce a meaningful embedding of the reference, or the synthesizer and vocoder cannot reproduce that voice from the embedding.

More often than not it's the second case. The speaker encoder I've provided is excellent at its job, but the synthesizer and vocoder were trained on a limited dataset and suffer from a few limitations (such as not using phonemes, using r=2, and using location-sensitive attention; note that these limit the quality, not the voice cloning ability).

Maybe you could try splitting the voice into five-second pieces, getting the embeddings of all those pieces, and synthesizing with the mean of the embeddings. I don't know if it helps, but maybe you could give it a try.

This is actually what happens internally, up to a normalizing constant. The resulting embedding is the average of the frame-level embeddings (called d-vectors) computed over a sliding window of 1.6 seconds with a stride of 800 ms, i.e. 50% overlap. If you pass a short input, you're going to have a small number of embeddings to average, e.g. between 5 and 7. If you pass a very long input, it's going to average a lot more embeddings, and that will be problematic if there's a lot of variation in the voice across your audio. If there isn't and the voice is fairly consistent, then it's fine; but it's not likely to improve the performance of the model, for the reasons I gave above.
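
In pseudocode it's roughly the following (an illustrative sketch only; the real logic lives in the encoder's embed_utterance, and embed_frames below stands in for the frame-level forward pass):

    import numpy as np

    SR = 16000                      # encoder sample rate
    WINDOW_S, STRIDE_S = 1.6, 0.8   # 1.6 s partial utterances, 800 ms stride (50% overlap)

    def utterance_embedding(wav, embed_frames):
        """Average the d-vectors of 1.6 s sliding windows, then re-normalize."""
        win, hop = int(WINDOW_S * SR), int(STRIDE_S * SR)
        starts = range(0, max(len(wav) - win, 0) + 1, hop)
        partials = np.array([embed_frames(wav[s:s + win]) for s in starts])
        embed = partials.mean(axis=0)
        return embed / np.linalg.norm(embed)   # the normalizing constant mentioned above

    # A 5-second input gives about five windows to average; a 30-minute input gives
    # over two thousand, which is where voice variation becomes a problem.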

This might be slightly improved by using speaker embeddings instead of utterance embeddings; it's on my TODO list. Excerpt from my thesis:

In SV2TTS, the embeddings used to condition the synthesizer at training time are speaker embeddings. We argue that utterance embeddings of the same target utterance make for a more natural choice instead. At inference time, utterance embeddings are also used. While the space of utterance and speaker embeddings is the same, speaker embeddings are not L2-normalized. This difference in domain should be small and have little impact on the synthesizer that uses the embedding, as the authors agreed when we asked them about it. However, they do not mention how many utterance embeddings are used to derive a speaker embedding. One would expect that all utterances available should be used; but with a larger number of utterance embeddings, the average vector (the speaker embedding) will stray further from its normalized version. Furthermore, the authors mention themselves that there are often large variations of tone and pitch within the utterances of the same speaker in the dataset, as they mimic different characters (Jia et al., 2018, Appendix B). Utterances have lower intra-variation, as their scope is limited to a sentence at most. Therefore, the embedding of an utterance is expected to be a more accurate representation of the voice spoken in the utterance than the embedding of the speaker. This holds if the utterance is long enough to produce a meaningful embedding. While the "optimal" duration of reference speech was found to be 5 seconds, the embedding is shown to be already meaningful with only 2 seconds of reference speech (see table 4). We believe that with utterances no shorter than the duration of partial utterances (1.6 s), the utterance embedding should be sufficient for a meaningful capture of the voice, hence we used utterance embeddings for training the synthesizer.
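
A quick numeric illustration of the normalization point (not code from the thesis): averaging L2-normalized utterance embeddings gives a speaker embedding whose norm shrinks as the utterances disagree, so it strays from the unit hypersphere the encoder produces.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_unit(dim=256):        # 256 matches the encoder's embedding size in this repo
        v = rng.normal(size=dim)
        return v / np.linalg.norm(v)

    # Stand-ins for utterance embeddings of one speaker with a lot of intra-speaker variation.
    utterances = [random_unit() for _ in range(64)]
    speaker_embed = np.mean(utterances, axis=0)
    print(np.linalg.norm(speaker_embed))   # well below 1.0 for near-uncorrelated embeddings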

Funnily enough, my "interactions with the authors" were in the YouTube comment section: https://www.youtube.com/watch?v=AkCPHw2m6bY

I'm going to reopen the issue because, well, it's a limitation of the framework (see the original paper: the voice cloning ability is fairly limited), of my models, and of the datasets, so it's going to be a long-lasting issue that other people will want to know about.

KeithYJohnson commented 5 years ago

Does anyone have some five-second sound clips that they ran through this implementation, think turned out really well, and would be willing to share? I've tried five-second clips of several celebrities; words are frequently dropped from the output .wav, and there's a "windy" kind of noise filling the gaps between words.

djricafort commented 4 years ago

I tried using fatchord's WaveRNN model for the vocoder, but I get a size mismatch error. Do I need to modify something first before using a different pretrained model?

RuntimeError: Error(s) in loading state_dict for WaveRNN:
    size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 23]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 17]).
    size mismatch for fc3.weight: copying a param with shape torch.Size([30, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
    size mismatch for fc3.bias: copying a param with shape torch.Size([30]) from checkpoint, the shape in current model is torch.Size([512]).

wotulong commented 4 years ago

I tried using fatchord's WaveRNN model for the vocoder, but I get a size mismatch error. Do I need to modify something first before using a different pretrained model?

RuntimeError: Error(s) in loading state_dict for WaveRNN:
  size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 23]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 17]).
  size mismatch for fc3.weight: copying a param with shape torch.Size([30, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
  size mismatch for fc3.bias: copying a param with shape torch.Size([30]) from checkpoint, the shape in current model is torch.Size([512]).

I think it's because of a different value of the "hop_size" parameter; maybe you can check it.
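
A rough reading of the mismatch, assuming the default hyperparameters each repo ships with (double-check vocoder/hparams.py here and hparams.py in fatchord's repo before relying on this):

    # What this repo's WaveRNN definition expects:
    this_repo = {
        "upsample_factors": (5, 5, 8),   # product = 200-sample hop; last factor 8 -> conv kernel 2*8+1 = 17
        "mode": "RAW",
        "bits": 9,                       # 2**9 = 512 output classes -> fc3 weight of shape (512, 512)
    }

    # What fatchord's pretrained checkpoint was trained with:
    fatchord_checkpoint = {
        "upsample_factors": (5, 5, 11),  # product = 275-sample hop; last factor 11 -> conv kernel 2*11+1 = 23
        "mode": "MOL",                   # mixture-of-logistics head -> 30 outputs -> fc3 weight of shape (30, 512)
    }

    # The checkpoint and the model definition disagree on hop length and output mode, which is
    # exactly what the size-mismatch errors above report; the hparams (and the synthesizer's mel
    # parameters) have to match the checkpoint before it can be loaded.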

shitijkarsolia commented 4 years ago

I tried using fatchord's WaveRNN model for the vocoder, but I get a size mismatch error. Do I need to modify something first before using a different pretrained model?

RuntimeError: Error(s) in loading state_dict for WaveRNN:
  size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 23]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 17]).
  size mismatch for fc3.weight: copying a param with shape torch.Size([30, 512]) from checkpoint, the shape in current model is torch.Size([512, 512]).
  size mismatch for fc3.bias: copying a param with shape torch.Size([30]) from checkpoint, the shape in current model is torch.Size([512]).

From the author's paper: "The vocoder model we use is an open source PyTorch implementation that is based on WaveRNN but presents quite a few different design choices made by GitHub user fatchord. We'll refer to this architecture as the 'alternative WaveRNN'."

The vocoder currently being used is already fatchord's WaveRNN.

Fitz0911 commented 3 years ago

When I upload a voice it doesn't work very well. But when I record the voice it's almost spot on.

RakshithRAcharya commented 3 years ago

Can I take up this issue? Try using pydub for audio enhancements.
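
A starting point could look like this (loudness normalization only, nothing model-specific; filenames are placeholders):

    from pydub import AudioSegment
    from pydub.effects import normalize

    audio = AudioSegment.from_file("reference.wav")
    audio = normalize(audio)                 # bring the peak level up to a consistent target
    audio.export("reference_clean.wav", format="wav")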

Niklss commented 3 years ago

You know, I just decreased the sample rate and it worked pretty well for me:

    sf.write(filename, generated_wav.astype(np.float32), round(synthesizer.sample_rate / 1.1))

At least it reduced the noise =)
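
For context, this seems to be the file-writing step of the repo's demo flow; a slightly fuller version is below. Note that writing the samples at a lower rate plays the audio back slower and deeper, so it masks some vocoder noise at the cost of shifting the pitch.

    import numpy as np
    import soundfile as sf

    def write_slowed(filename, generated_wav, sample_rate, factor=1.1):
        """Write the vocoder output at a reduced sample rate (playback ~factor times slower)."""
        sf.write(filename, generated_wav.astype(np.float32), round(sample_rate / factor))

    # e.g. write_slowed("demo_output.wav", generated_wav, synthesizer.sample_rate)
    # with generated_wav and synthesizer taken from the repo's demo_cli.py / toolbox code.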

zubairahmed-ai commented 3 years ago

You know, I just decreased the sample rate and it worked pretty well for me:

    sf.write(filename, generated_wav.astype(np.float32), round(synthesizer.sample_rate / 1.1))

At least it reduced the noise =)

The voice doesn't sound anything at all like mine :(

simteraplications commented 3 years ago

@KeithYJohnson Which celebrities' voices did you try to use? From my experience this can happen when one or more of a few circumstances apply.

Can you maybe share whose voice you tried to generate and which text you used?