CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Fixing the synthesizer's gaps in spectrograms #53

Closed. TheButlah closed this issue 4 years ago.

TheButlah commented 5 years ago

Hello, and thank you for the great work! One of the limitations that I have noticed is that the synthesizer starts to have long gaps in speech if the input text length is short. @CorentinJ do you have any ideas why this is or how I could fix it? I'll also probably ask on Rayhane's repo if I can reproduce the issue on his synthesizer.

Am I correct in assuming that the issue is caused by the stop prediction in Taco2 not having a high enough activation, which results in long spectrograms?

CorentinJ commented 5 years ago

I doubt it's because of the stop prediction. The stop prediction only occurs after the spectrogram is generated. Yes, this is an issue with the synthesizer. It would have to be replaced by a better one, such as fatchord's (which would also eliminate other problems), but I just don't have the time to do it.

TheButlah commented 5 years ago

I was referring to the stop prediction in Tacotron 2 (the synthesizer, not the vocoder). I wasn't aware that stop prediction was used in WaveRNN, since it can just stop outputting when it runs out of spectrogram frames to condition on.

What do you mean by "the stop prediction only occurs after the spectrogram is generated"?

CorentinJ commented 5 years ago

I wasn't talking about the vocoder. Tacotron's decoder being autoregressive, the first stop token above the threshold value will be predicted when the spectrogram is done being generated, by definition. Thus it has no impact on previous frames; in fact, its output is not fed back to the model, IIRC. I don't see how the stop token could be the issue.

TheButlah commented 5 years ago

Ah yes I see what you mean. That makes sense, I agree that it has to be another issue.

One idea that I had was annealing the level of teacher forcing that takes place during training. I suspect that because the synthesizer is autoregressive, any errors (deviations from the true mel frame) compound on each other as they get fed into the predictions for the next mel frame. Teacher forcing accelerates training convergence because it removes the ability of these errors to propagate, but I would expect that the network never learns to account for its own errors because it was always fed real data during training. Hence annealing the probability that a spectrogram frame is teacher-forced might get the best of both worlds.
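Roughly what I have in mind (just a sketch; the schedule values and names below are made up for illustration, not taken from Rayhane's code):

```python
import random

def teacher_forcing_prob(step, start=1.0, floor=0.2, decay_steps=50_000):
    """Linearly anneal the probability of feeding the ground-truth frame.

    Starts at full teacher forcing and decays to a floor so the decoder
    also learns to recover from its own prediction errors.
    """
    frac = min(step / decay_steps, 1.0)
    return start + (floor - start) * frac

def pick_decoder_input(target_frame, predicted_frame, step):
    """Scheduled sampling: occasionally feed the decoder its own last output."""
    if random.random() < teacher_forcing_prob(step):
        return target_frame   # teacher forcing: ground-truth previous frame
    return predicted_frame    # model's own previous prediction
```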

What do you think?

CorentinJ commented 5 years ago

I think the issue is elsewhere; most likely it's a bug on my end or in Rayhane's work. I've talked with someone else whose work also stems from Rayhane's, and he's got the same problem. Meanwhile, other implementations of Tacotron/Tacotron 2 elsewhere (Mozilla, NVIDIA, fatchord) do not have that issue.

TheButlah commented 5 years ago

Where is fatchord's implementation? I don't see it on his github

CorentinJ commented 5 years ago

It's included with his WaveRNN, the same I use: https://github.com/fatchord/WaveRNN

TheButlah commented 5 years ago

Oh, I thought that was just a fork of keithito's. Regardless, I'll look into using a different implementation and/or try to figure out what's wrong with Rayhane's. Thanks for the help!

TheButlah commented 5 years ago

For what it's worth, I've been working extensively on @fatchord's repo, adding improvements to it. I've trained models on it and no longer experience the gaps in the audio we have observed using Rayhane's repo. However, the synthesizer is still somewhat sensitive to sentence length, particularly long sentences. Sentences of four words or more are fine, but once sentences get really long, you get the same stammering you can observe in @CorentinJ's repo. So yes, switching to @fatchord's synthesizer would probably be a big improvement, but you would also have to add multi-speaker training to it, as right now it only supports a single speaker.

I can also confirm that it's an issue with the attention mechanism, not the stop token or anything else. @fatchord's repo just stops generating when the spectrogram frame is below a certain audio threshold; no stop tokens involved. You can also look at the attention graph and clearly see that the failure cases are due to the attention getting stuck on a particular time step and never progressing.
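For reference, the stopping rule amounts to something like this (the threshold and frame count are illustrative, not fatchord's exact values):

```python
import numpy as np

def generation_finished(mel_frames, silence_threshold=0.01, min_silent_frames=10):
    """Stop decoding once the last few generated mel frames are near-silent.

    Assumes mel_frames has shape (T, n_mels) and is normalized so that
    silence sits near zero.
    """
    if len(mel_frames) < min_silent_frames:
        return False
    tail = np.asarray(mel_frames[-min_silent_frames:])
    return bool((tail < silence_threshold).all())
```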

TheButlah commented 5 years ago

@CorentinJ actually, going back through my synthesized recordings from @Rayhane-mamah's repo, I haven't been able to observe any of the gaps I observe in your repo. I think it's actually unique to this repository.

TheButlah commented 5 years ago

200K-logs-eval.zip (Rayhane Taco2, Griffin-Lim), Archive.zip (Fatchord Taco1, Fatchord WaveRNN). Neither @fatchord's nor @Rayhane-mamah's repo exhibits gaps in the middle of spectrograms the way this repo does.

They both exhibit failure in the case of especially long sentences, which is expected. Taco 2 appears to fare much better in this case.

CorentinJ commented 5 years ago

Oh, I'm well aware the issue is present in this repo only. It's something I must have introduced while modifying Rayhane's tacotron. Considering I hate working with that codebase, I have in mind to switch to fatchord's tacotron and try to fix this bug at the same time. But as I said, I really don't have the time to work on that now, as I have work and university projects that take priority. If someone wants to work on that in a separate branch, I can definitely look it over from time to time.

As for long sentences, it's just a matter of which attention mechanism is implemented. By splitting the input on punctuation, you're fine with most sentences anyway.
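Something as simple as this does the job (the regex is only an illustration, not code from this repo):

```python
import re

def split_on_punctuation(text):
    """Split long input text at punctuation so each chunk stays short enough
    for the attention to handle reliably."""
    chunks = re.split(r"(?<=[.!?;:,])\s+", text.strip())
    return [c for c in chunks if c]

print(split_on_punctuation("I have something important to tell you. Please listen, it matters!"))
# ['I have something important to tell you.', 'Please listen,', 'it matters!']
```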

TheButlah commented 5 years ago

Makes sense! I agree; I like fatchord's synthesizer more too, as it's easier to work with, although I think it would perform better qualitatively if it were Tacotron 2 instead of Taco1. Maybe someone will fork it at some point to upgrade it.

ghost commented 4 years ago

Thank you for referencing the issue @macriluke. I am going to reopen this issue since I have some interest in fixing it. Another possibility is that it goes away in #370 when @dathudeptrai modifies the tensorflowTTS/tacotron2 code to work with this repo.

ghost commented 4 years ago

I found a very low-tech fix for this, which is to always run "trim_long_silences" on the vocoder output. The function uses webrtcvad and is found in encoder/audio.py. Will submit a PR when I get a chance.
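In the meantime, the call looks roughly like this (assuming the vocoder output is a float waveform at the encoder's 16 kHz sampling rate, so trim_long_silences can take it directly):

```python
from encoder import audio as encoder_audio
from vocoder import inference as vocoder

def vocode_and_trim(spec):
    """Run the vocoder on a synthesized mel spectrogram, then strip long
    silences from the waveform using the webrtcvad-based helper."""
    wav = vocoder.infer_waveform(spec)
    return encoder_audio.trim_long_silences(wav)
```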

Choons commented 4 years ago

An even lower-tech solution I use: insert "scat" words/syllables at the beginning and end of the sentence, and somehow that fixes the gaps. For instance, the sentence "I have something important to tell you" gaps terribly on its own, but "skee diddly bop I have something important to tell you action jackson" renders perfectly. Then I just trim the "scat" off in Audacity. Perhaps that can provide a hint as to what is wrong in the code.
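In code terms, the hack is nothing more than this (the filler strings are just the example above; the rendered filler still gets trimmed off by hand):

```python
def wrap_with_filler(text, prefix="skee diddly bop", suffix="action jackson"):
    """Pad a short sentence with throwaway filler so it doesn't gap."""
    return f"{prefix} {text} {suffix}"

print(wrap_with_filler("I have something important to tell you"))
```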

ghost commented 4 years ago

I can confirm that the issue of gaps in spectrograms will be resolved if we merge fatchord's tacotron1 in #472. The presence of gaps depends on the training data: I get no gaps when training with VCTK, and plenty of gaps with LibriTTS.

ghost commented 4 years ago

The presence of gaps depends on the training data. I get no gaps when training with VCTK, and plenty of gaps with LibriTTS.

As mentioned in https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/472#issuecomment-685943601, the gaps in LibriSpeech/TTS can be resolved by using voice activity detection to trim silences. See #501 for the process.

ghost commented 4 years ago

Would like to highlight this again:

The presence of gaps depends on the training data.

Trained a new synthesizer with a curated dataset, in #538 (tensorflow) and https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/472#issuecomment-695206377 (pytorch). This fixes the issue with gaps.

Choons commented 4 years ago

Wow, bluefish, you have done some incredible work on this! Can you clarify: do we need to add BOTH the code from #538 and #472, or do we choose just one of them? i.e., a tensorflow solution versus a pytorch solution.

And if it's a choice between the two solutions, which one do you recommend as best performing?

ghost commented 4 years ago

Can you clarify: do we need to add BOTH the code from #538 and #472, or do we choose just one of them? i.e., a tensorflow solution versus a pytorch solution.

Most users today will want #538 because we haven't formally switched to the pytorch synthesizer. Once #472 is merged we will update the pretrained models wiki page to point to pytorch.

And if it's a choice between the two solutions, which one do you recommend as best performing?

They're about the same in performance. They have different quirks since the tacotron is different (tacotron 1 vs 2). In tensorflow (Rayhane-taco2), the stop token prediction sometimes fails and it synthesizes a huge silence until the decoder limit is reached. In pytorch (fatchord-taco1), the attention may get stuck on a certain character, making inference quit suddenly. Pick your poison. The attention mechanism needs to be improved.

Choons commented 4 years ago

Understood. I'm glad you have taken on improving this voice project. I have tried to use other voice cloning implementations, but could never get them working as well as this one, even with the gap problem. I will experiment with both of your solutions and report back in this post how well they work for me.

ghost commented 4 years ago

Feedback is appreciated, @Choons. It's always helpful to hear from those who are using the software and models.