as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Experiments / Discussion #7

Closed m-toman closed 2 years ago

m-toman commented 4 years ago

Hi,

great work! I saw your autoregression branch and wanted to ask if it worked out. I've always wondered how much of an effect the autoregression really has (apart from the formal aspect that the model then is an autoregressive, generative model P(x_i | x_<i)), considering there are RNNs in the network anyway.

Also, wanted to point you to this paper in case you don't know it yet: https://tencent-ailab.github.io/durian/

Similarly to older models like https://github.com/CSTR-Edinburgh/merlin, they append an additional value to the expanded vectors that indicates the position within the current input symbol. I wonder if that would help a bit with prosody.

m-toman commented 3 years ago

I've struggled for a week now with a burp sound that suddenly started appearing at the end of many sentences. Honestly, I still have no idea why; it seems to happen only sometimes. I now force the end to silence... My assumption was that the final punctuation symbol usually gets aligned to silence, but if the silence trimming trims too aggressively, the model has to align that symbol with voice. So I added a little bit of silence myself (it's actually quite common in older systems to have a silence symbol at the beginning and end and to prepend and append artificial silence). But that didn't help. Now I force it to silence after synthesis, but I have no idea where it's coming from...
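For what it's worth, the post-synthesis workaround just overwrites the trailing frames of the predicted mel with the silence value. A minimal sketch; the frame count and the silence level (-11.5 vs. -4 depending on the log/normalization setup, as discussed below) are assumptions, not code from the repo:

    import numpy as np

    def force_trailing_silence(mel: np.ndarray, n_frames: int = 10,
                               silence_value: float = -11.5) -> np.ndarray:
        """Overwrite the last n_frames of a (n_mels, T) log-mel with silence.

        silence_value depends on the mel/log setup (e.g. -11.5 or -4).
        """
        mel = mel.copy()
        mel[:, -n_frames:] = silence_value
        return mel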

Anyway, did you ever get the validation loss to make sense? For me it still gradually increases, although at probably 1/10th of the rate at which the training loss decreases. I tried really small model sizes and more dropout, but it persists. Even the multispeaker model I'm currently training on 100k sentences does it, though admittedly less pronounced.

cschaefer26 commented 3 years ago

Did you check the padding? I had a similar problem once and found that the padding values were at zero (and not at -11.5, which is silence in the log space). Validation loss in this case doesn't mean anything imo, since the model has too much freedom in predicting intonation, pitch etc. without teacher forcing. I don't even look at it (for durations it still makes sense though imo).
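As a sketch of what that padding fix amounts to (the function and values here are illustrative, not the repo's dataset code): pad the mel along the time axis with the log-space silence value instead of 0.0, so the padded region looks like real silence.

    import numpy as np

    def pad_mel(mel: np.ndarray, max_len: int, pad_value: float = -11.5) -> np.ndarray:
        """Pad a (n_mels, T) mel spectrogram along time to max_len frames."""
        n_mels, t = mel.shape
        padded = np.full((n_mels, max_len), pad_value, dtype=mel.dtype)
        padded[:, :t] = mel   # keep the real frames, fill the rest with silence
        return padded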

m-toman commented 3 years ago

Hmm, good idea, thanks. You mean the mel padding here, right? https://github.com/as-ideas/ForwardTacotron/blob/d5c5d889b25617119a3191b2b440b7d24edf6827/utils/dataset.py#L207 But I have to check how my mel representation differs from yours.

The loss masking should work now, so it should mostly be about the context from the LSTMs and convolutions "leaking" into the actual speech. Strangely, I never had any issues until recently, but I've checked all commits a dozen times: no model/dataset changes. I've retrained at different commits but the answer was never clear. Sometimes there were slight issues at the end of the sentence; 3 of 4 trainings on the original commit were fine, but one had slight issues. Sometimes it occurs at some point during training and then is gone again, sometimes it gradually worsens.
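Just to spell out what I mean by loss masking: a sketch of an L1 mel loss that averages only over real (non-padded) frames, assuming (B, T, n_mels) tensors and per-item mel lengths. Names are illustrative, not the repo's actual loss code.

    import torch
    import torch.nn.functional as F

    def masked_l1(pred, target, mel_lens):
        """pred/target: (B, T, n_mels), mel_lens: (B,). Ignore padded frames in the loss."""
        mask = torch.arange(pred.size(1), device=pred.device)[None, :] < mel_lens[:, None]
        mask = mask.unsqueeze(-1).to(pred.dtype)                  # (B, T, 1)
        loss = F.l1_loss(pred, target, reduction='none') * mask
        return loss.sum() / (mask.sum() * pred.size(-1))          # mean over real elements only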

m-toman commented 3 years ago

OK, I checked it; for me silence is at -4, and I modified the padding accordingly. But then I noticed I made a mistake in my post above: the convs/RNNs apply to the input text, not to the mel spectra. So with the loss masking fixed this should not have any effect, or am I missing something?

cschaefer26 commented 3 years ago

Any improvements with the padding? I agree that the glitch can't come from the loss directly because of the masking, but I found something else: in my case the length regulator is slightly over-expanding, i.e. attaching some extra repeats of the last input vecs due to the padding within batches during training. I added this to set the repeated inputs to zero: https://github.com/as-ideas/ForwardTacotron/blob/611dd815a5390302daad2cfb0684e59dafa6866f/models/forward_tacotron.py#L171 Imo this could be one of the causes of a 'leak' into the RNNs.
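The idea boils down to masking the expanded sequence past each item's true total duration so padded repeats can't feed the downstream RNNs. A minimal sketch (not the exact code behind the link above):

    import torch

    def zero_overexpanded(x_expanded: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        """x_expanded: (B, T_max, C) length-regulator output, durations: (B, N) per-symbol durations."""
        total_lens = durations.sum(dim=1)                            # (B,) true expanded lengths
        t_max = x_expanded.size(1)
        mask = torch.arange(t_max, device=x_expanded.device)[None, :] < total_lens[:, None]
        return x_expanded * mask.unsqueeze(-1).to(x_expanded.dtype)  # zero frames past the true length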

cschaefer26 commented 3 years ago

Oh, btw, I am now also adding a pitch module and the first results seem very promising. I might add an 'energy' vector as in FastSpeech 2 as well, although in their ablation study the gain was pretty small. I was wondering whether, instead of calculating F0, one could just use the mean of the frequency distribution along the mel axis. Imo this should be pretty similar and wouldn't require an external lib for the calculation.
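Roughly what I have in mind is a spectral-centroid-style contour over the mel bins; a sketch assuming a (n_mels, T) magnitude mel (with bin indices standing in for frequencies), just to illustrate the idea:

    import numpy as np

    def mel_centroid(mel: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        """Per-frame mean of the frequency distribution along the mel axis.

        mel: (n_mels, T) magnitudes (exponentiate first if you store log-mels).
        Returns a (T,) contour of weighted mel-bin indices as a crude F0 proxy.
        """
        bins = np.arange(mel.shape[0])[:, None]                  # (n_mels, 1)
        return (bins * mel).sum(axis=0) / (mel.sum(axis=0) + eps)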

m-toman commented 3 years ago

I'll check the LR thing above. I'm not sure now if I used your solution or wrote something myself, because I added the positional index (didn't see a huge difference though, if at all). Strange that I never noticed it for the first months and then it suddenly appeared. Resetting to the commit from my last OK training really did reproduce a non-burpy version... So I checked and checked again, but there were no modifications to the model, dataloader or training procedure. But run 4 then also produced burps. It feels semi-random. For now I just overwrite the last symbol with silence (in combination with forcing the last symbol to be punctuation and making sure there is silence in the audio at the end), and that fixes the symptoms but it still bugs me ;).

I'll post a sample when back home.

For pitch I used the approach from FastPitch (the repo is out there), which works fine with the proposed mean pitch per symbol, but I am thinking about a more complex parameterization that also allows controlling some delta features (perhaps just categories of falling/rising/flat pitch or so). EDIT: here it is https://drive.google.com/file/d/1p9dJjLzJ0p0R3v0XLz-Z6hr1Xwhj-5Gd/view?usp=sharing after changing the mel padding
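For reference, the FastPitch-style pitch target is just the frame-level F0 averaged over each symbol's frames using the durations; a minimal sketch (names are mine, not taken from either repo):

    import numpy as np

    def average_pitch_per_symbol(f0: np.ndarray, durations: np.ndarray) -> np.ndarray:
        """f0: (T,) frame-level pitch (0 for unvoiced), durations: (N,) frames per symbol."""
        out = np.zeros(len(durations), dtype=np.float32)
        start = 0
        for i, d in enumerate(durations):
            seg = f0[start:start + int(d)]
            voiced = seg[seg > 0]                    # ignore unvoiced frames in the average
            out[i] = voiced.mean() if len(voiced) else 0.0
            start += int(d)
        return out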

m-toman commented 3 years ago

So far it seems to be better: https://drive.google.com/file/d/1LkusT0VO8cKw3nI5jJ1GBmDLCu4jP_vv/view?usp=sharing That's just 7k steps; it often started to happen after 60k steps. Generally I feel there is not much improvement after 10k training steps anyway, which is quite cool considering how long Taco takes to get the attention right (if at all).

cschaefer26 commented 3 years ago

That's only 7k steps? Impressive. I found no real improvements with a pitch module, although it's fun to play around with the pitch. That may be a limitation of our dataset though (8 hrs only). I also tested the NVIDIA FastPitch implementation; it wasn't better. To be honest, I've thought about looking more into the data, e.g. cleaning inconsistent pronunciations with an STT model.

m-toman commented 3 years ago

Yeah, I didn't see an improvement either, but we need it for implementing SSML tags, and it definitely works better than synthesis-resynthesis methods, which introduce too much noise. And yeah, one could probably think about some generative model to produce/sample interesting pitch contours.

My results were a bit strange. With both padding modifications above I started to get weird pauses/prosody. I trained 3 times and checked at different stages of training to verify it's not some random effect.

Integrating DiffWave might be interesting as well.

cschaefer26 commented 3 years ago

No improvements, meaning with the pitch? Generally I have the same problems as you have: it seems that trainings can vary to a large degree; probably there is some randomness in what the model really fits...

m-toman commented 3 years ago

Yeah, it sounds pretty much the same with and without the pitch model. Still, it's much more robust than most Taco 2 implementations. Quite a few smaller datasets that did not work at all before (in the sense that the output was cut off, had broken words, etc.) now work well. Sometimes the prosody is not as natural (rarely), but that's better than generating garbage.

I will try multispeaker again soon, but it seemed to me as if it averages a bit too much.

cschaefer26 commented 3 years ago

Sounds good. I am regularly comparing the model to other architectures, and I find that the LSTM produces a bit more fluent output but tends to mumble more compared to a transformer-based model à la FastSpeech. Multispeaker didn't really add any benefit so far, but that could be due to a lack of data.

cschaefer26 commented 3 years ago

Just a quick update: I merged all the pitch stuff to master; I really found a benefit from using the pitch condition. I see the same as you, after 10-20k steps the model is almost done. Quick question: did you see any improvement with positional indexing? I found some generalization problems on smaller datasets, where the voice mumbles quite a bit, weirdly especially for shorter sentences. Also, I tried adding an STT model to the training with a CTC loss, in the hope that the model is forced to be clearer; the first results actually seem quite promising.
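Roughly, the idea is a small recognition head on top of the decoder features whose CTC loss against the input text is added to the mel loss. A sketch under those assumptions (the head, its placement and the loss weighting are illustrative, not the actual training code):

    import torch
    import torch.nn as nn

    class CTCHead(nn.Module):
        """Auxiliary speech-recognition head: decoder features -> character logits."""
        def __init__(self, feat_dim: int, n_symbols: int):
            super().__init__()
            self.proj = nn.Linear(feat_dim, n_symbols)     # symbol 0 acts as the CTC blank
            self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

        def forward(self, feats, feat_lens, targets, target_lens):
            # feats: (B, T, feat_dim) decoder features, targets: (B, N) input symbol ids
            log_probs = self.proj(feats).log_softmax(dim=-1).transpose(0, 1)  # (T, B, n_symbols)
            return self.ctc(log_probs, targets, feat_lens, target_lens)

    # the total loss might then look like: loss = mel_loss + dur_loss + 0.1 * ctc_loss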

m-toman commented 3 years ago

No new experiments from my side atm. I am not fully sure about the positional index; I added it before all the other stuff and kept it in, although I didn't hear a major difference.

    def expand(self, x, durations):
        # Repeat each encoder output vector according to its predicted duration.
        idx = self.build_index(durations, x)
        y = torch.gather(x, 1, idx)
        if self.posidx:
            # Optionally append a positional index: for every input symbol,
            # a ramp from 0 to 1 across the frames it was expanded to.
            duration_sums = durations.sum(dim=1)
            max_target_len = duration_sums.max().int().item()
            batch_size = durations.shape[0]
            position_encodings = torch.zeros(
                batch_size, max_target_len, dtype=torch.float32).to(durations.device)
            for obj_idx in range(batch_size):
                positions = torch.cat([torch.linspace(0, 1, steps=int(dur), dtype=torch.float32)
                                       for dur in durations[obj_idx]]).to(durations.device)
                position_encodings[obj_idx, :positions.shape[0]] = positions
            # Concatenate the position ramp as one extra feature channel.
            y = torch.cat([y, position_encodings.unsqueeze(dim=2)], dim=2)
        return y

I've checked out the Glow-TTS samples in Mozilla TTS, but they didn't really seem convincing. My main issue atm is that the prosody could be better.

cschaefer26 commented 3 years ago

Intuitively I wouldn't expect a world of difference from positional indexing with LSTMs though. As for prosody, did you try using a smaller, separate duration predictor as I do? I found that the model hugely overfits otherwise (e.g. when duration prediction is done after the encoder). Also for prosody I have an idea I want to try out soon: similar to the pitch frequency, I want to leak some duration stats from the target, e.g. a running mean of durations, to condition the duration predictor on. My hope is that the model would pick up some rhythm / prosody swings in longer sentences (similar to the pitch swings).

m-toman commented 3 years ago

I tried running the duration predictor before the CBHG once, but the results were a bit strange. I will try again. I also wondered whether it's really a good idea to train it together with the rest, or whether it should rather be separate (or at least stop being trained at some earlier point).

So you already added some "global" pitch stats to the model? I have to check your code. Instead of just expanding states according to the duration value, one could probably also feed the duration value itself to the mel predictor; I don't know if that would help.

cschaefer26 commented 3 years ago

My best results were with a mere 64-dim GRU for duration prediction, with lots of dropout before it. Yes, it probably makes sense to completely separate it (to be able to compare results at different stages, at the least). Yeah, I reimplemented the FastPitch version (with minor differences) with pitch averaged over chars.
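A minimal sketch of such a small, separate duration predictor, assuming per-symbol encoder outputs of shape (B, N, C); the sizes and dropout are just the ballpark mentioned above, not the repo's code:

    import torch
    import torch.nn as nn

    class SmallDurationPredictor(nn.Module):
        """Tiny GRU duration predictor, kept deliberately small to limit overfitting."""
        def __init__(self, in_dim: int, hidden: int = 64, dropout: float = 0.5):
            super().__init__()
            self.dropout = nn.Dropout(dropout)
            self.gru = nn.GRU(in_dim, hidden, batch_first=True)
            self.lin = nn.Linear(hidden, 1)

        def forward(self, x):                  # x: (B, N, in_dim) per-symbol features
            x, _ = self.gru(self.dropout(x))
            return self.lin(x).squeeze(-1)     # (B, N) predicted durations in frames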

m-toman commented 3 years ago

I'll try the different duration model as soon as I have the capacity. I also wanted to try out https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
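For reference, the SWA utilities from that post boil down to wrapping the model in an AveragedModel and updating the average late in training; a rough sketch with placeholder names (model, optimizer, train_step, train_loader and the schedule values are all assumptions):

    import torch
    from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

    swa_model = AveragedModel(model)               # keeps a running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=1e-4)  # constant SWA learning rate
    swa_start = 50_000                             # step at which the averaging kicks in

    for step, batch in enumerate(train_loader):
        train_step(model, optimizer, batch)        # the usual forward/backward/optimizer step
        if step > swa_start:
            swa_model.update_parameters(model)
            swa_scheduler.step()

    # only needed if the model contains batch norm layers (e.g. in the CBHG)
    update_bn(train_loader, swa_model)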