NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Tacotron2 generates new mel spectrograms and alignments for the same input text sequence during inference. #553

Open nav99een opened 2 years ago

nav99een commented 2 years ago

Hi everyone, I am trying to generate speech from text using Tacotron2 fine-tuned on my custom dataset. During inference, as you can see from the image attached below, I get mel spectrograms with different shapes and alignments for the same input sequence.

[attached image: mel spectrograms and alignments differing across inference runs]

I tried removing the dropout layer in the prenet, but then the model generates no alignment at all. I tried almost all the other suggested approaches as well, but they failed. I also tried performing inference under torch.no_grad(). I don't want to generate new alignments every time. Please help me out. Thanks in advance!
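For reference, this is roughly how I run inference (a sketch following the repo's inference notebook; the checkpoint path and input text are placeholders for my setup):

```python
import numpy as np
import torch

# Repo-local modules from NVIDIA/tacotron2.
from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()
model = load_model(hparams)
model.load_state_dict(
    torch.load("tacotron2_finetuned.pt")["state_dict"])  # placeholder path
model.cuda().eval()

sequence = np.array(
    text_to_sequence("Hello world.", ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

with torch.no_grad():
    # Two runs on the identical input sequence.
    _, mel_a, _, align_a = model.inference(sequence)
    _, mel_b, _, align_b = model.inference(sequence)

# Shapes (and values) differ across runs, even under no_grad.
print(mel_a.shape, mel_b.shape)
```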

PiotrDabkowski commented 2 years ago

This is due to the prenet dropout: the network becomes dependent on the prenet dropout and does not generalize to it being off. Tacotron2 is hence not deterministic by design (at least as implemented in this repo).
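Concretely, model.eval() does not help here because the prenet applies functional dropout with training=True hard-coded, so it stays active at inference. A minimal sketch of that pattern (simplified, not the repo's exact code):

```python
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    """Tacotron 2 style prenet whose dropout is active at
    inference as well as at training time."""

    def __init__(self, in_dim, sizes=(256, 256)):
        super().__init__()
        in_sizes = [in_dim] + list(sizes[:-1])
        self.layers = nn.ModuleList(
            nn.Linear(i, o, bias=False)
            for i, o in zip(in_sizes, sizes))

    def forward(self, x):
        for linear in self.layers:
            # training=True hard-codes the dropout on, so
            # model.eval() has no effect on this layer.
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
```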

lirus7 commented 2 years ago

@PiotrDabkowski, is there any specific reason for the prenet dropout being different from a regular dropout (i.e., staying on at inference)?

PiotrDabkowski commented 2 years ago

It disrupts the feedback loop. In a mel spectrogram, consecutive frames are highly similar, so the decoder can do quite well on the loss function just by predicting the previous frame. If you remove the dropout, this can lead to positive feedback. I guess the solution here would be to randomly turn off the prenet dropout during training, as sketched below.
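A rough sketch of that idea (my own suggestion, not code from this repo; keep_dropout_prob is a made-up hyperparameter):

```python
import random

import torch.nn as nn
import torch.nn.functional as F

class PrenetRandomDropout(nn.Module):
    """Prenet variant that occasionally skips its always-on dropout
    so the decoder also learns to cope with undropped features."""

    def __init__(self, in_dim, sizes=(256, 256), p=0.5,
                 keep_dropout_prob=0.9):
        super().__init__()
        in_sizes = [in_dim] + list(sizes[:-1])
        self.layers = nn.ModuleList(
            nn.Linear(i, o, bias=False)
            for i, o in zip(in_sizes, sizes))
        self.p = p
        self.keep_dropout_prob = keep_dropout_prob

    def forward(self, x):
        # Draw once per forward pass: usually keep the dropout on,
        # occasionally turn it off entirely.
        use_dropout = random.random() < self.keep_dropout_prob
        for linear in self.layers:
            x = F.dropout(F.relu(linear(x)), p=self.p,
                          training=use_dropout)
        return x
```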

Also, you can set a manual seed during inference to get the same results.
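Something like this, assuming the same model and sequence as above (on GPU, bit-exact repeats may additionally require deterministic cuDNN settings):

```python
import torch

torch.manual_seed(1234)  # any fixed value; also seeds CUDA devices
with torch.no_grad():
    _, mel_a, _, align_a = model.inference(sequence)

torch.manual_seed(1234)  # reset to the same seed before the second run
with torch.no_grad():
    _, mel_b, _, align_b = model.inference(sequence)

# Same seed -> same dropout masks -> matching spectrograms/alignments.
print(torch.allclose(mel_a, mel_b))
```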

Sytronik commented 2 years ago

Here are useful insights about dropout in PreNet: https://github.com/mozilla/TTS/issues/50