bshall / Tacotron

A PyTorch implementation of Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
https://bshall.github.io/Tacotron/
MIT License

Attention convergence speed #1

Open artificertxj1 opened 3 years ago

artificertxj1 commented 3 years ago

Hi @bshall, I have a question about attention layer learning speed. The authors report a faster alignment learning speed by using DCA. Is that also the case observed in your training?

bshall commented 3 years ago

Hi @artificertxj1,

I haven't directly compared against location-sensitive attention, so I can't give you a complete answer, but DCA does seem to learn a good alignment pretty quickly. In my experiments you usually get good alignments by around 2k steps. There's a bit of instability (probably because I can't use batch sizes of 256), but it goes away after a bit more training. Based on some comments I've seen in other Tacotron implementations, that seems to be pretty quick.

prattcmp commented 3 years ago

@bshall How much time in minutes did this take you?

bshall commented 3 years ago

Hi @prattcmp, on my GeForce RTX 2080 I get roughly 1 step per second, so you should start seeing alignment after about 30-60 minutes. Obviously, training to full convergence (250k steps for the pretrained model) takes much longer, but you should get intelligible speech pretty soon after the alignment is learned, so you can already test out the model at that point.

Just FYI, I found that the model takes a while to learn to pause at full-stops and commas for the appropriate amount of time. So if you are trying to synthesize speech early in the training it might be useful to add:

tacotron.decoder_cell.prenet.fixed = True

when you load the model. This helps stop the model from pausing indefinitely for some utterances.
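For context, here's a minimal sketch of where that line would go. The checkpoint path, state-dict key, and constructor arguments are placeholders rather than the repo's exact API:

```python
import torch

from tacotron import Tacotron  # module/class names assumed from the repo layout

# Hypothetical checkpoint path and state-dict key; adjust for your own run.
checkpoint = torch.load("checkpoints/tacotron-50k.pt", map_location="cpu")

tacotron = Tacotron()  # constructor arguments omitted for brevity
tacotron.load_state_dict(checkpoint["model"])
tacotron.eval()

# Fix the prenet as described above so early checkpoints don't pause
# indefinitely on some utterances when synthesizing.
tacotron.decoder_cell.prenet.fixed = True
```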

prattcmp commented 3 years ago

Interestingly, I get 2 s/it (0.5 it/s) on a Tesla V100 at batch sizes of both 64 and 128. GPU utilization is low at ~20-40%. I've also modified the disk I/O limits with no change, so it's not a disk throughput issue. Any idea what's wrong?

bshall commented 3 years ago

Hi @prattcmp,

Yeah, the GPU utilization of the model isn't great. The culprit is the decoder cell. Since it's basically a customized RNN, you have to do the looping in Python and can't take advantage of the fast CUDA RNN implementations.
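To illustrate the pattern (a toy sketch with made-up sizes, not the actual decoder code): a step-by-step LSTMCell loop launches a handful of small kernels per time step, so the GPU spends most of its time idle between launches.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in modules, just to show the looping pattern.
prenet = nn.Sequential(nn.Linear(80, 128), nn.ReLU()).to(device)
cell = nn.LSTMCell(input_size=128, hidden_size=256).to(device)

mels = torch.randn(32, 800, 80, device=device)  # (batch, frames, mel bins)
h = torch.zeros(32, 256, device=device)
c = torch.zeros(32, 256, device=device)

outputs = []
for t in range(mels.size(1)):    # Python-level loop over time steps
    x = prenet(mels[:, t])       # one small prenet call per frame
    h, c = cell(x, (h, c))       # one small LSTM step per frame
    outputs.append(h)
outputs = torch.stack(outputs, dim=1)  # (batch, frames, hidden)
```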

There are a couple of things you could do to increase GPU utilization if you wanted to experiment. The first is to replace these two LSTMCells with torch.nn.LSTM layers and move them out of the loop. The downside is that torch.nn.LSTM doesn't implement zoneout, so you'd have to see if training works well without it. It seems that it'd be okay based on NVIDIA's implementation of Tacotron 2. They write "We replace zoneout with dropout in the decoder rnn and remove it entirely from the encoder rnn." here.
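For reference, zoneout just randomly carries over some units of the previous hidden/cell state during training. A minimal sketch of wrapping an LSTMCell with it (not necessarily this repo's exact implementation):

```python
import torch
import torch.nn as nn

def zoneout(prev, current, p=0.1, training=True):
    # During training, randomly keep units from the previous state;
    # at eval time, use the expected value of that random mask instead.
    if training:
        mask = torch.rand_like(prev) < p
        return torch.where(mask, prev, current)
    return p * prev + (1 - p) * current

cell = nn.LSTMCell(256, 256)
x = torch.randn(8, 256)
h = torch.zeros(8, 256)
c = torch.zeros(8, 256)

new_h, new_c = cell(x, (h, c))
h = zoneout(h, new_h, p=0.1, training=True)
c = zoneout(c, new_c, p=0.1, training=True)
```

Switching to torch.nn.LSTM means giving this up (or swapping it for dropout, as NVIDIA does), which is the trade-off mentioned above.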

Similarly, you can move the decoder prenet out of the loop and apply it to the input Mel-spectrograms in parallel. This would increase memory usage a bit but should help with the GPU utilization.
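As a rough sketch (stand-in prenet and shapes, no dropout), the change is just applying the prenet to the whole teacher-forced sequence up front instead of once per step:

```python
import torch
import torch.nn as nn

prenet = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())
mels = torch.randn(32, 800, 80)  # (batch, frames, mel bins), teacher-forced inputs

# Per-step version: one small prenet call per frame inside the decode loop.
stepwise = torch.stack([prenet(mels[:, t]) for t in range(mels.size(1))], dim=1)

# Batched version: one call over all frames; nn.Linear broadcasts over the
# time dimension, so you just index prenet_out[:, t] inside the loop instead.
prenet_out = prenet(mels)

assert torch.allclose(stepwise, prenet_out, atol=1e-5)
```

The cost is holding the prenet outputs for every frame at once, which is the extra memory usage mentioned above.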