Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Synthesis worse than evaluation. #206

Open splitice opened 6 years ago

splitice commented 6 years ago

I'm currently fairly early into training, only 14k steps in, and already, impressively, the step-evaluation wavs can be understood.

However, when the same text is provided to synthesis.py (via hparams), it comes out garbled. What could be wrong?

The hparams are largely unmodified (the only changes being batch_size and its dependent vars). Both wavs are attached, with wav-1-linear.wav being the output of synthesis: both-wavs.zip

Rayhane-mamah commented 6 years ago

Hi, thanks for reaching out.

Evaluation uses teacher forcing (we feed the correct previous outputs to the decoder at each prediction step). We do that to monitor overfitting during evaluation, as it only judges the model on its ability to make correct future predictions from 100% correct past predictions; i.e. it only checks the conditional property of our model.

At pure synthesis time, we do not use any teacher forcing and the model has to rely on its own previously made predictions. As is typical for encoder-decoder models, good predictions require that the model has learned good alignments, or else the output is pure noise. I am assuming your model hasn't aligned yet. You should check our wiki for more information about attention in general; it's a little outdated but it covers the essentials.
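To make the difference concrete, here is a toy sketch in plain numpy (not the repo's actual decoder; `decoder_step` is a hypothetical stand-in for the real decoder cell):

```python
import numpy as np

def decoder_step(prev_frame, state):
    # Hypothetical stand-in for the real decoder cell: predicts the
    # next mel frame from the previous frame and a recurrent state.
    new_state = 0.9 * state + 0.1 * prev_frame
    prediction = new_state + np.random.normal(0.0, 0.05, prev_frame.shape)
    return prediction, new_state

ground_truth = np.random.rand(100, 80)  # 100 mel frames, 80 channels

# Evaluation (teacher forcing): every step is conditioned on the TRUE
# previous frame, so errors never accumulate.
state = np.zeros(80)
teacher_forced = []
for t in range(100):
    prev = ground_truth[t - 1] if t > 0 else np.zeros(80)
    pred, state = decoder_step(prev, state)
    teacher_forced.append(pred)

# Synthesis (free running): every step is conditioned on the model's
# OWN previous prediction, so early mistakes compound. This is why an
# unaligned model can sound fine in evaluation yet garbled in synthesis.
state = np.zeros(80)
prev = np.zeros(80)
free_running = []
for t in range(100):
    prev, state = decoder_step(prev, state)
    free_running.append(prev)
```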

Also, please keep in mind that the batch size must not be smaller than 32, and that you can use outputs_per_step=2 or higher (preferably no higher than 3) if your GPU resources are limited.

Also feel free to have a look at your alignment plots for confirmation. Cheers!


splitice commented 6 years ago

Thank you so much for your response.

That explains why the results were so good at only 14K steps. I'll check again in the morning (expected approx. 35K steps). Alignment still leaves much to be desired (only just starting to show some light colour on the diagonal).

Currently I'm running a batch size of 16, as this seems to be the only way to get training to fit in the 8GB available. Since 8GB or less is common on consumer cards, it would be great to get some advice on appropriate hparams for these setups. The defaults need more (around 11 or 12GB, it appears).

Specifically what I changed to get the training running:

```python
tacotron_batch_size = 16
tacotron_synthesis_batch_size = 16 * 16
```

According to some other issues here, outputs_per_step=2 with a batch size of 32 would need 8.7GB (I have not confirmed this). So it's likely I would need outputs_per_step=3 to fit into 8GB. Would increasing outputs_per_step be significantly better than decreasing the batch size?

And I will certainly be reading the Wiki.

Rayhane-mamah commented 6 years ago

Yes, for an 8GB GPU you will most likely either use r=3 or constrain your data to a maximum sentence length of 700 frames (the max_mel_frames parameter). The default is 1000, but LJSpeech has no sentences longer than 800 frames.

Here are some typical values I see when training (all for batch size 32):

Whether to use r=3, or r=2 with shorter sentences, depends on your goals (chatbot vs. long-sentence reader), and it also depends on the data at hand. Some datasets will have most of their utterances around 800 frames, while others have only a small portion around those values. So my advice? Test things out and pick what works best for your end goal and your particular case.
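As a rough sketch, a low-memory configuration along those lines would look something like this in hparams.py (illustrative values only, not a tested recipe):

```python
# Illustrative settings for an ~8GB GPU (not a tested recipe):
tacotron_batch_size = 32  # keep >= 32 for stable alignment
outputs_per_step = 3      # r=3: the decoder emits 3 frames per step, so a
                          # 900-frame target takes 300 decoder steps instead
                          # of 900, roughly a 3x cut in decoder memory/time
max_mel_frames = 700      # drop the longest utterances; LJSpeech has
                          # nothing much above ~800 frames anyway
```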

Hope this was helpful


splitice commented 6 years ago

Thanks for that information, it's very helpful. I'll be experimenting with these tomorrow. Unfortunately, in a quick test with the default settings changed only to r=3 and max_mel_frames=800, I am still seeing >8GB of memory usage from python3 (and the resulting OOM crash). This occurs within 90 steps.

With r=3, max_mel_frames=800, batch_size=22, however, I was able to get it to run. I'll check on it in the morning after ~20K iterations and compare it with the r=1, batch_size=16 run I was doing.

My particular interest is seeing whether it's possible to get this model to the point where an ARM SBC could synthesize a couple of short sentences within a couple of minutes (faster is better). This means limited memory for inference (~440MB) and a lower-spec CPU compared with desktops and servers. Do you have any gut feeling regarding feasibility?

Audio quality is not the highest priority; I'm perfectly happy to sacrifice minor quality aspects for performance / model size. The main competing options are flite and similar (so the bar is fairly low). Any hparams that come to mind?

Rayhane-mamah commented 6 years ago

Your inference conditions seem quite challenging for Tacotron, which has between 27 and 29 million parameters depending on whether or not you use linear predictions. Such a high parameter count is mostly due to the RNNs, so you may want to explore convolution-based models like DeepVoice.

As for meeting your inference requirements, you will definitely need some engineering techniques, as 440MB is very limited if one wants to use the Tacotron architecture untouched (the model parameters alone are 300MB).
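A back-of-the-envelope check on those numbers (my hedged reading: the raw float32 weights are closer to ~110MB, and the ~300MB figure matches a full training checkpoint that also stores Adam's two moment tensors per weight):

```python
params = 28e6                    # ~27-29M parameters
weights_mb = params * 4 / 2**20  # float32 weights: ~107 MB
checkpoint_mb = weights_mb * 3   # + Adam's m and v slots: ~320 MB
print(round(weights_mb), round(checkpoint_mb))
```

Even the inference-only weights would eat roughly a quarter of a 440MB budget before activations and the runtime itself are counted.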

Maybe people can contribute some bright ideas that may help with your task, but in my opinion there are workarounds, from model pruning to lowering model precision, etc. And like I said, you would probably benefit more from using CNNs, as RNNs are inherently slow and tend to use large amounts of memory.

That's my personal opinion, I may of course be missing some things. In any case I wish you best of luck :)


splitice commented 6 years ago

Previously I've only ever successfully done image-recognition work with CNNs/DNNs, and only ever minor adjustments and integration of existing models. So thanks for being so helpful; I am learning a lot here.

I'm hoping that @ruiboshi's frozen graph model optimized for inference will help with memory usage. The frozen graph (including the variables turned into constants) comes down to 100MB in binary format on disk, which looks promising. The biggest unknown is the size of the remaining variables (input, output, intermediary) and what they cost in TensorFlow (e.g. does TensorFlow deallocate as it moves through forward sections of the graph, or does it keep memory around to make future inferences more performant?).

Current testing with the model loaded from a checkpoint shows the Python process using approx. 650MB of RAM on a 64-bit PC. If approx. 15% of that is pointers (back of the envelope), that would be approx. 600MB on a 32-bit architecture. This is with the additional linear predictor.

Hopefully a lot can be eliminated through removal of nodes unreferenced at inference time and variable -> constant optimizations. I expect quantization is possible in some areas too.
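For reference, the variable -> constant step is roughly this in TF 1.x (the checkpoint path and output node name below are placeholders, not necessarily this repo's actual names):

```python
import tensorflow as tf

# Sketch: freeze a checkpoint into a constants-only GraphDef (TF 1.x).
CKPT = 'logs-Tacotron/taco_pretrained/tacotron_model.ckpt'  # placeholder path
OUTPUT_NODES = ['model/inference/mel_outputs']              # placeholder name

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(CKPT + '.meta')
    saver.restore(sess, CKPT)
    # Bake variables into constants and drop every node the outputs do
    # not depend on (optimizer state, gradients, summaries, ...).
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, OUTPUT_NODES)

with tf.gfile.GFile('tacotron_frozen.pb', 'wb') as f:
    f.write(frozen.SerializeToString())
```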

For one of the models currently in training I've adjusted the spectrogram size and the resulting sample rate, so that should also provide a reduction if it works.

I'm certainly open to considering other options, including DeepVoice, if RNNs are too intensive for the target.

lkfo415579 commented 6 years ago

@Rayhane-mamah I am a little confused about these two parameters, outputs_per_step and batch_size. Could you explain why we should change outputs_per_step rather than batch_size in order to bring memory usage down? Will a lower batch_size affect quality?

splitice commented 6 years ago

@lkfo415579 In my experience larger batch sizes train and align much quicker. At 36k steps my batch_size=22 run is showing better signs of alignment. My batch_size=16 run (with similar hparams) was not this good when I ended it at 42k steps.

[Alignment plot at step 36000]

This is more than a 45% improvement in alignment (roughly in proportion to the batch size), and the step time is largely unaffected. I can only imagine the progress would be even better at batch_size=32.

splitice commented 6 years ago

@Rayhane-mamah

A Tacotron graph optimized for inference (thanks @ruiboshi) peaks at just over 300MB :)

No WaveNet model to test with just yet; training on that will begin this weekend.

Some notes: my hparams are set to produce 16kHz audio, with a reduced number of mel channels. The reasoning is that the target hardware only supports 16kHz output, so the extra memory is unnecessary.
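Roughly, the configuration looks like this (illustrative values and from-memory parameter names, so check hparams.py for the exact ones):

```python
# Illustrative 16 kHz settings (not the exact values from my run):
sample_rate = 16000  # target hardware only plays 16 kHz
num_mels = 60        # fewer mel channels than the default 80
n_fft = 1024         # smaller analysis window to match the lower rate
hop_size = 200       # 12.5 ms hop at 16 kHz
win_size = 800       # 50 ms window at 16 kHz
```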

splitice commented 6 years ago

Argh, I made a mistake. There is a peak of ~880MB that I missed because I was sampling the used memory at too low a rate.

WaveNet looks to be worse again, at 2GB.
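For anyone repeating the measurement, a tighter sampling loop avoids missing short spikes. A minimal sketch using psutil:

```python
import time
import psutil

def peak_rss_mb(pid, interval=0.005, duration=60.0):
    """Poll a process's resident set size and return the peak in MB.

    Sampling too coarsely can miss short allocation spikes, which is
    exactly how the ~880MB peak above slipped past at first.
    """
    proc = psutil.Process(pid)
    peak = 0
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            peak = max(peak, proc.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(interval)
    return peak / 2**20
```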