NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Tacotron GST: Inference not working #302

Closed ErnstTmp closed 5 years ago

ErnstTmp commented 5 years ago

I am running Tacotron GST with the mailabs or LJ dataset.

In the Audio tab in TensorBoard, the Train_audio samples sound good after the model has converged.

However, if I feed an audio file from the training set for inference, the generated samples are not understandable (even if I feed the same audio and text as in TensorBoard).

generate.csv.zip

I run inference with

python run.py --config_file=example_configs/text2speech/tacotron_gst_enu.py  --mode=infer --infer_output_file=unused

Thanks a lot, Ernst

blisc commented 5 years ago

Can you provide a sample of the eval audio from training and a sample of the learned attention alignments?

Does the problem still persist with the provided trained checkpoint? (Note it was trained with MAILABS data)

Train audio is not a good indication of model convergence. A better indicator is whether the eval audio sounds good or the learned attention alignments (train and/or eval) are approximately diagonal.
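
For reference, here is one way to inspect the alignments outside of TensorBoard. This is a minimal sketch that assumes you have dumped the attention matrix to a NumPy file; the file name and array shape are assumptions, not something OpenSeq2Seq produces by default.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dump of the decoder attention weights, shaped
# (decoder_steps, encoder_steps); OpenSeq2Seq does not write this
# file by default.
alignments = np.load("alignments.npy")

plt.imshow(alignments.T, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("decoder step")
plt.ylabel("encoder step (input characters)")
plt.title("Attention alignment: roughly diagonal = learning to attend")
plt.colorbar()
plt.savefig("alignment.png")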

ErnstTmp commented 5 years ago

1) If I use the trained checkpoint, all works well.
2) The eval audio sounds bad.
3) The eval loss is bad; see the enclosed "eval" file. Training loss looks good.
4) The alignments are not diagonal at all.
5) All training / eval data is from the mailabs German dataset.

Can you give me advice on how to proceed?

blisc commented 5 years ago

I never tested Tacotron GST on the German MAILABS dataset, so it is possible that Tacotron GST is unable to learn it.

I would advise you to limit the dataset to just one speaker and see if the eval audio becomes recognizable. If that works, add additional speakers one at a time. If that doesn't work, try further limiting the data to just one book.
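
As a concrete starting point, something like the sketch below can cut a combined csv down to one speaker. It assumes the M-AILABS-style pipe-separated layout (wav path, transcript, normalized transcript) and a hypothetical speaker directory name; adjust both to your data.

import csv

speaker = "eva_k"  # hypothetical speaker directory name appearing in the wav paths

with open("train.csv", newline="", encoding="utf-8") as src, \
     open("train_one_speaker.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|")
    for row in reader:
        if speaker in row[0]:  # keep rows whose wav path contains the speaker
            writer.writerow(row)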

I would also advise you to experiment with the emb_size, attention_layer_size, num_tokens, and num_heads parameters of the Style Encoder.
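
For orientation, these are the kinds of entries to vary; the values below are illustrative only, and the exact key names and placement should be checked against example_configs/text2speech/tacotron_gst.py:

style_embedding_params = {
    "emb_size": 512,              # size of the learned style embedding
    "attention_layer_size": 512,  # width of the token attention layer
    "num_tokens": 32,             # number of global style tokens
    "num_heads": 8,               # attention heads over the style tokens
}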

If you get it working, please let us know!

ErnstTmp commented 5 years ago

Thanks, this is great advice; I will try it and keep you updated! Just two things I do not understand:

1) Does the wav file in generate.csv define the style / speaker of the text to speak?
2) Why does a train audio sound good in TensorBoard, but fail when I supply the same line (from train.csv) via generate.csv? Does that mean there is a problem learning the GST?

blisc commented 5 years ago

Inference in Tacotron GST requires two inputs: the style wav that we condition on and the text that we want the model to produce. The wav file in generate.csv is that style wav (i.e. the model should produce speech that sounds similar to it).
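
So a generate.csv row pairs a style wav with the target text. Assuming the same pipe-separated layout as the training csv (the path below is hypothetical, and the exact column count should be checked against the data-layer docs), a row would look roughly like:

/data/MAILABS/en_US/by_book/female/judy_bieber/wavs/style_sample.wav|Text the model should speak.|Text the model should speak.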

The problem is that in train mode the model uses teacher forcing: at each time step we supply the previous time step's ground-truth spectrogram frame, whereas in eval and infer we feed back the model's own predictions.
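
In pseudocode (hypothetical names, not the OpenSeq2Seq API), the difference is just which frame gets fed back:

import numpy as np

def decode(decoder_step, ground_truth, teacher_forcing):
    """decoder_step: fn(prev_frame) -> next predicted spectrogram frame."""
    prev = np.zeros_like(ground_truth[0])  # initial "go" frame
    outputs = []
    for t in range(len(ground_truth)):
        frame = decoder_step(prev)
        outputs.append(frame)
        # train: feed back the ground truth, so errors never compound;
        # eval/infer: feed back the prediction, so they do
        prev = ground_truth[t] if teacher_forcing else frame
    return np.stack(outputs)

That is why good train audio mostly reflects teacher forcing, while eval audio and diagonal alignments reflect what the model has actually learned.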

ErnstTmp commented 5 years ago

Hi

thank you for your suggestions and for mentioning teacher forcing - now I see why the train audio is good even though the model has not generalized.

I tried increasing the hyperparameters as you suggested and trained with the full German dataset, with only some books from one speaker, and with only a single book from another speaker, but was not successful. In all cases the eval loss stayed above 10 with no improvement, and the eval audio was almost pure noise.

So I tried to reproduce your results with the mailabs/en_US dataset:

I used a fresh clone of the OpenSeq2Seq repo, ran tacotron_gst_combine_csv.py, changed tacotron_gst.py only slightly (I had to decrease the batch size to 16, since I only have a server with Titan Xs with 12 GB), and set the data location to my mailabs/en_US dataset. I ran in train_eval mode with approx. 3000 files in eval.csv.

I trained for 32,000 steps on 2 GPUs for 2.5 days; the train loss dropped below 0.6 and showed nice convergence, but the eval loss stayed around 12 and did not converge. The alignments were not diagonal (only occasionally a bit, and that vanished), and the eval audio was not understandable.

Are there any special tricks to make the mailabs/en_US dataset converge? Should I try several runs and use the checkpoint from only the best one? Or is there a problem with the smaller batch size - do I need to get bigger GPUs :-))?

BTW - are you or any of your colleagues at SLT2018 in Athens? I am going there right now ...

Thanks & kind regards

Ernst

blisc commented 5 years ago

For German, I have seen success using "Toten Seelen" read by Eva K with the regular Tacotron model (without GST).

For both German and en_US, I trained the models on 1 GV100 with a batch size of 32. I would recommend using 32 files for eval.
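
If GPU memory is the limiting factor, one thing that may be worth checking (verify against the current OpenSeq2Seq docs; I have not confirmed it for this model) is gradient accumulation via iter_size, which emulates a larger effective batch. Illustrative config lines:

base_params = {
    "batch_size_per_gpu": 16,  # what fits on a 12 GB Titan X
    "iter_size": 2,            # if supported, accumulates gradients so the
                               # effective batch is 16 * 2 = 32
}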

ErnstTmp commented 5 years ago

Hi blisc, thanks a lot! The batch size seems to be the problem: I trained on Titan Xs, the memory was too small, and I had to reduce the batch size. I will retry on a V100 with batch size 32.