NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Tacotron: inference producing bad results after training #369

Closed mehulsuresh closed 5 years ago

mehulsuresh commented 5 years ago

Tried to follow the instructions to train Tacotron on the LJ Speech dataset. The model converges, but I am not able to produce good results in inference, for example: https://drive.google.com/open?id=1Gozn18jlzgu0_yja8z9_hJraRlBsnbSa which is completely incomprehensible.
Training on 4 V100s using the same config file as tacotron_float.py. The only difference is that I am not using Horovod. (attached image: train)

I looked at issues #302 and #345, and it seems like the problem there was a smaller batch size; however, I am using the default batch size of 48. Should I train with a larger batch size, or is the issue with not using Horovod? Let me know if there is anything I can try to produce better results.

Also, the train_audio sounds good in TensorBoard.

Thanks :)

blisc commented 5 years ago

Train audio will always sound good due to teacher forcing. Train loss is also not a good measure of convergence. The best test for convergence is whether your attention weights are close to diagonal. Please try the 1-GPU config and see if it works. Use all of LJSpeech for training (i.e., do not make a validation or test split).
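As a rough illustration of the "close to diagonal" check, a score like the one below could be computed from an attention/alignment matrix. This is only a sketch; the function name and the (decoder_steps, encoder_steps) shape convention are assumptions and not part of OpenSeq2Seq's API, which only plots the alignments.

```python
import numpy as np

def diagonal_focus(alignment):
    """Rough score in [0, 1] for how closely an attention matrix follows the
    diagonal. `alignment` is assumed to have shape (decoder_steps, encoder_steps)
    with each row summing to 1 (softmax attention per decoder step).
    A well-trained Tacotron should score close to 1."""
    dec_steps, enc_steps = alignment.shape
    score = 0.0
    for t in range(dec_steps):
        # Encoder position the decoder "should" attend to at step t
        expected = int(t / dec_steps * enc_steps)
        # Attention mass within a small window around the diagonal
        lo, hi = max(0, expected - 3), min(enc_steps, expected + 4)
        score += alignment[t, lo:hi].sum()
    return score / dec_steps
```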

mehulsuresh commented 5 years ago

Thanks for your quick response! Where can I see the attention plot? I am using all of the data. Will try running the model with the 1-GPU config and also with the nvidia-docker container and report back.

blisc commented 5 years ago

By default, it should log to TensorBoard under Images. If it is not in TensorBoard, it is probably saved to your logdir as a PNG file.
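If it is easier to look at the files on disk directly, something like this lists any saved plots (just a sketch; the logdir path shown is hypothetical, use whatever logdir is set in your config):

```python
import glob
import os

logdir = "result/tacotron-LJ-float"  # hypothetical: replace with the logdir from your config

# Recursively list any PNG plots (e.g. alignment images) written to the logdir
for path in sorted(glob.glob(os.path.join(logdir, "**", "*.png"), recursive=True)):
    print(path)
```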

mehulsuresh commented 5 years ago

This is what it looks like: (attached alignment plot: output_step0_0_infer)

mehulsuresh commented 5 years ago

(attached alignment plot: individualimage)

blisc commented 5 years ago

Your plot for alignments is not very good, so it is a pretty clear indication that the eval audio will be bad. A good plot for alignments would be a clear diagonal line starting from the bottom right corner. Try changing https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/text2speech/tacotron_float.py#L16 from "both" to "mel".
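For reference, the suggested edit would look roughly like this in the config file (a sketch based on the top-level `output_type` variable quoted later in this thread; see the linked line in tacotron_float.py for the exact context):

```python
# example_configs/text2speech/tacotron_float.py (around the linked line)
# output_type = "both"   # original setting
output_type = "mel"      # suggested change: predict only mel spectrograms
```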

mehulsuresh commented 5 years ago

Ok, tried training with the Docker container and was able to get the expected results. (attached alignment plot: individualImage (1))

However, I noticed some issues:

1. After training with mixed precision, infer doesn't work because of this error: TypeError: Input 'wci' of 'LSTMBlockCell' Op has type float32 that does not match type float16 of argument 'x'.

2. Tried training the model with Docker and a very similar config file on a much larger dataset from MAILABS, and it was not able to produce good results. (attached alignment plot: individualImage)

The changes to the config file were as follows:
dataset = "MAILABS"
dataset_location = "/var/log/nginx/MAILABS"
output_type = "both"
from
dataset = "LJ"
dataset_location = "/var/log/nginx/LJSpeech"
output_type = "both"
and
batch_size = 55
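In config-file form, the edited top-level variables would look roughly like this (a sketch reconstructed from the values quoted above; the dataset paths are specific to this setup):

```python
# LJ Speech run (original):
# dataset = "LJ"
# dataset_location = "/var/log/nginx/LJSpeech"
# output_type = "both"

# MAILABS run (changed):
dataset = "MAILABS"
dataset_location = "/var/log/nginx/MAILABS"
output_type = "both"
batch_size = 55
```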

Trying to train on MAILABS with output_type="mel"

Will report back if the results improve.

Attached config file used: british.zip

blisc commented 5 years ago

Regarding MAILABS, Tacotron is a single-speaker model, so please ensure that all of your training data is from one speaker. I would even go as far as to use only a single book from a single speaker.

Do you know what the difference is between the docker container and what you were using before the container?

mehulsuresh commented 5 years ago

With MAILABS I'm using only one speaker, but two books from the same speaker. After I changed output_type="mel", the network is still training but the results already look a lot more promising (attached alignment plot: individualImage (2)), only 2500 epochs in. Not sure why changing 'both' to 'mel' makes it perform better.

The only thing that I feel is different between the old way of training and the Docker container is Horovod and a slightly higher batch size; everything else is pretty much the same.

Also, is it not possible to do inference when training with mixed precision?

mehulsuresh commented 5 years ago

Looks like the only way to train on the MAILABS dataset is to use "mel" instead of "both". However, after training, the inference audio is not as clear as LJSpeech trained on "both". Attached: Audio_Samples.zip

blisc commented 5 years ago

Have you tried using the same books as the ones published in our documentation? "Jane Eyre" read by Elizabeth Klett and "North And South" read by Mary Ann.

mehulsuresh commented 5 years ago

Yeah, I trained it on "Jane Eyre" by Elizabeth Klett.

blisc commented 5 years ago

If you want, you can try to use a neural vocoder (WaveGlow or WaveNet) to improve your audio quality.
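For example, a mel spectrogram predicted by Tacotron could be vocoded with the pretrained WaveGlow published on PyTorch Hub. This is only a rough sketch, not part of OpenSeq2Seq: the hub entry point, the tensor shapes, and the input file name are assumptions, and OpenSeq2Seq's mel features may need to be rescaled to match the log-mel normalization WaveGlow was trained on.

```python
import torch
from scipy.io.wavfile import write

# Load NVIDIA's pretrained WaveGlow from PyTorch Hub (assumed entry point).
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

# Hypothetical input: a (1, 80, time) float tensor of mel frames produced by the
# Tacotron decoder, saved earlier; it must match WaveGlow's expected mel scale
# (80 bands, 22050 Hz audio) to sound right.
mel = torch.load('tacotron_mel.pt').to('cuda')

with torch.no_grad():
    audio = waveglow.infer(mel)  # (1, samples) waveform

write('sample.wav', 22050, audio[0].cpu().numpy())
```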