NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Model cannot converge #254

Closed HiiamCong closed 4 years ago

HiiamCong commented 5 years ago

Hello, I have a question.

I'm using a dataset with >17k sentences (about 30 hours of audio), 90% for training and 10% for validation. It's been training for 3 days (with batch_size 8) and has reached epoch 56. Please see the training info below: [grad norm plot] [training loss plot] [validation loss plot]. I thought it looked good, but when I tested it, the output audio was wrong and the attention looks awful: [attention alignment plot]

And the loss doesn't seem able to decrease any further. Do I have to train for more epochs, or is there something wrong with my dataset, or something else? Please help me, thank you guys so much.

wizardk commented 5 years ago
  1. Attention with n_frame_per_step = 1 is hard to converge.
  2. Convergence of attention needs more time.
  3. Adding an EOS symbol will help and accelerate convergence of attention.
terryyizhong commented 5 years ago
> 1. n_frame_per_step

Does this repo support n_frame_per_step larger than 1 now?

terryyizhong commented 5 years ago

Same problem; I suppose it's because of the batch_size.

HiiamCong commented 5 years ago

@wizardk Thank you for your advice, I'll wait a few more epochs to see what happens.

HiiamCong commented 5 years ago

@terryyizhong Yeah, I think so too; batch_size 8 is pretty small. Unfortunately, my RTX 2080 Ti can't run with a batch_size of more than 10, and I'm not sure why. Maybe the input audio durations are too long (all of my audio clips are under 20 s). By the way, when I trained the model with about 4k sentences, the attention converged perfectly.

terryyizhong commented 5 years ago

@HiiamCong Do you mean your attention converged perfectly with 4k sentences even though the batch_size was 8?

HiiamCong commented 5 years ago

@terryyizhong Yes, and the output audio was not perfect but understandable. To get smoother output audio, I tried increasing the training data to 17k sentences, and then ran into this attention problem.

hadaev8 commented 5 years ago

Batch size is very important; I was only able to converge at 16, and only after using transfer learning from the English model.

HiiamCong commented 5 years ago

@hadaev8 I want to use Tacotron 2 for another language, so I have to train the model from scratch. Do you have any ideas for increasing the batch size?

hadaev8 commented 5 years ago

The English checkpoint works pretty well for a Russian dataset. You may also drop overly long audio clips from the dataset (the LJSpeech maximum length is about 10 seconds) or use Google Colab with a 15 GB T4 GPU.
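A minimal sketch of that duration cutoff, assuming the pipe-separated filelist format (wav_path|transcript) this repo uses and the soundfile package; the paths are illustrative:

```python
import soundfile as sf

MAX_SECONDS = 10.0  # LJSpeech-like cap suggested above

# Keep only clips at or under MAX_SECONDS from a "wav_path|transcript" filelist.
with open("filelists/train.txt", encoding="utf-8") as f:
    lines = [l.strip() for l in f if l.strip()]

kept = []
for line in lines:
    wav_path = line.split("|")[0]
    info = sf.info(wav_path)  # reads the header only, no full decode
    if info.frames / info.samplerate <= MAX_SECONDS:
        kept.append(line)

with open("filelists/train_short.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")
```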

terryyizhong commented 5 years ago

Thanks for the info. By the way, I failed to learn alignment on another-language dataset at first, and succeeded after changing the decoder dropout from 0.1 to 0.5 (the value used by another Tacotron 2 repo), with batch size 48.
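For reference, a minimal sketch of where that change lives, assuming this repo's create_hparams(); the default for both dropouts is 0.1:

```python
from hparams import create_hparams

hparams = create_hparams()
# Defaults in this repo are 0.1 for both; the experiment above raised
# only the decoder dropout to 0.5 and left the attention dropout alone.
hparams.p_attention_dropout = 0.1
hparams.p_decoder_dropout = 0.5
```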

HiiamCong commented 5 years ago

@terryyizhong The defaults are p_attention_dropout=0.1 and p_decoder_dropout=0.1; what about the attention dropout? Did you also increase it?

wizardk commented 5 years ago
> 1. n_frame_per_step
>
> Does this repo support n_frame_per_step larger than 1 now?

Check this: https://github.com/BogiHsu/Tacotron2-PyTorch

wizardk commented 5 years ago

> @wizardk Thank you for your advice, I'll wait a few more epochs to see what happens.

Finally, you should get an alignment like this: [alignment plot]

terryyizhong commented 5 years ago

> Check this: https://github.com/BogiHsu/Tacotron2-PyTorch

@wizardk Thanks for the link. Have you tried that repo? How does its performance compare to this one?

terryyizhong commented 5 years ago

> @terryyizhong The defaults are p_attention_dropout=0.1 and p_decoder_dropout=0.1; what about the attention dropout? Did you also increase it?

@HiiamCong No, I'm still experimenting with the dropout. I also found that the formatting of punctuation is very important for learning attention. After cleaning up the spaces before some punctuation marks, the attention alignment started to look better, though it still hadn't converged after 100k steps.
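A minimal sketch of that punctuation cleanup, assuming the goal is simply to delete whitespace that precedes punctuation in the transcripts:

```python
import re

def clean_punctuation_spacing(text):
    # Remove whitespace immediately before common punctuation marks,
    # e.g. "hello , world !" -> "hello, world!"
    return re.sub(r"\s+([,.!?;:])", r"\1", text)

print(clean_punctuation_spacing("hello , world !"))  # hello, world!
```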

HiiamCong commented 5 years ago

@terryyizhong Have you tried decreasing the learning rate?

terryyizhong commented 5 years ago

> @terryyizhong Have you tried decreasing the learning rate?

@HiiamCong Yeah, I'm using exponential learning rate decay. But now I'm wondering whether I should decrease the learning rate before the attention converges, or whether I should instead "increase" the initial learning rate to help the attention learn.

terryyizhong commented 5 years ago

Here are my attention plots and loss curves at step 130k. I'm using private English data, 5 hours (3,000 sentences in total, including 800 sentences from LJ recorded in another voice), batch size 32; the other parameters are the defaults.
[attention alignment and loss plots]

I think the plot is becoming diagonal, but it hasn't changed much over the last 30k steps. Any suggestions on learning the alignment? By the way, I learned the alignment successfully on the LJSpeech dataset in 30k steps using the same parameters. @rafaelvalle

HiiamCong commented 5 years ago

@terryyizhong Thanks for the information. By the way, how did you implement exponential learning rate decay with this NVIDIA Tacotron code? I can't find these settings in hparams.

terryyizhong commented 5 years ago

> @terryyizhong Thanks for the information. By the way, how did you implement exponential learning rate decay with this NVIDIA Tacotron code? I can't find these settings in hparams.

I just added code like `learning_rate = init_lr * (0.01 ** (epoch / 1000.0))` in the main loop of train.py.
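A minimal sketch of how that line would slot into the epoch loop, assuming the usual PyTorch pattern of writing the new value into the optimizer's parameter groups (the model and optimizer here are stand-ins):

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the Tacotron 2 model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

init_lr = 1e-3  # hparams.learning_rate default in this repo
for epoch in range(1000):
    # Exponential decay: the LR reaches init_lr * 0.01 at epoch 1000.
    learning_rate = init_lr * (0.01 ** (epoch / 1000.0))
    for param_group in optimizer.param_groups:
        param_group["lr"] = learning_rate
    # ... batch loop: forward, loss, backward, optimizer.step() ...
```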

Clement-Hui commented 5 years ago

Batch size 8 is fine; the model converges perfectly at this batch size. I used a 3-4 hour dataset without any punctuation plus the pretrained English model, and the model converged around 5k-10k steps without problems. I set the dropout to 0.4 for both the attention and the decoder, with no exponential learning rate decay. The model started overfitting around 10k steps. The speech is perfectly understandable. The dataset is a Cantonese dataset, without any tone labels.

terryyizhong commented 5 years ago

> Batch size 8 is fine; the model converges perfectly at this batch size. I used a 3-4 hour dataset without any punctuation plus the pretrained English model, and the model converged around 5k-10k steps without problems. I set the dropout to 0.4 for both the attention and the decoder, with no exponential learning rate decay. The model started overfitting around 10k steps. The speech is perfectly understandable. The dataset is a Cantonese dataset, without any tone labels.

How can you train a Cantonese model using a pretrained English model?

Clement-Hui commented 5 years ago

Oh, I use Jyutping, which simply transliterates Cantonese Unicode characters into letters and numbers; "你好", which means hello, gets converted to nei5 hou2. Also, I added EOS and symbols that indicate word boundaries to the input strings in a second training run.
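A toy sketch of that transliteration plus the EOS idea from earlier in the thread; the lookup table only covers the example above (a real pipeline would use a full Jyutping dictionary, e.g. the pycantonese package):

```python
# Toy character -> Jyutping lookup covering just the example above.
JYUTPING = {"你": "nei5", "好": "hou2"}

def to_jyutping(text, eos=";"):
    # Transliterate character by character and append an EOS symbol.
    romanized = " ".join(JYUTPING.get(ch, ch) for ch in text)
    return romanized + eos

print(to_jyutping("你好"))  # nei5 hou2;
```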

terryyizhong commented 5 years ago

> Oh, I use Jyutping, which simply transliterates Cantonese Unicode characters into letters and numbers; "你好", which means hello, gets converted to nei5 hou2. Also, I added EOS and symbols that indicate word boundaries to the input strings in a second training run.

@Clement-Hui Thanks for your reply! I'm surprised it works that way, since I thought the character embeddings of the two languages would be different. Besides, changing the attention dropout rate to 0.4 helps a lot! I'm now getting a better attention alignment in only 13k steps; I hope it will converge successfully! [attention alignment plot]

Clement-Hui commented 5 years ago

> > Oh, I use Jyutping, which simply transliterates Cantonese Unicode characters into letters and numbers; "你好", which means hello, gets converted to nei5 hou2. Also, I added EOS and symbols that indicate word boundaries to the input strings in a second training run.
>
> @Clement-Hui Thanks for your reply! I'm surprised it works that way, since I thought the character embeddings of the two languages would be different. Besides, changing the attention dropout rate to 0.4 helps a lot! I'm now getting a better attention alignment in only 13k steps; I hope it will converge successfully!

The model uses the --warm_start parameter and removes the embedding layer, so it can be retrained (see the sketch after the plots below). symbols.py is mostly the same, but only includes a-z, 1-6, spaces, and punctuation. Also, I'm training the model by considering the structure in 3 pieces, treating each Cantonese character as three parts instead of as alphabet letters. It took a longer time to converge, as it is quite different from English, but it is converging.

[alignment plots at 3200, 5300, 14100, 21400, and 42900 steps]
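For reference, a sketch of what --warm_start does, assuming the usual pattern in this repo's train.py of loading the checkpoint and dropping the layers named in hparams.ignore_layers (the embedding, whose shape depends on symbols.py):

```python
import torch

def warm_start_model(checkpoint_path, model, ignore_layers):
    # Load pretrained weights but skip layers whose shapes depend on the
    # symbol set (e.g. "embedding.weight"), so a new language's symbols.py
    # can be combined with the English-pretrained checkpoint.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    state_dict = {k: v for k, v in checkpoint["state_dict"].items()
                  if k not in ignore_layers}
    merged = model.state_dict()
    merged.update(state_dict)
    model.load_state_dict(merged)
    return model

# Typical invocation (per the repo README):
#   python train.py --output_directory=outdir --log_directory=logdir \
#       -c tacotron2_statedict.pt --warm_start
```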

HiiamCong commented 5 years ago

> Batch size 8 is fine; the model converges perfectly at this batch size. I used a 3-4 hour dataset without any punctuation plus the pretrained English model, and the model converged around 5k-10k steps without problems. I set the dropout to 0.4 for both the attention and the decoder, with no exponential learning rate decay. The model started overfitting around 10k steps. The speech is perfectly understandable. The dataset is a Cantonese dataset, without any tone labels.

Thanks for the useful information. By the way, what was your validation loss when the model converged?

Clement-Hui commented 5 years ago

It was around 0.4 and increased to 4.8 after 10k more steps. The output audio hasn't changed in quality.

chazo1994 commented 4 years ago
> 1. Attention with n_frame_per_step = 1 is hard to converge.
> 2. Convergence of attention needs more time.
> 3. Adding an EOS symbol will help and accelerate convergence of attention.

I'm sorry, but can you help me understand what EOS is and how to add it to the source code? Please!

Clement-Hui commented 4 years ago

EOS stands for end of sentence. You can use a symbol such as a semicolon to represent EOS; just append the symbol to the end of each line.
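A minimal sketch of that change, assuming the pipe-separated filelist format (wav_path|transcript) and a semicolon as the EOS symbol (it must be a character already present in symbols.py):

```python
EOS = ";"  # any symbol already listed in symbols.py will do

# Append an EOS marker to every transcript in a "wav_path|transcript" filelist.
with open("filelists/train.txt", encoding="utf-8") as f:
    lines = [l.rstrip("\n") for l in f if l.strip()]

with open("filelists/train_eos.txt", "w", encoding="utf-8") as f:
    for line in lines:
        wav_path, text = line.split("|", 1)
        f.write(f"{wav_path}|{text.rstrip()}{EOS}\n")
```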

rafaelvalle commented 4 years ago

Closing due to inactivity.

sabat84 commented 2 years ago

@rafaelvalle Hi, I wanted to train Tacotron 2 from scratch with 4,652 sentences (a Kurdish dataset, 10 hours), batch size 32. Here are some plots: [training plots] Is this model going to converge or not? Please help.