CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Get better quality when training a new voice? #121

Open licktion opened 7 years ago

licktion commented 7 years ago

I'm trying to train a new voice using Merlin, but the generated voice doesn't sound very natural. Could anyone kindly provide some hints about how I can improve the quality of the voice (e.g., which parameters to adjust when training the model, how to do better alignment, etc.)? BTW, my final acoustic model training error is around 177. Is that normal or still too high? Thanks in advance!

ronanki commented 7 years ago

Please provide final MCD and RMSE scores.

licktion commented 7 years ago

@ronanki Thanks for your reply!

calculating MCD
Develop: DNN -- MCD: 6.951 dB; BAP: 0.252 dB; F0 RMSE: 59.096 Hz; CORR: 0.411; VUV: 35.246%
Test: DNN -- MCD: 8.186 dB; BAP: 0.247 dB; F0 RMSE: 45.744 Hz; CORR: 0.364; VUV: 14.844%
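(For reference, MCD here is the mel-cepstral distortion between natural and generated mel-cepstra, averaged over frames and usually excluding c0. A minimal sketch of the standard formula, assuming frame-aligned raw float32 MGC files; the file names and the 60-dim order are hypothetical:)

```python
import numpy as np

def mcd_db(ref_mgc, gen_mgc):
    """Mean mel-cepstral distortion in dB between two frame-aligned MGC
    matrices of shape (frames, order+1); c0 (energy) is excluded."""
    diff = ref_mgc[:, 1:] - gen_mgc[:, 1:]
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

# Hypothetical: two aligned 60-dim MGC files stored as raw float32.
ref = np.fromfile("ref_0001.mgc", dtype=np.float32).reshape(-1, 60)
gen = np.fromfile("gen_0001.mgc", dtype=np.float32).reshape(-1, 60)
print("MCD: %.3f dB" % mcd_db(ref, gen))
```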

bajibabu commented 7 years ago

Can you tell how many utterances are in the develop and test sets?

licktion commented 7 years ago

@bajibabu Thanks! Train = 3000, Valid = 107, Test = 107. The average utterance length is around 13 s.

licktion commented 7 years ago

The architecture I used is:
hidden_layer_size: [1024, 1024, 1024, 1024, 384]
hidden_layer_type: ['TANH', 'TANH', 'TANH', 'TANH', 'BLSTM']
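(For reference, these two lines live in the [Architecture] section of the acoustic conf. A rough sketch of that section, assuming the layout of Merlin's demo confs and using only the values mentioned in this thread; exact keys may differ in your setup:)

```ini
[Architecture]
hidden_layer_size: [1024, 1024, 1024, 1024, 384]
hidden_layer_type: ['TANH', 'TANH', 'TANH', 'TANH', 'BLSTM']
# values below are the ones mentioned elsewhere in this thread
sequential_training: True
learning_rate: 0.002
training_epochs: 25
```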

bajibabu commented 7 years ago

Strange! You got much lower F0 and VUV error values on the test set than on the develop set. I haven't encountered this kind of result in my experiments...

licktion commented 7 years ago

Oh, is it supposed to be higher than the dev set? The final validation error is 209, which is higher than the train error.

bajibabu commented 7 years ago

In Merlin, the network is guided by the develop set error and it never sees the test set during training. Thus, the error on the test set is supposed to be higher than (or, in the best case, equal to) the develop set error. But in your case I am seeing mixed results :|

zhizhengwu commented 7 years ago

@licktion can you post the training and validation errors per epoch?

licktion commented 7 years ago

@bajibabu Yes, I agree that the test set doesn't appear during the training process. Actually, the training voices came from different speakers, but the test set was from only one speaker. Could that be the reason? I don't think it influences the training process much, though.

@zhizhengwu Thank you! BTW, I disabled early stopping.

epoch 1, validation error 211.265839, train error 181.078751 time spent 8650.93
epoch 2, validation error 210.860306, train error 179.878830 time spent 8662.58
epoch 3, validation error 210.582703, train error 179.460205 time spent 8662.85
epoch 4, validation error 210.379105, train error 179.158936 time spent 8665.37
epoch 5, validation error 210.210251, train error 178.941681 time spent 8665.97
epoch 6, validation error 210.077393, train error 178.781006 time spent 8664.16
epoch 7, validation error 209.961990, train error 178.607132 time spent 8673.28
epoch 8, validation error 209.862061, train error 178.487793 time spent 8547.55
epoch 9, validation error 209.792206, train error 178.422043 time spent 8544.03
epoch 10, validation error 209.684082, train error 178.317108 time spent 8442.68
epoch 11, validation error 210.245239, train error 178.780334 time spent 8449.64
epoch 12, validation error 209.507721, train error 178.201080 time spent 8509.56
epoch 13, validation error 209.206833, train error 177.857956 time spent 8588.61
epoch 14, validation error 209.042862, train error 177.679962 time spent 8578.24
epoch 15, validation error 208.979248, train error 177.572327 time spent 8389.32
epoch 16, validation error 208.945328, train error 177.516541 time spent 8525.13
epoch 17, validation error 208.935669, train error 177.482834 time spent 8469.65
epoch 18, validation error 208.937714, train error 177.466873 time spent 8406.04
epoch 19, validation error 208.932465, train error 177.457626 time spent 8616.94
epoch 20, validation error 208.930527, train error 177.450836 time spent 8490.71
epoch 21, validation error 208.929184, train error 177.447876 time spent 8480.22
epoch 22, validation error 208.929337, train error 177.446335 time spent 8549.86
epoch 23, validation error 208.929382, train error 177.445709 time spent 8385.04
epoch 24, validation error 208.929398, train error 177.445358 time spent 8384.29
epoch 25, validation error 208.929428, train error 177.445160 time spent 8382.64
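(A quick way to eyeball convergence is to parse that log and plot both curves. A minimal sketch, assuming the lines have exactly the format pasted above and live in a hypothetical training.log:)

```python
import re
import matplotlib.pyplot as plt

# Lines look like:
#   epoch 12, validation error 209.507721, train error 178.201080 time spent 8509.56
pattern = re.compile(r"epoch (\d+), validation error ([\d.]+), train error ([\d.]+)")

epochs, val_err, train_err = [], [], []
with open("training.log") as f:          # hypothetical log file name
    for m in pattern.finditer(f.read()):
        epochs.append(int(m.group(1)))
        val_err.append(float(m.group(2)))
        train_err.append(float(m.group(3)))

plt.plot(epochs, train_err, label="train error")
plt.plot(epochs, val_err, label="validation error")
plt.xlabel("epoch")
plt.ylabel("error")
plt.legend()
plt.savefig("error_curves.png")
```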

bajibabu commented 7 years ago

@licktion Can you share 1 or 2 samples of your data, if that is OK with you.

zhizhengwu commented 7 years ago

Your network is converging, but very slowly. What is your learning rate?

Can you share a label file and the question file?

licktion commented 7 years ago

@bajibabu For sure. The attachment includes one sample input and one output. Sorry, I have some problems uploading zips; please change the extension to .zip manually.

ronanki commented 7 years ago

Did you set the sequential_training variable to True in the conf files?

bajibabu commented 7 years ago

I listened to the samples. Your training samples are of good quality. How many voices are in your training data? From the output sample, your network sounds biased toward a child's voice. How many children's voices are in your data?

licktion commented 7 years ago

@zhizhengwu Agreed. As you noticed, it is very slow. The learning rate is 0.002. One thing I am doing now is increasing the learning rate to 0.005 and training the model again, but I'm wondering if this is still too small. The first few results I got with the new learning rate are:

epoch 1, validation error 210.908356, train error 180.628754 time spent 8449.61
epoch 2, validation error 210.440292, train error 179.477646 time spent 8407.66
epoch 3, validation error 210.188522, train error 179.035568 time spent 8402.30
epoch 4, validation error 210.006149, train error 178.754547 time spent 8497.85
epoch 5, validation error 209.828201, train error 178.549194 time spent 8636.84

This is one sample label file from phone alignment. Please change the extension to .lab manually. input sample.pdf

Sorry, what is the question file?

licktion commented 7 years ago

@ronanki Yes, I am pretty sure I did that.

input_output.pdf

These are the input and output voices. Please change the extension to .zip manually.

BTW, do you have any comment on MCD and RMSE scores? Do they look normal?

licktion commented 7 years ago

@bajibabu Sorry, I uploaded the wrong voice... input_output.pdf I cannot upload zips... I don't know why. I used 3000 in my training.

ronanki commented 7 years ago

@licktion You need to check a lot of things while building a voice.

  1. Use the copy-synthesis script on a few audio files to see if the extracted features are correct.
  2. If you are able to re-generate with good quality, then use the WORLD feature extraction script to extract vocoder parameters.
  3. Double-check your alignments using either audacity/wavesurfer.

Apart from that, your high VUV error suggests that either the alignments are not properly done or the method WORLD uses to extract F0 is not up to the mark on your database. Consider changing the F0 extraction algorithm.
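(On point 1: if you want a quick sanity check outside Merlin's own scripts, a WORLD analysis/resynthesis round-trip with the pyworld bindings looks roughly like the sketch below; the file names are placeholders. If the copy-synthesis already sounds bad, the problem is in feature extraction rather than in the network.)

```python
import numpy as np
import soundfile as sf
import pyworld

x, fs = sf.read("train_0001.wav")            # placeholder file name
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pyworld.harvest(x, fs)               # F0 contour; dio() is the faster alternative
f0 = pyworld.stonemask(x, f0, t, fs)         # F0 refinement
sp = pyworld.cheaptrick(x, f0, t, fs)        # spectral envelope
ap = pyworld.d4c(x, f0, t, fs)               # aperiodicity

y = pyworld.synthesize(f0, sp, ap, fs)
sf.write("train_0001_copysyn.wav", y, fs)
```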

dreamk73 commented 7 years ago

The question filename is set in your conf/global_settings.cfg file and the question file is stored in merlin/misc/questions. It lists the features that are present in your label_state_align files and the separators used to find individual features. Merlin uses it to process the linguistic features and normalize them to binary features. So if there is a mismatch between the feature string in your label_state_align files and the question file, your features are not normalized correctly.
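(As a rough illustration of that matching step: each QS question is a set of wildcard patterns, and a label answers the question with 1 if any pattern matches. Merlin's real implementation lives in its label normalisation code; the label and question below are made-up examples.)

```python
import fnmatch

label = "x^sil-dh+ax=s@1_2/A:0_0_0/B:1-1-2"   # shortened, made-up full-context label
question = "{*-dh+*,*-t+*,*-d+*}"              # made-up QS pattern set

patterns = question.strip("{}").split(",")
feature = int(any(fnmatch.fnmatchcase(label, p) for p in patterns))
print(feature)   # 1 -> the current phone matched one of the patterns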

simonkingedinburgh commented 7 years ago

One debug technique is to look at the min and max value of each binary feature: if min==max then that feature is constant for all frames (across your training set, say) which means it carries no information - this is an indication that it is not being extracted correctly from the label name by the question set.

This technique will find questions that are not matching any patterns in your labels. But it won't directly find elements in your label names (e.g., new features you have added) that are not being queried by any question.
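(A minimal sketch of that check, assuming the binary label features are stored as raw float32 files and that the label dimension is 416 to match questions-radio_dnn_416.hed; the directory path is a placeholder:)

```python
import glob
import numpy as np

lab_dim = 416                                    # matches questions-radio_dnn_416.hed
files = glob.glob("binary_label_416/*.lab")      # placeholder directory

mins = np.full(lab_dim, np.inf)
maxs = np.full(lab_dim, -np.inf)
for path in files:
    feats = np.fromfile(path, dtype=np.float32).reshape(-1, lab_dim)
    mins = np.minimum(mins, feats.min(axis=0))
    maxs = np.maximum(maxs, feats.max(axis=0))

constant = np.where(mins == maxs)[0]
print("constant (uninformative) feature dimensions:", constant)
```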


licktion commented 7 years ago

@ronanki @dreamk73 @simonkingedinburgh Thanks! I will go back and check the alignment and feature extraction before doing any parameter tuning in training. Once I get some results, I will post them here.

@zhizhengwu Here is one label file and the question file provided in merlin/misc/questions. Do you have any comment? questions-radio_dnn_416.pdf (extension .hed) one label.pdf (extension .lab)

zhizhengwu commented 7 years ago

@licktion it looks like you are using the default question file and the English labels extracted by Festival. I did not see any problem there. As Simon and Srikanth suggested, please check your input and output features. copy-synthesis will let you know whether the extracted features are correct or not.

It sounds like the synthesized voice has a higher F0 than the original one.
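(One quick way to check that impression is to compare the mean voiced F0 of a natural and a synthesized file, e.g. with the pyworld bindings; the file names below are placeholders.)

```python
import numpy as np
import soundfile as sf
import pyworld

def mean_voiced_f0(path):
    """Mean F0 in Hz over voiced frames, extracted with WORLD's harvest."""
    x, fs = sf.read(path)
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pyworld.harvest(x, fs)
    f0 = pyworld.stonemask(x, f0, t, fs)
    return float(f0[f0 > 0].mean())

print("natural     :", mean_voiced_f0("natural_0001.wav"))
print("synthesized :", mean_voiced_f0("synthesized_0001.wav"))
```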

licktion commented 7 years ago

@zhizhengwu Yes, I did. I used the default phone alignment provided in Merlin. I noticed that during phone alignment there were many warnings about NaN problems; I guessed that might be one reason. What I'm not quite sure about is this: if there is an alignment problem, which factors may cause it? For example, should I avoid long pauses in my wav files, or is there something else like that I need to watch for? I added longer silences before and after each utterance because I remember that short silences may cause alignment problems, but I may have overlooked other things. I tried to re-synthesize the voice. It is not exactly the same as the original, but the quality is not bad. The F0 of the synthesized voice differs from the sample voice because I used voices from different speakers in training. Thank you very much!