CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Bad quality after changing voice data (same config, good data quality) #258

Open · chazo1994 opened this issue 7 years ago

chazo1994 commented 7 years ago

I'm training a TTS model with a DNN (6 tanh layers). I trained voice1 with 8565 utterances and got good quality, with the following results:

2017-09-29 06:48:39,443 INFO main: calculating MCD
2017-09-29 06:48:45,024 INFO main: Develop: DNN -- MCD: 5.433 dB; BAP: 0.166 dB; F0:- RMSE: 32.575 Hz; CORR: 0.818; VUV: 19.215%
2017-09-29 06:48:45,024 INFO main: Test : DNN -- MCD: 5.209 dB; BAP: 0.158 dB; F0:- RMSE: 29.696 Hz; CORR: 0.848; VUV: 15.699%

But when I switched to training voice2, with 5200 utterances, I got very bad quality:

2017-09-23 19:48:56,214 INFO main: calculating MCD
2017-09-23 19:49:02,468 INFO main: Develop: DNN -- MCD: 8.389 dB; BAP: 0.143 dB; F0:- RMSE: 29.684 Hz; CORR: 0.744; VUV: 13.385%
2017-09-23 19:49:02,469 INFO main: Test : DNN -- MCD: 8.047 dB; BAP: 0.124 dB; F0:- RMSE: 27.670 Hz; CORR: 0.735; VUV: 13.203%

With voice2's output, I cannot hear intelligible speech.

Note: the two voice datasets come from different speakers.

So my question is: when I switch to voice data from a different speaker, which parameters (or which parts of the architecture) need to be changed? And how can I improve the model trained on voice2?

P.S.: I checked the vocoder parameters and they are OK. I also extracted features from the wav files with the WORLD vocoder and resynthesised them with the same parameter values as in the training config, and that sounds OK. But when I take the parameter files extracted from an original wav by WORLD (lf0, bap, mgc), swap in the mgc file produced by the voice2 model, and synthesise from them, I get a bad wav, just like the output of the voice2 model.

ronanki commented 7 years ago

The high VUV values for both voices indicate that the extracted F0 is not consistent. Therefore, you should consider replacing the F0 extracted by WORLD with F0 from another estimator (e.g. REAPER) -- we plan to do this, but it may take time considering other things in the pipeline.

However, MCD is also much worse for voice2 -- which means there must be something wrong with either the MGC parameters or the label files (incorrect alignments). Also, please check the training errors: at which epoch did training converge?
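For anyone who wants to try the REAPER swap before it lands in Merlin, here is a minimal sketch. It assumes the `reaper` binary is on PATH, that `-a` writes an EST-format ASCII file (header ending in `EST_Header_End`, then `time voiced f0` rows), that `-e 0.005` sets a 5 ms frame shift to match Merlin's frame rate, and that Merlin's .lf0 files are headerless float32 log-F0 with SPTK's magic value -1.0e+10 marking unvoiced frames -- all of these should be checked against your REAPER and Merlin versions:

```python
import subprocess
import numpy as np

UNVOICED = -1.0e+10  # SPTK/Merlin magic value for unvoiced frames (assumed)

def reaper_to_lf0(wav_path, lf0_path, f0_txt="tmp.f0"):
    # Run REAPER with ASCII output and a 5 ms frame interval.
    subprocess.check_call(["reaper", "-i", wav_path, "-f", f0_txt,
                           "-a", "-e", "0.005"])
    lf0 = []
    header_done = False
    with open(f0_txt) as fid:
        for line in fid:
            if not header_done:
                header_done = (line.strip() == "EST_Header_End")
                continue
            parts = line.split()
            if len(parts) != 3:
                continue  # skip anything that is not a "time voiced f0" row
            _, voiced, f0 = parts
            if voiced == "1" and float(f0) > 0:
                lf0.append(np.log(float(f0)))
            else:
                lf0.append(UNVOICED)
    np.asarray(lf0, dtype=np.float32).tofile(lf0_path)
```

The frame count of the resulting .lf0 should match the mgc/bap frame count for the same utterance; small off-by-a-few mismatches usually need trimming or padding.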

dreamk73 commented 7 years ago

I assume these two voices are female, given the F0 RMSE values. I would also suggest trying another F0 extractor to get lower VUV errors. I have used REAPER, but SWIPE gets even lower errors for me (https://github.com/kylebgorman/swipe).

But the fact that the second voice does not sound like speech suggests something else is going on. Are both voices at the same sample rate? How good are your alignments? How did you create your label files? Do both voices use the same phone set?


chazo1994 commented 7 years ago

Thanks @ronanki @dreamk73. I don't know how to integrate SWIPE or REAPER into Merlin; I will try that in the future. The main problem is the quality of voice2. I tested the WORLD vocoder with voice2 and it is OK, but when I replace the mgc file with the one generated by the voice2 model, the result is very bad (just like the wav generated by the voice2 model). This does not happen with the other files, such as lf0 or bap.

About the alignment:

I used the state_align scripts in "misc/scripts/alignment/state_align" with the label_no_align files (my unaligned labels work fine with HTS). Both voices use the same config parameters:

[Outputs]
mgc: 60
dmgc: 180
bap: 1
dbap: 3
lf0: 1
dlf0: 3

[Waveform]
test_synth_dir: None
vocoder_type: WORLD
samplerate: 16000
framelength: 1024
fw_alpha: 0.58
minimum_phase_order: 511
use_cep_ap: True

Both voices are at a 16 kHz sample rate.

Re "Also, please check training errors and at which epoch, did it converge?" -- my logs are below.

My log from training the duration model of voice2:

2017-09-22 23:08:49,199 INFO main.train_DNN: epoch 18, validation error 3.824286, train error 3.713300 time spent 364.45
2017-09-22 23:08:49,199 DEBUG main.train_DNN: validation loss increased
2017-09-22 23:08:49,199 DEBUG main.train_DNN: training params -- learning rate: 0.000004, early_stop: 5/5
2017-09-22 23:14:57,923 DEBUG main.train_DNN: calculating validation loss
2017-09-22 23:15:05,203 INFO main.train_DNN: epoch 19, validation error 3.825058, train error 3.711679 time spent 376.00
2017-09-22 23:15:05,203 DEBUG main.train_DNN: validation loss increased
2017-09-22 23:15:05,203 DEBUG main.train_DNN: stopping early
2017-09-22 23:15:05,203 INFO main.train_DNN: overall training time: 120.58m validation error 3.822408

My log from training the acoustic model of voice2:

2017-09-23 18:18:48,595 INFO main.train_DNN: epoch 15, validation error 188.635559, train error 146.233063 time spent 4154.11
2017-09-23 18:18:48,596 DEBUG main.train_DNN: validation loss increased
2017-09-23 18:18:48,598 DEBUG main.train_DNN: training params -- learning rate: 0.000031, early_stop: 13/5
2017-09-23 19:26:30,294 DEBUG main.train_DNN: calculating validation loss
2017-09-23 19:29:13,999 INFO main.train_DNN: epoch 16, validation error 188.372284, train error 146.461548 time spent 4225.40
2017-09-23 19:29:14,001 DEBUG main.train_DNN: stopping early
2017-09-23 19:29:14,002 INFO main.train_DNN: overall training time: 1164.15m validation error 175.042419

What is wrong here? Both models stopped early!

Note: my language is a tonal language with 6 tones.

bajibabu commented 7 years ago

There is nothing wrong with stopping early: the model didn't improve after a few epochs, so it decided it was better to stop. Your model trained for more than 15 epochs, so in my experience that is not the problem.

It looks hard to pin down the cause of your problem from here. I would much appreciate it if you could provide 1 or 2 wav files and the corresponding label files for both voices.

chazo1994 commented 7 years ago

@bajibabu: in the acoustic model, early_stop reached 13/5, which is not OK. Training actually cannot stop before 15 epochs -- it always reaches epoch 15, because run_merlin.py has a conditional statement that does not allow stopping before epoch 15.
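For readers following along, the behaviour being described is a patience counter combined with a minimum-epoch floor. Below is a sketch of that pattern with illustrative names, paraphrased rather than copied from run_merlin.py -- check your own checkout for the real condition:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=25, min_epochs=15, patience=5):
    """Patience-based early stopping with a minimum-epoch floor.

    With min_epochs=15, the patience counter can blow well past its limit
    (e.g. the "early_stop: 13/5" in the log above) while training keeps
    running, because stopping is only allowed after epoch 15.
    """
    best_loss = float("inf")
    early_stop = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss >= best_loss:
            early_stop += 1   # "validation loss increased"
        else:
            best_loss = val_loss
            early_stop = 0
        if epoch > min_epochs and early_stop > patience:
            return epoch      # "stopping early"
    return max_epochs
```

On this reading, voice2's acoustic model hitting 13/5 by epoch 15 means its validation loss stopped improving almost immediately, which fits a broken input rather than slow convergence.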

chazo1994 commented 7 years ago

@bajibabu here are my wav files and the corresponding label files for both voices: https://drive.google.com/file/d/0B3e1eh4_fbjQTk1NbkVwcWRKdGc/view (details in readme.txt). Model voice1 was trained from 8565 utterances; model voice2 was trained from 5200 utterances.

In each of the voice1 and voice2 folders, the gen_wav_from_same_text folder contains the wav and lab files generated by that model from the same input text.

dreamk73 commented 7 years ago

I don't understand what you mean when you say you replace mgc files from voice 1 with voice 2. The lf0, bap, and label files will be different, so you can't just mix and match files from two different speakers like that.

I am surprised voice 1 gets a better result than voice 2. When I listen to the audio from voice 1, it sounds like it was recorded with a very low-quality microphone, with the speaker very far away from it. The recordings from voice 2 sound much clearer. How were these recordings obtained? But when I listen to the generated wavs, I do hear that voice 2 has no clear phoneme identities, so somehow your label files / parameter files for voice 2 do not describe the data well.

How were the recordings segmented? Did you use HTK to obtain phoneme boundaries, and did you check how accurate they were? I would start with a very minimal question file containing just the five phoneme contexts, train models, and see whether you can understand the sounds they generate. If something is wrong there, then you know it has to do with your phoneme set or phoneme boundaries. If not, you can add more features back in until you reach the point where the speech turns unintelligible.

chazo1994 commented 7 years ago

@dreamk73 Thanks so much. As I said, I replace the mgc file of voice 2. That is: to test the WORLD vocoder, I extract the speech features of an original voice file with WORLD (f0, sp, ap, converted to mgc, bap, lf0 with SPTK) and synthesise speech from those features (with WORLD and SPTK). Then I swap in a file generated by the DNN model and synthesise again to compare the difference. The result shows that the model-generated lf0 and bap files still produce a good voice, but the model-generated mgc file produces a bad voice.
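One cheap sanity check on that swap, as a sketch: Merlin/SPTK .mgc files are headerless float32 with one fixed-dimension vector per frame (60 here, taken from the config earlier in the thread; the file paths below are hypothetical), so the analysis mgc and the DNN-generated mgc can be compared for frame count and value range before synthesising:

```python
import numpy as np

def load_mgc(path, dim=60):
    # Headerless float32, one dim-sized vector per frame (assumed layout).
    data = np.fromfile(path, dtype=np.float32)
    assert data.size % dim == 0, "size not a multiple of dim: wrong dim?"
    return data.reshape(-1, dim)

natural = load_mgc("copy_synth/utt001.mgc")  # from WORLD+SPTK analysis
generated = load_mgc("gen/utt001.mgc")       # from the voice2 DNN

for name, mgc in [("natural", natural), ("generated", generated)]:
    print(name, "frames:", mgc.shape[0],
          "c0 range: %.2f..%.2f" % (mgc[:, 0].min(), mgc[:, 0].max()),
          "NaNs:", int(np.isnan(mgc).any()))
```

A big frame-count mismatch, a collapsed c0 range, or NaNs in the generated file would all point at the model or label side rather than the vocoder.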

With voice2, my unaligned labels (label_no_align) are OK, because I trained with them in HTS before, but I'm not sure about the aligned label files. Yes, I used HTK to obtain phoneme boundaries; Merlin actually has a script to align the label files, and it is based on HTK.

"If there is something wrong there, then you know it has to do with your phoneme set or phoneme boundaries" => ok I understand, I will try to generate phoneme boundaries by other way base on HTK.

"If not, you can add some more features back in until you get to the point where it turns to unintelligible speech" => I don't understand.

dreamk73 commented 7 years ago

OK, so if you just do copy synthesis with the WORLD vocoder, does it sound OK? That is, if you extract the acoustic feature files and then synthesise from them again, is the result OK?

What I meant to say about debugging is: (1) inspect your label files with a waveform editor to see how the boundaries line up with the audio -- do they make sense? And (2) create a very minimal question file with just the phoneme questions C-, LL-, L-, RR-, R-. If the generated wavs have clear phonemes and training doesn't stop too soon, then your acoustic features are OK but something else is wrong in your question file. If the generated wavs have no clear phones, there is something wrong with your acoustic features or label files.
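A minimal question file of that kind can be generated mechanically. The sketch below assumes the standard HTS quinphone label delimiters (`p1^p2-p3+p4=p5@...`) and a hypothetical phone list; the wildcard patterns must match exactly how your label files are written, and the delimiter after the RR phone (`@` here) in particular varies between label formats:

```python
# Write quinphone identity questions only: LL-, L-, C-, R-, RR- per phone.
phones = ["a", "i", "u", "k", "t", "sil"]  # hypothetical phone set

with open("questions_minimal.hed", "w") as fid:
    for p in phones:
        fid.write('QS "LL-%s" {%s^*}\n' % (p, p))
        fid.write('QS "L-%s"  {*^%s-*}\n' % (p, p))
        fid.write('QS "C-%s"  {*-%s+*}\n' % (p, p))
        fid.write('QS "R-%s"  {*+%s=*}\n' % (p, p))
        fid.write('QS "RR-%s" {*=%s@*}\n' % (p, p))
```

Point your Merlin config's question file setting at the generated file and retrain.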

chazo1994 commented 7 years ago

@dreamk73 OK, thank you so much. I will check the label alignment and the question file to figure out what is wrong. Re "if you just do copy synthesis with the WORLD vocoder, does it sound ok?" => when I do that, it's OK.

chazo1994 commented 6 years ago

@dreamk73 Do you know how to replace the F0 estimator of the WORLD vocoder with REAPER? I tried to do it but got a "core dump" exception in the synthesis phase.

dreamk73 commented 6 years ago

I have no idea why it is not working for you. The latest version of the Merlin scripts allows you to extract F0 using REAPER, so it should just work.

Jackiexiao commented 6 years ago
  1. A wrong or inaccurate alignment can cause higher MCD. As dreamk73 said, inspect your label files with a waveform editor to see how the boundaries line up with the audio. Do they make sense?
  2. Check the question file by inspecting acoustic_model/inter_module/label_norm_HTS_xxx.dat and looking at the min/max values of the vector (see the sketch below).
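A sketch for inspecting that file, assuming (as in Merlin's min-max label normalisation) it stores the min vector followed by the max vector as headerless float32 -- verify the layout against your Merlin version, and substitute your own suffix for the literal "xxx", which is left as written in the comment above:

```python
import numpy as np

def inspect_label_norm(path):
    data = np.fromfile(path, dtype=np.float32)
    assert data.size % 2 == 0, "expected min and max vectors of equal length"
    half = data.size // 2
    min_vec, max_vec = data[:half], data[half:]
    # Dimensions where min == max never vary in the training data:
    # typically a question that never fires for this phone set.
    dead = np.flatnonzero(min_vec == max_vec)
    print("label dim:", half)
    print("constant dims:", dead.tolist())
    return min_vec, max_vec

# Path as given in the comment above ("xxx" stands for your label dimension):
inspect_label_norm("acoustic_model/inter_module/label_norm_HTS_xxx.dat")
```

Many constant dimensions usually mean the question patterns in the .hed file do not match the label format, which would also explain unintelligible output.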