CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/

Language-dependent components in Merlin #98

Open ajinkyakulkarni14 opened 7 years ago

ajinkyakulkarni14 commented 7 years ago

I trained Merlin on Spanish, but the results are still unintelligible, even though the prosody of the generated speech is well structured. In the data preparation process, for phone alignment, I changed the language option to 'es' (FESTVOX tool).

Are there any other files that need to be changed (like the question .hed file), or any other changes to consider? Any details will be helpful! Thanks :)

dreamk73 commented 7 years ago

How do you predict the linguistic features referenced in the question file? Do you have a Spanish voice in Festival? Because my guess is that you need Spanish-specific linguistic features to describe the overall structure of the sentences in addition to having language-specific phonemes. What does it sound like at the moment? How many sentences did you use for training? Did you use phone-align or state-align?

ajinkyakulkarni14 commented 7 years ago

Currently, I am preparing the Spanish-specific feature file and language-specific phonemes. So far, I have trained the system with 200 speech samples using different recipes such as BLSTM and DNN, but the results are still not improving. The generated speech is not clear even though the intonation is good. I used both phone-align and state-align for training.

I am still trying to figure out how to modify the question file for Spanish, or whatever other steps I should change for the language.

Thanks :)

dreamk73 commented 7 years ago

200 sentences is not a lot for training. But with 200 sentences and a very simple question file (say, only quinphone phoneme-identity questions and the binary phoneme features), the generated sounds should be recognizable and clear, even if the prosody doesn't make sense.
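For a concrete picture, such a stripped-down question file would contain little more than quinphone identity questions in HTS .hed format. Below is a minimal sketch of generating one in Python; the phone set, question names, and output file name are illustrative assumptions, not Merlin's shipped resources.

```python
# Sketch: write quinphone identity questions in HTS .hed format.
# The phone set and output path are assumptions for illustration.
phones = ["a", "e", "i", "o", "u", "p", "t", "k", "f", "r", "s", "sil"]

# (name prefix, pattern) for the five quinphone positions, matching
# labels of the form p1^p2-p3+p4=p5@...
positions = [
    ("LL", "{0}^*"),
    ("L",  "*^{0}-*"),
    ("C",  "*-{0}+*"),
    ("R",  "*+{0}=*"),
    ("RR", "*={0}@*"),
]

with open("questions-es.hed", "w") as out:
    for prefix, pattern in positions:
        for p in phones:
            out.write('QS "%s-%s" {%s}\n' % (prefix, p, pattern.format(p)))
```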

ajinkyakulkarni14 commented 7 years ago

Yes, but I trained slt_arctic_demo, which has 60 training samples, and the generated speech is clear. So I want to know which files I should replace apart from the question file!

dreamk73 commented 7 years ago

Ok, I am just trying to help you out. What does it sound like? Did you use the existing scripts to get your label_state_align files? Are the state begin and end times represented in 5 ms increments? That was a problem with our own voice recently: we created label files but forgot to round the times to 5 ms. As soon as we fixed that, the speech sounded clear.
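For reference, times in these HTK-style label files are in units of 100 ns, so the 5 ms frame shift corresponds to steps of 50,000. A minimal rounding sketch (file names are placeholders):

```python
# Round HTK-style label start/end times to the nearest 5 ms.
# 5 ms = 50,000 units of 100 ns; in.lab / out.lab are placeholders.
STEP = 50000

with open("in.lab") as fin, open("out.lab", "w") as fout:
    for line in fin:
        start, end, context = line.split(None, 2)
        start = int(round(int(start) / float(STEP))) * STEP
        end = int(round(int(end) / float(STEP))) * STEP
        fout.write("%d %d %s\n" % (start, end, context.strip()))
```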

ajinkyakulkarni14 commented 7 years ago

Thank you for your help :D :) Yes, I used the existing scripts.

The phone alignment .lab file content is given below:

0 2600000 x^x-sil+sil=f@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=1|0/I:0=0/J:15+8-1
2600000 3200000 sil^sil-f+r=a@1_4/A:0_0_0/B:0-0-4@1-2&1-15#1-1$1-1!0-0;0-0|0/C:0+0+3/D:0_0/E:content+2@1+8&1+7#0+1/F:content_2/G:0_0/H:15=8@1=1|NONE/I:0=0/J:15+8-1

Is this correct?

In the state alignment .lab files there are no timestamps; only the phone alignment files have them. I am attaching sample files here: Sample_files_merlin.zip
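As a quick sanity check on such files before retraining, something like the sketch below flags times that are off the 5 ms grid or segments that don't line up (the file name is a placeholder):

```python
# Sanity-check a phone-align label file: times on the 5 ms grid
# (50,000 units of 100 ns) and contiguous segments.
STEP = 50000
prev_end = 0
with open("sample.lab") as f:
    for i, line in enumerate(f, 1):
        start, end, context = line.split()
        start, end = int(start), int(end)
        if start % STEP or end % STEP:
            print("line %d: times are not multiples of 5 ms" % i)
        if start != prev_end:
            print("line %d: gap or overlap at %d" % (i, start))
        prev_end = end
```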

dreamk73 commented 7 years ago

Your phone alignment file looks OK as far as I can tell. When you set your acoustic conf file to phone_align, does the output still not sound clear? Did you set subphone_features to None when using phoneme labels only?

The state alignment file needs to look like the example below: for each phoneme you need 5 states. The forced alignment scripts will create these for you if you follow their setup (see merlin/scripts/alignment/state_align):

0 500000 x^x-sil+sil=f@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=1|0/I:0=0/J:15+8-1[2]
500000 1000000 x^x-sil+sil=f@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=1|0/I:0=0/J:15+8-1[3]
1000000 1500000 x^x-sil+sil=f@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=1|0/I:0=0/J:15+8-1[4]
1500000 2000000 x^x-sil+sil=f@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=1|0/I:0=0/J:15+8-1[5]
2000000 2600000 x^x-sil+sil=f@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:0_0/E:x+x@x+x&x+x#x+x/F:0_0/G:0_0/H:x=x@1=1|0/I:0=0/J:15+8-1[6]
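If you only have phone-level times, a rough way to produce files in that shape is to split each phone into five equal-duration states with the [2]..[6] suffixes, as sketched below. This is only useful for checking the file layout; real state boundaries should come from the forced-alignment scripts.

```python
# Rough sketch: expand a phone-align label into 5 states per phone.
# Equal-duration states are a placeholder; real times should come from
# forced alignment (merlin/scripts/alignment/state_align).
with open("phone.lab") as fin, open("state.lab", "w") as fout:
    for line in fin:
        start, end, context = line.split()
        start, end = int(start), int(end)
        dur = (end - start) // 5
        for s in range(5):
            s_start = start + s * dur
            s_end = end if s == 4 else start + (s + 1) * dur
            fout.write("%d %d %s[%d]\n" % (s_start, s_end, context, s + 2))
```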

ajinkyakulkarni14 commented 7 years ago

Ohh! I didn't set subphone_feats to None. I will now fix the state label files, retrain, and then look at the changes in the outputs! I will post the results. Thanks a lot for your helpful comments! :)

ronanki commented 7 years ago

You can configure this option in global_settings.cfg and then re-generate your conf files, which automatically sets all the options.

When you are using phone_align labels, you can set subphone_feats to coarse_coding. I recommend changing global_settings.cfg, regenerating the conf files, and then checking the changes in them.
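If you prefer to patch an already-generated conf file directly instead, here is a minimal sketch (the file path is a placeholder; in Merlin's generated conf files this key sits in the [Labels] section):

```python
# Patch subphone_feats in a generated Merlin conf file.
# The path is a placeholder -- point it at your own conf file.
import configparser

path = "conf/acoustic_myvoice.conf"
conf = configparser.ConfigParser(interpolation=None)
conf.read(path)
# 'coarse_coding' adds soft positional features to phone-level labels;
# use 'none' for bare phone labels.
conf["Labels"]["subphone_feats"] = "coarse_coding"
with open(path, "w") as f:
    conf.write(f)
```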

dhm42 commented 7 years ago

@ajinkyakulkarni14 Can you please tell us if you succeeded in generating clearer speech? I am working on a new language and have the same problem as you. I have phone-aligned label files with the corresponding question file. I used 60 and then 120 sentences, but I still don't see any improvement.

nshmyrev commented 7 years ago

120 utterances is a tiny amount. A good-quality voice requires 4000+ utterances. It should not be a problem to get them; just take an audiobook.

dhm42 commented 7 years ago

@nshmyrev Thank you for your answer. I just want to get a result close to the one obtained with slt_arctic_demo, which uses 50 sentences with phone_align. But the result I get with my 120 sentences is not comparable and not clear at all. You can find some examples here: https://www.dropbox.com/sh/ln43z57d41ff1p0/AACkKwz8HKV1R1o6JOE4oiCua?dl=0

nshmyrev commented 7 years ago

@dhm42 I do not hear speech in your samples; it's just plain buzz with intonation. That means you did something fundamentally wrong, and it does not matter how many utterances you have.

dhm42 commented 7 years ago

I did the following steps:

Is there any step I am missing? Thanks

ronanki commented 7 years ago
  1. Use the copy-synthesis script on a few audio files to see if the extracted features are correct (see the pyworld sketch after this list).
  2. If you are able to re-generate speech with good quality, use the WORLD feature extraction script to extract the vocoder parameters.
  3. At the moment, we are not providing any scripts for training an aligner for other international languages.
  4. But if you have a question file and label files, you can use them with Merlin in the same way as English. Just double-check your alignments using wavesurfer.

Can you provide a couple of audio files and their corresponding labels?
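For point 1, Merlin ships its own copy-synthesis scripts; an equivalent standalone round trip with the pyworld package looks roughly like the sketch below (file names are placeholders). If the re-synthesized audio sounds clean, feature extraction is not the problem.

```python
# WORLD analysis/re-synthesis round trip via pyworld.
# input.wav / out.wav are placeholder file names.
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("input.wav")
f0, sp, ap = pw.wav2world(x.astype(np.float64), fs)  # F0, spectrum, aperiodicity
y = pw.synthesize(f0, sp, ap, fs)
sf.write("out.wav", y, fs)
```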

dhm42 commented 7 years ago

I used copy-synthesis and the generated audio has good quality, as you can hear here: https://www.dropbox.com/sh/m6kh3g74jcvlvpl/AACe6kZRLPr2S_4NF4Sr1k2la?dl=0. You can find examples of audio and the corresponding phone-aligned labels here: https://www.dropbox.com/sh/33fazlv6v1enryg/AACeyhDlIymIEJRTBZKtWmiNa?dl=0

Thank you,

ronanki commented 7 years ago
  1. Your labels are not very accurate -- you probably used too little data even for alignment, which is not a good thing. Use at least 1 hour of data for alignment; later, for synthesis, you can vary the amount of training data. [screenshot: alignment, 2017-03-18 12:50:36]

  2. In addition to what you have done, you need to change the silence pattern in all conf files after generating them from global_settings.cfg: silence_pattern: ['-sil+'] --> silence_pattern: ['-#+'], i.e., replace 'sil' with '#', since your labels use '#' to represent silence (see the sketch after this list). [screenshot: conf file, 2017-03-18 13:02:48]

  3. Fine-tune the network: reduce the learning rate if training converges within 5 epochs, i.e., if the validation error keeps increasing right from the 5th epoch. Also, use an LSTM if the labels are not very accurate, since sequence learning can be beneficial in such cases.

  4. You can run the acoustic model alone and inspect the generated output. Also, note that the demo and full-voice scripts in slt_arctic run all steps in one go, so make sure you comment out the steps that generate the conf files -- otherwise the changes you made will be overwritten with default values.

  5. Finally, I recommend training with 1 hour of data -- not all languages can generate decent output with 50/100 sentences. If the V/UV error is above 10%, you won't be able to hear clear speech when training on a small amount of data.
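A minimal sketch of the silence-pattern change in point 2, applied to every generated conf file (the conf directory is a placeholder, and the exact pattern string should be copied from your own files):

```python
# Rewrite the silence pattern in all generated conf files, since
# these labels use '#' rather than 'sil' to mark silence.
import glob

for path in glob.glob("conf/*.conf"):
    with open(path) as f:
        text = f.read()
    # copy the exact pattern from your conf files if it differs
    new_text = text.replace("silence_pattern: ['-sil+']",
                            "silence_pattern: ['-#+']")
    if new_text != text:
        with open(path, "w") as f:
            f.write(new_text)
```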

ajinkyakulkarni14 commented 7 years ago

@dhm42 and @ronanki I am sharing my setup and results below:

hidden_layer_size: [1024, 1024, 1024, 1024, 1024, 1024]
hidden_layer_type: ['TANH', 'TANH', 'TANH', 'TANH', 'TANH', 'TANH']

# if RNN or sequential training is used, please set sequential_training to True.
sequential_training: False
dropout_rate: 0.0

learning_rate: 0.002
batch_size: 50
output_activation: linear
warmup_epoch: 5
warmup_momentum: 0.3

training_epochs: 250

Questions:

Q1. @dhm42, your results seem promising. What specific language-related changes did you make in the question file?

Q2. @ronanki, there is some background echo/noise in the output. Is it because training has not converged, or can you suggest any improvements?

Thank you! :)

dhm42 commented 7 years ago

@ajinkyakulkarni14 Thank you for your answer. I used a completely different question file for French (matching my label files), taken from HMMs trained with HTS; it contains more than 1800 lines. Keeping the exact same configuration as arctic_demo (50 training, 5 validation and 5 test sentences), I slightly increased the learning rate to 0.003 and saw some improvement, as you can hear here: https://www.dropbox.com/sh/24j3rfrb6j59u4m/AACBKS0jk6LB7XMwJGEMmmbpa?dl=0. Now I am trying to improve the alignment (as suggested by ronanki) to see if I can get better results (even if I didn't see improvement with LSTM). I may also test with a larger number of sentences. I will keep you informed if I get better-quality results.