NTT123 / vietTTS

Vietnamese Text to Speech library
MIT License
201 stars 91 forks

improve tts #9

Closed lethanhson9901 closed 3 years ago

lethanhson9901 commented 3 years ago

Hi, it's me again. I'm training your TTS on my own dataset (about 16 hours). First, because my dataset's utterances are similar to yours, I trained the acoustic model using two approaches:

Here are the full details: https://drive.google.com/drive/folders/1j0OT7KgJOk5hmcOVNPdcdkaekRRxHekk?usp=sharing Second, I trained the HiFi-GAN vocoder (with the acoustic model at 1.46M steps) for about 290k steps. My transcript text: "xin chào tôi là phương anh bản thử số chín"

I'm stuck. Should I focus on the acoustic model, the vocoder, or the dataset to improve the result? Thanks!

NTT123 commented 3 years ago

It seems to me that you are using the wrong lexicon file when generating speech. The default scripts use the lexicon file assets/infore/lexicon.txt from the InfoRe dataset; when working with your own dataset, you should replace it with your own lexicon file.

lethanhson9901 commented 3 years ago

I don't think so; the lexicon file just maps words to characters. I also recorded my audio with the same text as the original dataset.

NTT123 commented 3 years ago

In this function: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/data_loader.py#L11

we use the lexicon file to compute the phoneme set, and use that set to compute each phoneme's index. A mismatch between the phoneme sets at training and inference will cause problems.

We use this function at inference: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/text2mel.py#L33 and at training: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/data_loader.py#L56
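The failure mode can be seen in a minimal sketch (the `load_phoneme_set` helper and the lexicon lines below are illustrative, not the repo's actual code; the real logic lives in vietTTS/nat/data_loader.py):

```python
def load_phoneme_set(lexicon_lines):
    """Collect all phonemes from lexicon entries of the form 'word<TAB>p1 p2 ...'."""
    phonemes = set()
    for line in lexicon_lines:
        word, pronunciation = line.strip().split("\t")
        phonemes.update(pronunciation.split())
    # Sorting makes the phoneme -> index mapping deterministic for a given lexicon.
    return sorted(phonemes)

# Two different lexicons (made-up entries for illustration).
train_lexicon = ["xin\tx i n", "chào\tch a o"]
infer_lexicon = ["xin\tx i n", "tôi\tt ô i"]

train_idx = {p: i for i, p in enumerate(load_phoneme_set(train_lexicon))}
infer_idx = {p: i for i, p in enumerate(load_phoneme_set(infer_lexicon))}

# The same phoneme 'x' gets a different index under each lexicon, so at
# inference the model would be fed IDs it never saw during training.
print(train_idx["x"], infer_idx["x"])
```

This is why the lexicon file passed at inference must be byte-for-byte the one used at training: the index of every phoneme depends on the full set extracted from that file.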

lethanhson9901 commented 3 years ago

So, do I have to train again, or just replace my lexicon file?

NTT123 commented 3 years ago

My advice is to use the lexicon file that was used to train your model. Usually, it is at train_data/lexicon.txt

Generate speech with:

```shell
python3 -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
lethanhson9901 commented 3 years ago

I'm confused, because when training I used your lexicon file (since my audio has the same text as the original). I already followed your instruction, but the result is the same.

NTT123 commented 3 years ago

@Lethanhson9901, can you show a few lines of your train_data/lexicon.txt and an example *.TextGrid file?

I suspect there is a mismatch somewhere, as your loss and mel-spectrogram look alright to me.

lethanhson9901 commented 3 years ago

https://drive.google.com/drive/folders/1l2DstG-l77AGvXpQdtayegOfA9X1yIwC?usp=sharing

NTT123 commented 3 years ago

@Lethanhson9901 Everything seems alright to me. I don't think I can help much in this case.

NTT123 commented 3 years ago

@Lethanhson9901 My advice:

lethanhson9901 commented 3 years ago

I'll try. Many thanks!

lethanhson9901 commented 3 years ago

Hey, thank god, you're right. It works!

lethanhson9901 commented 3 years ago

By the way, if I want the voice to sound more natural, which part of training should I focus on? Or data pre-processing (like denoising, ...)? And should I use audio augmentation in training? (For now I don't think it helps.)

NTT123 commented 3 years ago

@Lethanhson9901 I'm not sure data augmentation can help. There are a few things that can help:

nampdn commented 2 years ago

Hi @NTT123, could you please give me instructions on how to modify the decoder RNNs to use 1024 units for my dataset?

NTT123 commented 2 years ago

@nampdn You will need to modify the file vietTTS/nat/config.py, setting acoustic_decoder_dim=1024.
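For illustration, a minimal sketch of the change (only the acoustic_decoder_dim name comes from this thread; the surrounding class and other fields are assumed placeholders):

```python
# Illustrative excerpt of vietTTS/nat/config.py -- the surrounding structure
# is an assumption; only the acoustic_decoder_dim field is from this thread.
class FLAGS:
    # ... other hyperparameters ...
    acoustic_decoder_dim = 1024  # hidden size of the acoustic decoder RNNs
```

Note that changing the decoder size alters the network architecture, so existing acoustic-model checkpoints will no longer load and the model must be trained from scratch with the new setting.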

nampdn commented 2 years ago

Fantastic, thank you!