It seems to me that you are using the wrong lexicon file when generating speech.
The default scripts use the lexicon file assets/infore/lexicon.txt from the InfoRE dataset; when working with your own dataset, you should replace it with your own lexicon file.
I don't think so; the lexicon file just maps words to characters. I also recorded my audio with the same text as the original dataset.
In this function: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/data_loader.py#L11
we use the lexicon file to compute the phoneme set, and use that set to compute phoneme indices. A mismatched phoneme set between training and inference will cause problems.
We use this function at inference: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/text2mel.py#L33 and at training: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/data_loader.py#L56
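For illustration, here is a minimal sketch of how a phoneme set and phoneme-to-index mapping can be derived from a lexicon file (hypothetical code, not the repo's actual implementation; it only assumes lexicon lines of the form "word p h o n e m e s"):

```python
# Hypothetical sketch (not the repo's actual code): build the phoneme set and a
# phoneme -> index mapping from a lexicon file whose lines look like
# "word p h o n e m e s".
def load_phoneme_vocab(lexicon_path: str) -> dict:
    phonemes = set()
    with open(lexicon_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                phonemes.update(parts[1:])  # everything after the word is a phoneme
    # Sorting makes index assignment deterministic for a *given* phoneme set,
    # but a different lexicon yields a different set and therefore different indices.
    return {p: i for i, p in enumerate(sorted(phonemes))}
```

If inference reads assets/infore/lexicon.txt while the model was trained with train_data/lexicon.txt, the integer indices fed to the model no longer refer to the same phonemes, which is why the generated speech comes out garbled.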
So, do I have to train again, or just replace my lexicon file?
My advice is to use the lexicon file that was used to train your model. Usually, it is at train_data/lexicon.txt.
Generate speech with:
```
python3 -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
I'm confused, because when training I used your lexicon file (since my audio uses the same text as the original). I already changed it following your instruction, but the result is the same.
@Lethanhson9901, can you show a few lines of your train_data/lexicon.txt and an example *.textgrid file?
I suspect that there is a mismatch somewhere, as your loss and mel-spectrogram seem alright to me.
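One quick way to check for such a mismatch (a rough sketch, reusing the hypothetical load_phoneme_vocab helper above) is to diff the phoneme sets of the two lexicon files:

```python
# Rough sketch: report phonemes that appear in one lexicon file but not the other.
train_vocab = load_phoneme_vocab("train_data/lexicon.txt")
infer_vocab = load_phoneme_vocab("assets/infore/lexicon.txt")

print("only in training lexicon:", sorted(set(train_vocab) - set(infer_vocab)))
print("only in inference lexicon:", sorted(set(infer_vocab) - set(train_vocab)))
```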
@Lethanhson9901 Everything seems alright to me. I don't think I can help much in this case.
@Lethanhson9901 My advice:
1. Train the duration model (10 minutes) and the acoustic model (50 minutes) on the InfoRE dataset and generate speech, to make sure everything is working correctly (the speech won't be good, but it should be understandable).
2. Train the duration model (10 minutes) and the acoustic model (50 minutes) on your dataset and generate speech (using the pretrained HiFi-GAN). The speech should be understandable.
I'll try. Many thanks!
Hey, thank god, you're right. And it works!
By the way, if I want the voice to be more natural and better, which part of training should I focus on? Or on pre-processing the data (e.g., denoising)? And should I use audio augmentation in training? (For now, I don't think it helps.)
@Lethanhson9901 I'm not sure data augmentation can help. There are a few things that can help:
Hi @NTT123, could you please give me instructions on how to modify the decoder RNNs to use 1024 units for my dataset?
@nampdn You will need to modify the file vietTTS/nat/config.py
by setting acoustic_decoder_dim=1024
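For example, something along these lines (a sketch only; the actual layout of config.py and the surrounding class/field names are assumptions, except for acoustic_decoder_dim, which is taken from the advice above):

```python
# Sketch of the change in vietTTS/nat/config.py; the class name and other
# hyperparameters are illustrative assumptions and are omitted here.
class FLAGS:
    # ... other hyperparameters unchanged ...
    acoustic_decoder_dim = 1024  # decoder RNN with 1024 units
```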
Fantastic, thank you!
Hi, me again. I'm training your TTS. My dataset is about 16 hours. First, because my dataset's utterances are similar to yours, I trained the acoustic model using two approaches:
1. Continue training from your acoustic checkpoint to 1.46M steps: val loss 0.227, and it is about to converge.
2. Train from scratch: about 800k steps, val loss 0.301.
Full details here: https://drive.google.com/drive/folders/1j0OT7KgJOk5hmcOVNPdcdkaekRRxHekk?usp=sharing
Second, I trained the HiFi-GAN vocoder (with the 1.46M-step acoustic model) for about 290k steps. My transcript text: "xin chào tôi là phương anh bản thử số chín"
I got this: https://drive.google.com/file/d/1UtgE1gTC8mwo1SV1b7chauvWPC7uPjxM/view?usp=sharing => the speaker talks nonsense, but the intonation is quite good.
Here is the 50k vocoder + 1.46M acoustic, just to compare: https://drive.google.com/file/d/1InQ8ykYC_P7qaKhv_58SmTC0r-b_4_0h/view?usp=sharing
And the 50k vocoder + the 800k-from-scratch acoustic: https://drive.google.com/file/d/1E-FjOfBqFf9vHTKXmAUhamtB2FsAlAMT/view?usp=sharing
I'm stuck: should I focus on the acoustic model, the vocoder, or the dataset to improve the result? Thanks!