It seems to me that you are using the wrong lexicon file when generating speech.
The default scripts use the lexicon file assets/infore/lexicon.txt from the InfoRE dataset; when working with your own dataset, you should replace it with your own lexicon file.
I don't think so; the lexicon file just maps words to characters. I also recorded my audio with the same text as the original dataset.
In this function: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/data_loader.py#L11
we use the lexicon file to compute the phoneme set, and use that set to compute phoneme indices. A mismatched phoneme set between training and inference will cause problems.
We use this function at inference: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/text2mel.py#L33 and at training: https://github.com/NTT123/vietTTS/blob/346d46798ab066014b64d971be0735f3f3019703/vietTTS/nat/data_loader.py#L56
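For illustration, here is a minimal sketch of how a phoneme set and phoneme-to-index mapping can be derived from a lexicon file (hypothetical code, not the repo's actual implementation; it only assumes lexicon lines of the form "word p h o n e m e s"):

```python
# Hypothetical sketch (not the repo's actual code): build the phoneme set and a
# phoneme -> index mapping from a lexicon file whose lines look like
# "word p h o n e m e s".
def load_phoneme_vocab(lexicon_path: str) -> dict:
    phonemes = set()
    with open(lexicon_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                phonemes.update(parts[1:])  # everything after the word is a phoneme
    # Sorting makes index assignment deterministic for a *given* phoneme set,
    # but a different lexicon yields a different set and therefore different indices.
    return {p: i for i, p in enumerate(sorted(phonemes))}
```

If inference reads assets/infore/lexicon.txt while the model was trained with train_data/lexicon.txt, the integer indices fed to the model no longer refer to the same phonemes, which is why the generated speech comes out garbled.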
So, do I have to train again, or just replace my lexicon file?
My advice is to use the lexicon file that was used to train your model. Usually, it is at train_data/lexicon.txt.
Generate speech with:
```
python3 -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
I'm confused, because when training I used your lexicon file (since my audio uses the same text as the original). I already changed it following your instruction, but the result is the same.
@Lethanhson9901, can you show a few lines of your train_data/lexicon.txt and an example *.textgrid file?
I suspect that there is a mismatch somewhere, as your loss and mel-spectrogram seem alright to me.
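One quick way to check for such a mismatch (a rough sketch, reusing the hypothetical load_phoneme_vocab helper above) is to diff the phoneme sets of the two lexicon files:

```python
# Rough sketch: report phonemes that appear in one lexicon file but not the other.
train_vocab = load_phoneme_vocab("train_data/lexicon.txt")
infer_vocab = load_phoneme_vocab("assets/infore/lexicon.txt")

print("only in training lexicon:", sorted(set(train_vocab) - set(infer_vocab)))
print("only in inference lexicon:", sorted(set(infer_vocab) - set(train_vocab)))
```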
@Lethanhson9901 Everything seems alright to me. I don't think I can help much in this case.
@Lethanhson9901 My advice:
1. Train the duration model (10 minutes) and the acoustic model (50 minutes) on the InfoRE dataset and generate speech, to make sure everything is working correctly (the speech won't be good, but it should be understandable).
2. Train the duration model (10 minutes) and the acoustic model (50 minutes) on your dataset and generate speech (using the pretrained HiFi-GAN). The speech should be understandable.
I'll try. Many thanks!
Hey, thank god, you're right. And it works!
By the way, if I want the voice to be more natural and better, which part of training should I focus on? Or on pre-processing the data (e.g., denoising)? And should I use audio augmentation in training? (For now, I don't think it helps.)
@Lethanhson9901 I'm not sure data augmentation can help. There are a few things that can help:
Hi @NTT123, could you please give me instructions on how to modify the decoder RNNs to use 1024 units for my dataset?
@nampdn You will need to modify the file vietTTS/nat/config.py
by setting acoustic_decoder_dim=1024
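For example, something along these lines (a sketch only; the actual layout of config.py and the surrounding class/field names are assumptions, except for acoustic_decoder_dim, which is taken from the advice above):

```python
# Sketch of the change in vietTTS/nat/config.py; the class name and other
# hyperparameters are illustrative assumptions and are omitted here.
class FLAGS:
    # ... other hyperparameters unchanged ...
    acoustic_decoder_dim = 1024  # decoder RNN with 1024 units
```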
Fantastic, thank you!
Hi, me again. I'm training your TTS. My dataset is about 16 hours. First, because my dataset's utterances are similar to yours, I trained the acoustic model using two approaches:
1. Continue training from your acoustic checkpoint to 1.46M steps: val loss 0.227, and it is about to converge.
2. Train from scratch: about 800k steps, val loss 0.301.
Full details here: https://drive.google.com/drive/folders/1j0OT7KgJOk5hmcOVNPdcdkaekRRxHekk?usp=sharing
Second, I trained the HiFi-GAN vocoder (with the 1.46M-step acoustic model) for about 290k steps. My transcript text: "xin chào tôi là phương anh bản thử số chín"
I got this: https://drive.google.com/file/d/1UtgE1gTC8mwo1SV1b7chauvWPC7uPjxM/view?usp=sharing => the speaker talks nonsense, but the intonation is quite good.
Here is the 50k vocoder + 1.46M acoustic, just to compare: https://drive.google.com/file/d/1InQ8ykYC_P7qaKhv_58SmTC0r-b_4_0h/view?usp=sharing
And the 50k vocoder + the 800k-from-scratch acoustic: https://drive.google.com/file/d/1E-FjOfBqFf9vHTKXmAUhamtB2FsAlAMT/view?usp=sharing
I'm stuck: should I focus on the acoustic model, the vocoder, or the dataset to improve the result? Thanks!