Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License

Set parameters for training on a two-language dataset #7

Closed YihWenWang closed 4 years ago

YihWenWang commented 4 years ago

Hello. If I train on just two kinds of datasets, how do I set parameters such as generator_dim, generator_bottleneck_dim, etc. in generator_switching.json?

Tomiinek commented 4 years ago

Hi! :smile: I am not sure I understand your question :worried: What do you mean by "two kinds of datasets"? You can change the parameters arbitrarily or create another file with your own parameters. The dataset is specified by the dataset parameter (with values such as css_comvoi, css10, ljspeech).

YihWenWang commented 4 years ago

Thanks. The situation: I trained on a dataset containing two languages (English and Chinese) and didn't change any parameters except "languages" in generator_switching.json, but I got bad results after training to epoch 120. When I synthesized a sentence containing both languages with an assigned speaker, the result contained the voices of two speakers instead of one. I have no idea what causes this.

Tomiinek commented 4 years ago

Oh, I see.

First, training on only two languages does not make full use of the model's capabilities. The more languages you have, the more information that can be shared across languages is present.

Second, you should have more speakers for each language (or very similar voices for both mono-speaker languages). The model has to learn the difference between language-dependent and speaker-dependent information. There is the multi-speaker VCTK dataset for English (a subset with a few speakers should be sufficient) and there are some Chinese voices in my cleaned Common Voice data (see the readme). You do not need many examples per speaker (50 transcript-recording pairs per speaker can be enough), but you should have more speakers with diverse voices (such as 10 or 20). If this is your case, just add this multi-speaker data to your current dataset and the results should improve.

Third, you should definitely reduce generator_dim to something like 2-4, and generator_bottleneck_dim should be lower than that, e.g., 1 or 2. Also, speaker_embedding_dimension should roughly correspond to the number of speakers you have, so if you have about 20 speakers, use 16 or 32.

Finally, there is reversal_classifier_w, which controls the weight of the adversarial speaker classifier's loss. This parameter is really tricky: high values prevent the model from converging, while low values have no effect. However, you should first try to make your data multi-speaker.
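
For reference, a minimal sketch of how these overrides might be collected in one place. The parameter names come from this thread; the file paths, language codes, and the assumption that the JSON config is a flat name-to-value mapping are only illustrative, not taken from the repository.

import json

# Hypothetical two-language overrides based on the suggestions above;
# all other parameters are assumed to keep their defaults.
overrides = {
    "languages": ["en", "zh"],           # illustrative language codes
    "generator_dim": 4,                  # 2-4 for a small number of languages
    "generator_bottleneck_dim": 2,       # lower than generator_dim
    "speaker_embedding_dimension": 16,   # roughly matches ~20 speakers
    "reversal_classifier_w": 0.125,      # tune carefully, see the note above
}

# Merge the overrides into an existing config file (the paths are assumptions).
with open("params/generator_switching.json") as f:
    params = json.load(f)
params.update(overrides)

with open("params/generator_switching_2lang.json", "w") as f:
    json.dump(params, f, indent=2, ensure_ascii=False)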

YihWenWang commented 4 years ago

Okay, thank you very much. I will try to train on a dataset with more speakers per language and adjust the parameters.

YihWenWang commented 4 years ago

Hello, I use the VCTK dataset (108 speakers) for English and the THCHS-30 dataset (60 speakers) for Chinese. My generated_switching.json settings: generator_dim = 4, generator_bottleneck_dim = 2, speaker_embedding_dimension = 64, reversal_classifier_w = 0.125. With the trained model I can synthesize two languages in one sentence with a single speaker. But there is a problem: if I synthesize a sentence such as "recommend the some 社會書。", the voice in the second half of the sentence becomes quieter. I have no idea what causes this.

Tomiinek commented 4 years ago

你好 :grin:

Do I understand correctly that the voice stays the same throughout the whole sentence, but the volume changes? This might be caused by the recordings (from two different datasets) being normalized in different ways. Do the corresponding spectrograms have similar magnitudes?

You can try to normalize your audio files and repeat training with the new data. For example, you can run this command to normalize every .wav in your-directory to the same volume level:

find "your-directory" -name '*.wav' | while read f; do
    sox "${f}" tmp.wav gain -n -3 && mv tmp.wav "${f}"
done

Hope it helps :innocent: 再见

YihWenWang commented 4 years ago

Thanks for your suggestion, I will try it. But I still have a question: if the sample rate of the dataset is 16000 Hz, which parameters should I modify besides the sample rate?

Tomiinek commented 4 years ago

You do not have to change stft_window_ms or stft_shift_ms, because these values are in milliseconds. However, you can reduce num_fft to a lower value such as 1024, because the window length in samples (stft_window_ms / 1000 * sample_rate) comes out to only around 800.
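
As a quick sanity check, here is a tiny sketch of that calculation. stft_window_ms = 50 is an illustrative value, not a confirmed default of the repository.

import math

sample_rate = 16000
stft_window_ms = 50  # illustrative; use whatever your params file specifies

window_samples = int(stft_window_ms / 1000 * sample_rate)   # 800 samples at 16 kHz
num_fft = 2 ** math.ceil(math.log2(window_samples))         # next power of two -> 1024
print(window_samples, num_fft)                              # 800 1024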

YihWenWang commented 4 years ago

Sorry, I want to ask another question. How do I get the mel spectrogram from a .npy file? I want to use WaveGlow to synthesize the waveform.

Tomiinek commented 4 years ago

Hm, spectrograms are stored in two-dimensional numpy arrays and saved into .npy files. Just use numpy.load to load them back into memory.

If you want to train the Waveglow model on spectrograms produced by your Tacotron, use the gta.py script which can produce ground-truth-aligned spectrograms (GTA) given your model and original spectrograms.
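
A minimal sketch of loading a saved spectrogram and running it through a trained WaveGlow model. The checkpoint handling below assumes NVIDIA's reference WaveGlow format, and mel.npy / waveglow.pt are placeholder paths; whether the spectrogram scaling matches what your WaveGlow was trained on is a separate question.

import numpy as np
import torch

# Placeholder paths; adjust to your files.
mel = np.load("mel.npy")                  # 2-D numpy array saved by the training pipeline
print(mel.shape, mel.min(), mel.max())    # sanity-check the shape and value range first

# Assumes NVIDIA's reference WaveGlow checkpoint format ("model" key, infer(mel, sigma)).
waveglow = torch.load("waveglow.pt", map_location="cuda")["model"]
waveglow = waveglow.remove_weightnorm(waveglow).cuda().eval()

# WaveGlow expects shape (batch, n_mels, frames); transpose first if your array is (frames, n_mels).
mel_t = torch.from_numpy(mel).unsqueeze(0).float().cuda()

with torch.no_grad():
    audio = waveglow.infer(mel_t, sigma=0.666).squeeze().cpu().numpy()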

YihWenWang commented 4 years ago

But I don't want to train WaveGlow, I have already trained it. I want to feed the synthesized .npy file to WaveGlow and generate the audio with it. The attached figure shows the mel spectrogram I get when loading the .npy file, but when I try to synthesize audio from it with WaveGlow, I only get noise.

Tomiinek commented 4 years ago

Spectrograms seem to be ok, so I am afraid I cannot help you.

Just a few hints that come to mind and might help you with debugging:

YihWenWang commented 4 years ago

About "language_embedding_dimension" and "generator_dim" , do they have the same meaning ? And then, why language_embedding_dimension is set zero when training the five languages model ?

Tomiinek commented 4 years ago

No, they don't. language_embedding_dimension specifies the dimension of the language embedding concatenated to the decoder input, while generator_dim defines the dimension of the language embedding used in the parameter generator. It is set to zero because the model already has enough information about the language from the encoder.

lightwithshadow commented 4 years ago

@YihWenWang Hello Wang, Can you share your synthesized speech samples? THX!

YihWenWang commented 4 years ago

Hello

Okay, but I don't have my computer at the moment. Sorry, could you wait for me? I will share my synthesized speech tomorrow.

lightwithshadow commented 4 years ago

ok, thx!

YihWenWang commented 4 years ago

Here are my synthesized speech samples. I used the VCTK and STCMDS datasets to train this model. The synthesized text is "Recommend the some 社會書。". Thanks. VCTK_STCMDS.zip

lightwithshadow commented 4 years ago

Thanks

Maxxiey commented 3 years ago

@YihWenWang Samples sound nice, did you train this model based only on VCTK and STCMDS ?

YihWenWang commented 3 years ago

Yes, I only used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.

leijue222 commented 3 years ago

@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too, just like everyday code-switched conversation, for example:

——A:请问你从事什么领域的研究? ——B:我从事Computer Vision方面的研究工作。 (A: What field of research are you in? B: I work on Computer Vision.)

I plan to use the LJSpeech, ST-CMDS, and Biaobei datasets to train it, as in the example above. I am not familiar with the TTS field; could you give me a simple outline of the steps needed to adapt this project?

YihWenWang commented 3 years ago

I used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets. My steps:

  1. Download the datasets and put them into the "data" folder.
  2. Prepare train.txt and val.txt. Each entry must contain the labels for speaker, language, audio, spectrogram, linear spectrogram, and text. Transliteration is not strictly required.
  3. I use the "pypinyin" package to convert the Mandarin text (see the sketch after this list).
  4. In the "data" folder, prepare_css_spectrograms.py must be modified: you have to change the dataset paths.
  5. Check that the "sample rate" parameter matches the sample rate of the audio in the datasets.
  6. Adjust "generator_bottleneck_dim" and "generator_dim" according to the number of languages.
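
For step 3, a small sketch of converting a mixed Mandarin/English sentence with the pypinyin package. The numbered-tone TONE3 style is just one option; the exact text format the training transcripts expect is not fixed here.

from pypinyin import lazy_pinyin, Style

text = "我从事Computer Vision方面的研究工作。"

# Numbered-tone pinyin (e.g. "wo3"); errors="default" leaves non-Chinese tokens unchanged.
syllables = lazy_pinyin(text, style=Style.TONE3, errors="default")
print(" ".join(syllables))
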
leijue222 commented 3 years ago

Thank you for your reply. Your suggestions give me a much clearer idea of how to approach this work. Thanks again!

leijue222 commented 3 years ago

Hi! I am training on LJSpeech and Biaobei now (hours = 12, epochs = 25, steps = 9K). Currently the English is starting to sound like words, but the Chinese still sounds like nothing. How long did you have to train before the Chinese output started to sound like Chinese characters, and how long did it take to reach a result you consider good?

YihWenWang commented 3 years ago

Hello, I trained on a V100 for three days.

leijue222 commented 3 years ago

By the way, did you use phonemes when training?

YihWenWang commented 3 years ago

No, I didn't use phonemes for training. I only use the text, speaker, language, spectrogram, and linear spectrogram labels.

leijue222 commented 3 years ago

Thanks. I have a problem with the Mandarin tones: whether I use pinyin or phonemes for training, the pronunciation of the four tones is not accurate.

The differences between us: you used the "pypinyin" package to convert the Mandarin text, while I use the pinyin package from the requirements.txt file, and my data is the Biaobei dataset (10,000 utterances) plus the LJSpeech dataset (5,000 utterances).

I don't know what causes this. Do you have any ideas, or could you share your params.py with me?

SayHelloRudy commented 2 years ago

@YihWenWang May I ask: did you remove the silence from the VCTK dataset before training?

DoritoDog commented 1 year ago

@YihWenWang How did you get the Tacotron mel spectrograms to work with WaveGlow in the end? They seem similar, but they look like they are normalized differently somehow.

Tacotron value examples

[-54.77068739 -47.15882725 -45.828745   -44.59372329 -43.22799777
 -42.7517943  -42.11187298 -42.25688537 -42.81581903 -43.02588636, ...]

Waveglow value examples (for same audio file)

[-3.9470453 -2.820666  -2.7616765 -2.5435247 -2.5574331 -2.2251318
 -2.0958776 -2.0956624 -2.170114  -2.0375078, ...]
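
A small sketch of how the two could be compared. The dB-to-log conversion at the end is only an assumption: Tacotron-style pipelines often store spectrograms in decibels while NVIDIA's WaveGlow is trained on natural-log mel amplitudes, and any extra normalization in either pipeline would have to be undone as well. The file paths are placeholders.

import numpy as np

taco_mel = np.load("tacotron_mel.npy")    # placeholder path; values around -50 as shown above
wg_mel = np.load("waveglow_mel.npy")      # placeholder path; values around -3 as shown above

for name, m in [("tacotron", taco_mel), ("waveglow", wg_mel)]:
    print(name, m.shape, float(m.min()), float(m.max()), float(m.mean()))

# Assumption: if the Tacotron values are amplitude decibels (20 * log10(a)) and WaveGlow expects
# natural-log amplitudes (ln(a)), then ln(a) = dB * ln(10) / 20.
taco_as_log = taco_mel * np.log(10) / 20
print("converted range:", float(taco_as_log.min()), float(taco_as_log.max()))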