Closed YihWenWang closed 4 years ago
Hi! :smile:
I am not sure I understand your question :worried: What do you mean by "two kind of datasets"? You can change the parameters arbitrarily or create another file with your parameters. The dataset is specified by the `dataset` parameter (with values such as `css_comvoi`, `css10`, or `ljspeech`).
Thanks. The situation is that I trained on a dataset containing two languages (English and Chinese), and I didn't change any parameters except for "languages" in generator_switching.json, but I got a bad result by epoch 120. When I synthesized two languages in one sentence with an assigned speaker, the result contained the voices of two speakers, not one. I have no idea what causes this.
Oh, I see.
First, training on only two languages does not make full use of the model's capabilities. The more languages you have, the more information can be shared across them.
Second, you should have more speakers for each language (or very similar voices for both mono-speaker languages). The model has to learn the difference between language-dependent and speaker-dependent information. Fortunately, there is the multi-speaker VCTK dataset for English (a subset with a few speakers should be sufficient) and there are some Chinese voices in my cleaned Common Voice data (see the readme). You do not need many examples per speaker (50 transcript-recording pairs per speaker can be enough), but you should have more speakers with diverse voices (such as 10 or 20). If this is your case, just add these multi-speaker data to your actual dataset and it should get better.
Third, you should definitely reduce `generator_dim` to something like 2..4, and `generator_bottleneck_dim` should be lower than that, e.g., 1 or 2. Also, `speaker_embedding_dimension` should roughly correspond to the number of speakers you have; so with about 20 speakers, use 16 or 32.
Finally, there is `reversal_classifier_w`, which controls the weight of the adversarial speaker classifier's loss. This parameter is really tricky: high values prevent the model from converging, while low values have no effect. However, you should first try to make your data multi-speaker.
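To make the advice above concrete, here is a hedged sketch of starting values for a two-language, roughly 20-speaker setup; the key names mirror the parameters mentioned in this thread, but check the repo's actual generator_switching.json for the authoritative names and defaults.

```python
# Hypothetical starting values for a two-language setup, following the
# advice above. Verify the key names against the real config file
# before using them -- they are an assumption, not copied from the repo.
two_language_overrides = {
    "generator_dim": 2,                 # 2..4 is suggested for few languages
    "generator_bottleneck_dim": 1,      # must stay below generator_dim
    "speaker_embedding_dimension": 16,  # roughly matches ~20 speakers
    "reversal_classifier_w": 0.125,     # tricky: too high blocks convergence, too low does nothing
}
```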
Okay, thank you very much. I will try training on a dataset with more speakers for each language and adjust the parameters.
Hello, I used the VCTK dataset (108 English speakers) and the THCHS-30 dataset (60 Chinese speakers). My generator_switching.json settings: generator_dim = 4, generator_bottleneck_dim = 2, speaker_embedding_dimension = 64, reversal_classifier_w = 0.125. With that trained model, I could synthesize two languages in one sentence with a single speaker. But there is a problem: if I synthesize a sentence such as "recommend the some 社會書。", the voice in the second half of the sentence becomes quieter. I have no idea what causes this.
你好 :grin:
Do I understand it correctly that the voice seems to be the same throughout the whole sentence, but the volume changes? This might be caused by the recordings (from two different datasets) being normalized in different ways. Do the corresponding spectrograms have similar magnitudes?
You can try to normalize your audio files and repeat training with the new data. For example, you can run this command to normalize every `.wav` in `your-directory` to the same volume level.
find "your-directory" -name '*.wav' | while read f; do
sox "${f}" tmp.wav gain -n -3 && mv tmp.wav "${f}"
done
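To answer the "similar magnitudes" question above, a quick sanity check could look like the sketch below; it assumes the spectrograms are already saved as .npy files, one directory per dataset.

```python
import glob
import os

import numpy as np

def mean_magnitude(spectrogram_dir):
    """Average value across all .npy spectrograms in a directory.

    Comparing this number between the two datasets gives a rough idea
    of whether their recordings were normalized consistently."""
    files = glob.glob(os.path.join(spectrogram_dir, "*.npy"))
    return float(np.mean([np.load(f).mean() for f in files]))
```

A large gap between the two datasets' means would support the inconsistent-normalization hypothesis.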
Hope it helps :innocent: 再见
Thanks for your suggestion, I will try it. But I still have a question: if the sample rate of the dataset is 16000 Hz, which parameters should I modify besides the sample rate?
You do not have to change `stft_window_ms` or `stft_shift_ms`, because these values are in milliseconds. However, you can reduce `num_fft` to a lower value such as 1024, because `stft_window_ms * sample_rate` gives you something around 800.
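The arithmetic behind that recommendation, assuming a 50 ms window (which is what yields "around 800" samples at 16 kHz):

```python
# Window length in samples at 16 kHz with a 50 ms STFT window.
sample_rate = 16000
stft_window_ms = 50
window_samples = int(stft_window_ms / 1000 * sample_rate)  # 800

# num_fft should be a power of two no smaller than the window length,
# so 1024 is the smallest value that fits.
num_fft = 1 << (window_samples - 1).bit_length()
```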
Sorry, I want to ask you another question. How do I get the mel spectrogram from the .npy file? I want to use WaveGlow to synthesize the waveform.
Hm, spectrograms are stored as two-dimensional NumPy arrays and saved into `.npy` files. Just use `numpy.load` to load them back into memory.
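A minimal round-trip sketch of that save/load step; the file name and the 80x200 shape here are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Save a fake mel spectrogram to a .npy file, then load it back with
# numpy.load, as suggested above.
mel = np.random.rand(80, 200).astype(np.float32)  # assumed (mel bins, frames) layout
path = os.path.join(tempfile.mkdtemp(), "example_mel.npy")
np.save(path, mel)

loaded = np.load(path)
print(loaded.shape, loaded.dtype)  # inspect orientation and precision before use
```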
If you want to train the WaveGlow model on spectrograms produced by your Tacotron, use the `gta.py` script, which can produce ground-truth-aligned (GTA) spectrograms given your model and the original spectrograms.
But I don't need to train WaveGlow; I have already trained it. I want to feed the synthesized .npy file to WaveGlow and synthesize audio with it. The figure shows the mel spectrogram I get when I load the .npy file and try to synthesize audio with WaveGlow, but I get only noise.
The spectrograms seem to be OK, so I am afraid I cannot help you much. Just a few hints that come to my mind and might help you debug:
About "language_embedding_dimension" and "generator_dim": do they have the same meaning? And why is language_embedding_dimension set to zero when training the five-language model?
No, they don't. `language_embedding_dimension` specifies the dimension of the language embedding concatenated to the decoder input. `generator_dim` defines the dimension of the language embedding used in the parameter generator. It is set to zero because the model gets enough information about the language from the encoder.
@YihWenWang Hello Wang, Can you share your synthesized speech samples? THX!
Hello! Okay, but I don't have my computer at the moment. Sorry, could you wait for me? I will share my synthesized speech tomorrow.
ok, thx!
These are my synthesized speech samples. I used the VCTK and STCMDS datasets to train this model. My synthesized text is "Recommend the some 社會書。". Thanks. VCTK_STCMDS.zip
Thanks
@YihWenWang Samples sound nice, did you train this model based only on VCTK and STCMDS ?
Yes, I only used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets.
@YihWenWang Hi, I'm trying to train a Chinese-English mixed TTS model too. Just like our daily talking, such as:
——A: 请问你从事什么领域的研究? (What field of research do you work in?) ——B: 我从事Computer Vision方面的研究工作。 (I do research on Computer Vision.)
I plan to use the datasets of LJSpeech, ST-CMDS, and Biaobei to train it.
As in the example above, I want to synthesize mixed Chinese-English sentences, but I am not familiar with the TTS field. Could you give me some simple suggestions on the steps needed to change this project?
I used the VCTK (English, 30 speakers) and STCMDS (Mandarin, 30 speakers) datasets. My steps:
- Download the datasets and put them into the "data" folder.
- Organize train.txt and val.txt. Each entry must contain the speaker, language, audio path, spectrogram, linear spectrogram, and text. Transliteration is not strictly required.
- I use the "pypinyin" package to convert Mandarin text.
- In the "data" folder, prepare_css_spectrograms.py must be modified: you have to change the path of the dataset.
- Check whether the "sample rate" parameter matches the sample rate of the audio in the datasets.
- Adjust the "generator_bottleneck_dim" and "generator_dim" parameters according to the number of languages.
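The sample-rate check in the steps above can be sketched with only the standard-library wave module; the 16000 default is just an example value.

```python
import wave

def check_sample_rate(wav_path, expected_rate=16000):
    """Return True if the .wav file's frame rate matches the training
    sample-rate parameter; mismatches here produce garbled spectrograms."""
    with wave.open(wav_path, "rb") as w:
        return w.getframerate() == expected_rate
```

Running this over every file before preprocessing catches stray recordings that were resampled inconsistently.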
Thank you for your reply. Your suggestion gives me a clearer idea of how to do this work. Thanks again!
Hi! I am training on LJSpeech and Biaobei now: hours = 12, epochs = 25, steps = 9K. Currently, the English output begins to sound like words, but the Chinese still sounds like nothing. How long did you train before the Chinese output began to sound like Chinese characters, and how long did it take to reach a result you considered good?
Hello, I used a V100 to train for three days.
By the way, do you use phonemes when training?
No, I didn't use phonemes during training. I just use the labels of text, speaker, language, spectrogram, and linear spectrogram.
Thanks. I have a problem with the tones of Mandarin: whether I use pinyin or phonemes for training, the pronunciation of the four tones is not accurate.
I use the "pypinyin" package to convert Mandarin text.
The differences between us: I use the pinyin package from the requirement.txt file, and I use the Biaobei dataset (10,000 utterances) and the LJSpeech dataset (5,000 utterances). I don't know what causes this. Do you have any ideas, or could you share your `params.py` with me?
May I ask, did you apply silence-removal preprocessing to the VCTK dataset?
@YihWenWang How did you get the Tacotron mel spectrograms to work with WaveGlow in the end? They look similar, but seem to be normalized in different ways.
Tacotron value examples
[-54.77068739 -47.15882725 -45.828745 -44.59372329 -43.22799777
-42.7517943 -42.11187298 -42.25688537 -42.81581903 -43.02588636, ...]
Waveglow value examples (for same audio file)
[-3.9470453 -2.820666 -2.7616765 -2.5435247 -2.5574331 -2.2251318
-2.0958776 -2.0956624 -2.170114 -2.0375078, ...]
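One guess, not confirmed by either codebase, is that the Tacotron values above are in decibels (20·log10 of magnitude) while WaveGlow expects natural-log mel magnitudes. Converting between the two scales, before accounting for any additional reference level or min/max normalization the preprocessing may apply, would look like:

```python
import math

def db_to_natural_log(db):
    """Convert a dB-scaled magnitude to the natural-log scale:
    ln(10 ** (db / 20)) == db * ln(10) / 20."""
    return db * math.log(10) / 20
```

The residual offset between converted Tacotron values and WaveGlow's values on the same audio file would reveal any extra normalization step that still has to be undone.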
Hello, if I train on just two datasets, how do I set parameters such as generator_dim, generator_bottleneck_dim, etc. in generator_switching.json?