Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Evaluation on Chinese mandarin #18

Closed begeekmyfriend closed 6 years ago

begeekmyfriend commented 6 years ago

step-30000-align step-30000-pred-mel-spectrogram step-30000-real-mel-spectrogram eval-30000.zip Here are the evaluation results of my training on a 12-hour Mandarin Chinese corpus. The voice sounds natural but still somewhat rough. The modifications have been published on my own repo in the mandarin branch. Thanks a lot for this work!

ben-8878 commented 6 years ago

Actually I noticed this too. I think the problem may come from a few causes, none verified yet:

  1. The original audio already has little high-frequency content, and we use a 16 kHz sample rate.
  2. Some of the high-frequency signal may have been lost during preprocessing.
  3. The amount of data may be insufficient, which also weakens the high frequencies.
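The first point (little high-frequency content to begin with, at a 16 kHz sample rate) can be checked quantitatively. Below is a hedged, numpy-only sketch — `highband_energy_ratio` is a made-up helper, not from the repo — that estimates what fraction of a signal's spectral energy lies above a cutoff; a low ratio for the source audio would support that theory:

```python
import numpy as np

def highband_energy_ratio(wav, sr, cutoff_hz=4000):
    """Fraction of spectral energy above cutoff_hz (rough blur diagnostic)."""
    spec = np.abs(np.fft.rfft(wav)) ** 2
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    return spec[freqs >= cutoff_hz].sum() / spec.sum()

# Toy check at 16 kHz: a low-pitched tone vs. one with an added 6 kHz component.
t = np.linspace(0, 1, 16000, endpoint=False)
low = np.sin(2 * np.pi * 120 * t)                  # male-voice-like fundamental
bright = low + 0.5 * np.sin(2 * np.pi * 6000 * t)  # adds high-frequency energy

print(highband_energy_ratio(low, 16000))     # close to 0
print(highband_energy_ratio(bright, 16000))  # roughly 0.2
```

Running the same measurement over the real training wavs (and over the predicted mels) would show whether the high band is missing in the data or lost in the model.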

a. 30 hr of data at step 2k (image)

b. 2 hr of data at step 2k (image)

begeekmyfriend commented 6 years ago

@v-yunbin You have a point. Unfortunately mine is a male voice, so the training audio has little high-frequency content to begin with and blurs more easily. Personally I think female voices train better than male ones.

ben-8878 commented 6 years ago

@begeekmyfriend More male-voice data should also work, right? You need to change --voice='male'; the default is 'female'.

butterl commented 6 years ago

@v-yunbin @begeekmyfriend I found that with code from the head of the repo, the high-frequency power decreases as the training steps grow. The older code did not have this effect; even at 300K steps it still had good high-frequency power. (I lost all the code and data in a hard-disk failure and do not remember the exact commit id.)

At 23K steps (image) the high frequencies look fine.

At 140K steps (image) a lot of the high-frequency content seems lost.

At 240K steps (image) the high-frequency power is even lower.

@candlewill Are you using an older repo? Do you remember the commit id? Also, with outputs_per_step = 1 I could not run with a batch size above 8 on a P100 from Google Cloud.

begeekmyfriend commented 6 years ago

@butterl Would you like to provide the samples at both 140K steps and 240K steps? By the way, are the results mel or linear outputs? And how large is your total dataset?

butterl commented 6 years ago

@begeekmyfriend The dataset is THCHS-30, and this is mel output; I haven't tried linear yet, though linear seems much better than mel. Also, with the new repo, feeding the eval mel file into a WaveNet trained on real wavs produces only noise, while the older Tacotron code (the lost one) was good.

As for audio samples: because of a firewall rule I cannot transfer files from the borrowed VM. I tried talking to the IT guy, but failed :(

begeekmyfriend commented 6 years ago

@butterl It seems you only used the male voices of THCHS-30 for training, right? You can drag your wav files (zipped) into the comment box of this issue to upload your samples to GitHub. For the old Tacotron version, you may try https://github.com/r9y9/Tacotron-2

ben-8878 commented 6 years ago

I noticed a change in the Griffin-Lim parameters: original repo: power = 1.55; new repo: power = 1.2.
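For context, that power value is the exponent applied to the magnitude spectrogram before Griffin-Lim phase recovery; raising it above 1 sharpens the spectrogram before inversion. A minimal numpy/scipy sketch of the idea — a simplification, not the repo's actual audio.py, and the frame alignment here is deliberately crude:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, power=1.55, n_iter=30, nperseg=512):
    """Recover a waveform from a magnitude spectrogram.

    `power` is the exponent discussed above: magnitudes are raised to this
    power before the iterative phase estimation begins.
    """
    mag = mag ** power
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, wav = istft(mag * angles, nperseg=nperseg)
        _, _, spec = stft(wav, nperseg=nperseg)
        # Crude frame-count alignment between iterations.
        if spec.shape[1] < mag.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        else:
            spec = spec[:, :mag.shape[1]]
        angles = np.exp(1j * np.angle(spec))  # keep estimated phase only
    _, wav = istft(mag * angles, nperseg=nperseg)
    return wav
```

A higher power boosts strong bins relative to weak ones, which is why tuning it audibly changes the sharpness of Griffin-Lim output.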

butterl commented 6 years ago

@begeekmyfriend The wavs and the whole environment are on a remote VM and I cannot get them onto an internet-connected PC :( I trained with nearly all of THCHS-30; the output sometimes sounds male, sometimes female, and sometimes like a strange voice (multi-speaker data causes this).

ben-8878 commented 6 years ago

The voice setting is important: if the data you use is female, set it to female and training will converge faster.

begeekmyfriend commented 6 years ago

@v-yunbin The --voice option only selects the path within the M-AILABS dataset. It has no effect on other datasets.

candlewill commented 6 years ago

@begeekmyfriend Sorry for the late reply. I used the old version of the repo to get the result without blur, with a sample rate of 22050 Hz. I also tried 16 kHz, but the high frequencies were blurred.

ben-8878 commented 6 years ago

@begeekmyfriend Have you managed to run WaveNet training?

begeekmyfriend commented 6 years ago

@v-yunbin I ran r9y9's wavenet vocoder, not this one.

DaisyHH commented 6 years ago

The THCHS-30 corpus has no punctuation; with punctuation, sentence boundaries are much clearer. Using phonemes instead of characters in /tacotron/utils/symbols.py also gives better performance.
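The phoneme suggestion for symbols.py could look roughly like this: enumerate pinyin initials plus tone-marked finals instead of Chinese characters. This is a hypothetical sketch — the inventory below is abbreviated and is not the repo's actual symbol set:

```python
# Hypothetical pinyin-phoneme symbol set for tacotron/utils/symbols.py,
# replacing per-character symbols (names and inventory are illustrative).
_pad = '_'
_eos = '~'
_initials = ['b', 'p', 'm', 'f', 'd', 't', 'n', 'l', 'g', 'k', 'h',
             'j', 'q', 'x', 'zh', 'ch', 'sh', 'r', 'z', 'c', 's', 'y', 'w']
_finals = ['a', 'o', 'e', 'i', 'u', 'v', 'ai', 'ei', 'ao', 'ou',
           'an', 'en', 'ang', 'eng', 'er', 'ong']
_tones = ['1', '2', '3', '4', '5']  # 5 = neutral tone

# One symbol per initial, plus each final combined with each tone digit.
symbols = [_pad, _eos] + _initials + [f + t for f in _finals for t in _tones]
```

A phoneme inventory like this is small and closed, so the model never meets an out-of-vocabulary unit, unlike a character set covering thousands of hanzi.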

@begeekmyfriend I can now get a decent wave file, but at 30k+ steps the audio still carries an electrical-sounding noise, a little like your result at the beginning of eval-30000.zip.

Is this because the training steps are not enough, or because of some parameter in hparams.py?

begeekmyfriend commented 6 years ago

@DaisyHH If you just need a vocoder like Griffin-Lim, you need ~200K steps.

ben-8878 commented 6 years ago

@begeekmyfriend How many steps did you train Tacotron for? Mine has reached 50k, and the synthesized speech contains a hissing noise. Is that because it has not fully converged yet? step-50k-eval.zip

begeekmyfriend commented 6 years ago

@v-yunbin You converged long ago. You can refer to @butterl's training at https://github.com/Rayhane-mamah/Tacotron-2/issues/18#issuecomment-400577802

Rayhane-mamah commented 6 years ago

I don't understand much Chinese, but I'm assuming you're saying "this issue is a piece of cake and it is fixed" :)

ben-8878 commented 6 years ago

@begeekmyfriend What could cause the chaotic pauses in my synthesized speech? Your samples don't seem to have this problem. https://github.com/Rayhane-mamah/Tacotron-2/issues/122 text: yu2 jian4 jun1 wei4 mei3 ge4, you3 cai2 neng2 de5 ren2, ti2 gong1 ping2 tai2,ta1 shi4 yin1 pin2 ling3 yu4 de5, tao2 bao3 tian1 mao1, zai4 zhe4 ge4 ping2 tai2 shang4, mei3 ge4 nei4 rong2 sheng1 chan3 zhe3, dou1 ke3 yi3 hen3 fang1 bian4 de5,shi1 xian4 zi4 wo3 jia4 zhi2, geng4 duo1 de5 ren2, yong1 you3 wei1 chuang4 ye4 de5 ji1 hui4, bu4 guo4 ta1 men5 zhi3 shi4 da1 dang4, bu2 shi4 chang2 jian4 de5 fu1 qi1 dang4 mo2 shi4, yong4 yu2 jian4 jun1 de5 hua4 lai2 shuo1, zhe4 ge4 mo2 shi4 ye3 bu4 chang2 jian4, wave: speech-wav-00001-linear.zip

begeekmyfriend commented 6 years ago

@v-yunbin There is no punctuation in the THCHS-30 dataset; you need to train on your own corpus.

kunguang commented 6 years ago

@begeekmyfriend Mine is also the THCHS-30 training set. Why is the loss already so low while the alignment plot still looks like this?

step-1500-align

begeekmyfriend commented 6 years ago

@kunguang Are you using the latest version? Alignment only appears at around 40K steps.

kunguang commented 6 years ago

@begeekmyfriend Yes. Overall I use the latest master code from your branch, with a few modifications:

  1. preprocess.py, symbols.py, and the code under the datasets folder come from the mandarin branch.
  2. I also removed text = expand_numbers(text) from the cleaner to disable number expansion; otherwise the tone digits all get expanded into "one", "two", and so on.

My machine only runs about 1800 steps a day, so 40k would take 20 days. How many days did it take you to reach 40k?

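The expand_numbers problem can be sidestepped with a cleaner that never touches digits, since in pinyin input they are tone marks rather than numbers. A hypothetical sketch — `mandarin_cleaner` is not an actual function in the repo:

```python
import re

def mandarin_cleaner(text):
    """Hypothetical cleaner for pinyin-with-tone-digit input.

    Deliberately omits expand_numbers(): digits here are tone marks
    (e.g. 'ni3'), not numbers to be spelled out as words.
    """
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

print(mandarin_cleaner('Ni3  Hao3 '))  # -> 'ni3 hao3'
```

With this, an input like 'ni3 hao3' passes through with its tone digits intact, whereas the default English cleaning pipeline would expand the digits into number words.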
begeekmyfriend commented 6 years ago

@kunguang About 3~4 days on a GTX 1080Ti. By the way, you can set outputs_per_step = 5 for acceleration. You should use my mandarin-new branch.
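On why outputs_per_step speeds things up: it is the decoder's reduction factor r, meaning each decoder iteration predicts r mel frames at once, so the decoder loop runs roughly len(mel)/r times. A toy illustration of that arithmetic (the helper name is made up):

```python
def decoder_steps(n_mel_frames, outputs_per_step):
    """Decoder iterations needed for a target of n_mel_frames when each
    iteration emits outputs_per_step frames (ceiling division)."""
    return -(-n_mel_frames // outputs_per_step)

# A 1000-frame target: r=1 needs 1000 iterations, r=5 only 200.
print(decoder_steps(1000, 1), decoder_steps(1000, 5))  # -> 1000 200
```

The trade-off is coarser temporal resolution per attention step, which is also why alignment often forms earlier with larger r.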

kunguang commented 6 years ago

Thanks, I'll try it.

logicxin commented 6 years ago

@begeekmyfriend Hello,

  1. Is the mandarin-new branch a modification of Tacotron-2 for a Mandarin Chinese corpus? Can I use it to train a Mandarin model?
  2. Does the mandarin-new branch support multiple GPUs on a single machine?
  3. Does training the tacotron2 and wavenet models in turn take 3~4 days in total?

begeekmyfriend commented 6 years ago

@logicxin The mandarin-new branch is for Griffin-Lim and its evaluation results are satisfactory. WaveNet is also available, but I recommend Griffin-Lim. This branch only supports a single GPU. If you want WaveNet, you should disable the predict_linear option, train Tacotron for about 120K steps, and then train WaveNet for about 1M steps. Training everything would take about 2 weeks. Moreover, in WaveNet mode it takes about half an hour to synthesize a wav clip of only several seconds. That is why I chose Griffin-Lim as the synthesizer.

logicxin commented 6 years ago

@v-yunbin How did your THCHS-30 training turn out? In my case, when the second model (WaveNet) reached step 300K, the result was completely garbled sound.

begeekmyfriend commented 6 years ago

@logicxin You need to check your own training corpus, for example whether the punctuation was marked correctly...

logicxin commented 6 years ago

You need to check your own training corpus, for example whether the punctuation was marked correctly

I am currently using THCHS-30 as the training corpus. Could the cause be one of the following?

  1. The corpus itself, e.g. it mixes many male and female speakers?
  2. Too few training steps (currently step = 300K)?

If I train on a pure male-voice corpus with your mandarin-new branch, one question: compared with the current master branch of Rayhane-mamah/Tacotron-2, what optimizations does the mandarin-new branch make?

begeekmyfriend commented 6 years ago

@logicxin My branch mainly deals with Mandarin adaptation; everything else is basically unchanged. You have heard my training results. But I did not use THCHS-30: for one thing it is multi-speaker, and for another it has no punctuation.

QueenKeys commented 6 years ago

Hi, when I run wavenet's train.py I get the following error:

Traceback (most recent call last):
  File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/train.py", line 296, in <module>
    wavenet_train(args, log_dir, hparams, input_path)
  File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/train.py", line 268, in wavenet_train
    return train(log_dir, args, hparams, input_path)
  File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/train.py", line 175, in train
    model, stats = model_train_mode(args, feeder, hparams, global_step)
  File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/train.py", line 123, in model_train_mode
    feeder.input_lengths, x=feeder.inputs)
  File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/models/wavenet.py", line 219, in initialize
    y_hat = self.step(x, c, g, softmax=False)  # softmax is automatically computed inside softmax_cross_entropy if needed
  File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/models/wavenet.py", line 565, in step
    x, h = conv(x, c, g_bct)
TypeError: __call__() takes 2 positional arguments but 4 were given

How can I solve this?

begeekmyfriend commented 6 years ago

No need for THCHS-30 any more.

wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

antontc commented 6 years ago

Hello. I used your hyperparameters. Tacotron is at 200k steps, trained on THCHS-30. Now when I run synthesis I get audio like this, and it is quite bad. Maybe you can help me?

wavs.zip

JamesZHANGatTJU commented 5 years ago

@begeekmyfriend Hi, after unpacking THCHS-30 I tried to run preprocessing and got the following error. How can I solve it? Thanks!

initializing preprocessing..
Selecting data folders..
0it [00:00, ?it/s]
Write 0 utterances, 0 mel frames, 0 audio timesteps, (0.00 hours)
Traceback (most recent call last):
  File "preprocess.py", line 112, in <module>
    main()
  File "preprocess.py", line 108, in main
    run_preprocess(args, modified_hp)
  File "preprocess.py", line 85, in run_preprocess
    preprocess(args, input_folders, output_folder, hparams)
  File "preprocess.py", line 18, in preprocess
    write_metadata(metadata, out_dir)
  File "preprocess.py", line 30, in write_metadata
    print('Max input length (text chars): {}'.format(max(len(m[5]) for m in metadata)))
ValueError: max() arg is an empty sequence
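For what it's worth, that ValueError comes from calling max() on an empty metadata list: zero utterances were collected ("0it" and "Write 0 utterances" above), which usually means the dataset path given to preprocess.py does not match the corpus layout on disk. A hedged sketch of a friendlier guard, adapted from the traceback — not the repo's actual code, and the real write_metadata also writes the metadata file, which is omitted here:

```python
def write_metadata(metadata, out_dir):
    # Fail early with a useful message instead of a bare ValueError from max().
    if not metadata:
        raise RuntimeError(
            'No utterances found -- check that the dataset path given to '
            'preprocess.py matches the corpus layout on disk.')
    # (the real function also writes train.txt here; omitted in this sketch)
    print('Max input length (text chars): {}'.format(
        max(len(m[5]) for m in metadata)))
```

With the guard in place, an empty run reports the likely cause (wrong input folder) rather than crashing inside the summary print.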

begeekmyfriend commented 5 years ago

No THCHS-30 any more.

wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

wuzhonglijz commented 5 years ago

@begeekmyfriend Hi, may I ask why my model can only synthesize about 14 s of speech, after which no sound comes out? Your model seems able to synthesize a long passage of text.

begeekmyfriend commented 5 years ago

printf "file %s\n" *.wav > list.txt
ffmpeg -f concat -i list.txt -c copy demo.wav

wuzhonglijz commented 5 years ago

printf "file %s\n" *.wav > list.txt
ffmpeg -f concat -i list.txt -c copy demo.wav

Thanks for your reply; this script is very useful to me. 👍

lydhr commented 5 years ago

@begeekmyfriend Could you please share your pretrained model based on your private dataset?

wuqi930907 commented 5 years ago

Hello, I trained a Mandarin TTS model on your https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-griffin-lim branch, but found a problem: when I load several models on the same GPU for concurrent synthesis, the synthesis time grows linearly with each extra model. For example, with one model loaded, synthesizing one sentence takes 0.7 s; with two models synthesizing the same sentence concurrently, each takes about 1.4 s. Do you see the same behavior? If so, it is really bad for real-time synthesis and production use. @begeekmyfriend

canyanol650 commented 5 years ago

Can you write a blog post showing us how to train Mandarin TTS with Tacotron 2?

coderLong commented 5 years ago

In the THCHS-30 corpus, the training data contains 30 speakers. Is it possible to tell which wav file belongs to which speaker? In A2_2.wav, does A2 denote the speaker_id?

begeekmyfriend commented 5 years ago

Please use the Biaobei open Mandarin corpus. Do not use THCHS30 any more.

zhuangzhuangxie commented 5 years ago

I trained on the Biaobei dataset for 70K steps (batch_size=4), but why does the alignment plot still look like this?

step-74000-align

begeekmyfriend commented 5 years ago

Batch size has to be no less than 32.

zhuangzhuangxie commented 5 years ago

@begeekmyfriend With batch_size set to 32 my GPU runs out of memory; can I set it to 4? Is the plot above unaligned just because training has not gone far enough?

bimunlp commented 5 years ago

eval-112000.zip Terminal_train_log.zip Here are the latest evaluation results at 112K steps and the training log. I have to say the results are amazing!

amazing results!

zyyuan0915 commented 5 years ago

What dataset was used? Does the data include prosody annotation?