Actually I noticed this too. I think the problem may be caused by the following factors, though I have not verified them yet:
a. 30hr data on step 2k
b. 2hr data on step 2k
@v-yunbin You have a point. Unfortunately mine is a male voice; the training speech inherently has less high-frequency content, so it blurs more easily. Personally I think female voices are a bit easier to train than male ones.
@begeekmyfriend A bit more male-voice data should also work, right? You need to change --voice='male'; the default is 'female'.
@v-yunbin @begeekmyfriend I found that with the code at the head of the repo, the high-frequency power decreases as the training steps grow. The older code did not have this effect; at 300K steps it still had good high-frequency power. (I lost all code and data in a hard disk failure and do not remember the exact commit ID.)
At 23K steps the high frequencies still look fine.
But 140K seems to lose a lot in the high frequencies.
And 240K has even less power in the high frequencies.
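One way to quantify this observation, as a minimal sketch with librosa (the 4 kHz cutoff and the file names are made up for illustration):

```python
# Compare the fraction of spectral energy above a cutoff across checkpoints
# to quantify high-frequency loss as training progresses.
import librosa
import numpy as np

def high_freq_ratio(wav_path, cutoff_hz=4000, sr=16000, n_fft=1024):
    y, _ = librosa.load(wav_path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2        # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)  # bin center freqs
    return S[freqs >= cutoff_hz].sum() / S.sum()

# e.g. print(high_freq_ratio('eval-23000.wav'), high_freq_ratio('eval-240000.wav'))
```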
@candlewill Are you using an older repo? Do you remember the commit ID? It seems that with outputs_per_step = 1 I cannot run with batch size > 8 on a P100 from Google Cloud.
@butterl Would you like to provide the samples at both 140K and 240K steps? By the way, are the results mel or linear outputs? And how many hours is your total dataset?
@begeekmyfriend The dataset is THCHS-30 and this is the mel output; I have not tried linear yet, though linear seems much better than mel. Also, with the new repo, feeding the eval mel file to a WaveNet trained on real wav produces only noise, while the older Tacotron code (the lost one) was fine.
As for audio samples: because of a firewall rule I could not transfer files from the borrowed VM. I tried talking to the IT guy, but failed :(
@butterl It seems you only used the male voices of THCHS-30 for training, right? You may drag your wav files (zipped) into the dialog box of this issue to upload your samples to GitHub. For the old Tacotron version, you may try https://github.com/r9y9/Tacotron-2
I noticed a change in the Griffin-Lim parameters: original repo: power = 1.55; new repo: power = 1.2.
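For context, this power value is the sharpening exponent applied to the predicted magnitude spectrogram before Griffin-Lim inversion; raising the magnitudes to a power > 1 emphasizes harmonics relative to noise. A minimal sketch with librosa (the hop and window lengths are illustrative, not the repo's exact values):

```python
import librosa
import numpy as np

def griffin_lim_invert(magnitudes, power=1.55, n_iter=60,
                       hop_length=275, win_length=1100):
    # Sharpen the magnitude spectrogram before phase reconstruction;
    # power = 1.55 (old repo) vs 1.2 (new repo) audibly changes crispness.
    S = np.abs(magnitudes) ** power
    return librosa.griffinlim(S, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)
```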
@begeekmyfriend The wav files and the whole environment are on a remote VM and I cannot get them onto an internet-connected PC :( I trained with nearly all of THCHS-30; the output sometimes turns out male, sometimes female, and sometimes a strange voice (multi-speaker data will do this).
The voice setting is important: if the data you use is female, set it to female and training will converge faster.
@v-yunbin The --voice option is only for selecting the path within the M-AILABS dataset. It does nothing for other datasets.
@begeekmyfriend Sorry for the late reply. I used the old version of the repo to get the result without blur, at a sample rate of 22050. I also tried 16k, but the high frequencies were blurry.
@begeekmyfriend Have you managed to run the WaveNet training successfully?
@v-yunbin I ran r9y9's wavenet vocoder, not this one.
The THCHS-30 corpus has no punctuation; with punctuation, the separation between sentences is clearer. Using phonemes instead of characters in /tacotron/utils/symbols.py will also give better performance.
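As a minimal sketch, a pinyin-phoneme inventory for tacotron/utils/symbols.py might look like the following (the initial/final lists below are abbreviated for illustration, not the exact set used in any branch):

```python
# tacotron/utils/symbols.py -- illustrative pinyin phoneme symbol set
_pad = '_'
_eos = '~'
_punctuation = list(',.?! ')

# Abbreviated pinyin initials and finals; tones 1-5 attach to each final.
_initials = ['b', 'p', 'm', 'f', 'd', 't', 'n', 'l', 'g', 'k', 'h',
             'j', 'q', 'x', 'zh', 'ch', 'sh', 'r', 'z', 'c', 's']
_finals = ['a', 'o', 'e', 'i', 'u', 'v', 'ai', 'ei', 'ao', 'ou',
           'an', 'en', 'ang', 'eng', 'ong', 'er']
_tones = ['1', '2', '3', '4', '5']

# One symbol per toned final, e.g. 'a1' or 'ong4', plus plain initials.
symbols = [_pad, _eos] + _punctuation + _initials + \
          [f + t for f in _finals for t in _tones]
```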
@begeekmyfriend Now I can get a decent wave file, but at 30k+ steps the sound is still accompanied by an electrical-sounding noise, a little like your result at the beginning of eval-30000.zip.
Is this because there are not enough training steps, or is it some parameter in hparams.py?
@DaisyHH If you just need a vocoder like Griffin-Lim, you need ~200K steps.
@begeekmyfriend How many steps did you train Tacotron for? Mine has reached 50k, and the synthesized speech contains a hissing noise. Is that because it has not fully converged yet? step-50k-eval.zip
@v-yunbin Yours converged long ago. You can refer to @butterl's training: https://github.com/Rayhane-mamah/Tacotron-2/issues/18#issuecomment-400577802
I don't understand Chinese much, but I'm assuming you're saying "this issue is a piece of cake and it is fixed" :)
@begeekmyfriend What might cause the chaotic pauses in the synthesized speech? I listened to your samples and they do not seem to have this problem. https://github.com/Rayhane-mamah/Tacotron-2/issues/122 text: yu2 jian4 jun1 wei4 mei3 ge4, you3 cai2 neng2 de5 ren2, ti2 gong1 ping2 tai2,ta1 shi4 yin1 pin2 ling3 yu4 de5, tao2 bao3 tian1 mao1, zai4 zhe4 ge4 ping2 tai2 shang4, mei3 ge4 nei4 rong2 sheng1 chan3 zhe3, dou1 ke3 yi3 hen3 fang1 bian4 de5,shi1 xian4 zi4 wo3 jia4 zhi2, geng4 duo1 de5 ren2, yong1 you3 wei1 chuang4 ye4 de5 ji1 hui4, bu4 guo4 ta1 men5 zhi3 shi4 da1 dang4, bu2 shi4 chang2 jian4 de5 fu1 qi1 dang4 mo2 shi4, yong4 yu2 jian4 jun1 de5 hua4 lai2 shuo1, zhe4 ge4 mo2 shi4 ye3 bu4 chang2 jian4, wave: speech-wav-00001-linear.zip
@v-yunbin There is no punctuation in the THCHS-30 dataset; you need to train with your own punctuated corpus.
@begeekmyfriend Mine is also the THCHS-30 training set. Why is the loss already so low while the alignment plot still looks like this?
@kunguang Are you using the latest version? It only aligns at around 40K steps.
@begeekmyfriend Yes, I am. Overall I used the latest master code from your branch, with some modifications: 1. preprocess.py, symbols.py, and the code under the datasets folder come from the mandarin branch,
@kunguang On a GTX 1080Ti, about 3~4 days. By the way, you can set outputs_per_step = 5 for acceleration.
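For reference, outputs_per_step is the decoder reduction factor r in hparams.py: the decoder emits r mel frames per step, so the number of decoder iterations shrinks roughly r-fold, at some cost in fine detail. A toy illustration (the frame count is made up):

```python
# With r = 5, a 1000-frame mel target needs only 200 decoder iterations.
outputs_per_step = 5            # reduction factor r, as in hparams.py
n_mel_frames = 1000             # hypothetical utterance length

decoder_steps = -(-n_mel_frames // outputs_per_step)  # ceiling division
print(decoder_steps)            # -> 200, vs 1000 when r = 1
```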
You should use my mandarin-new branch.
Thanks, I'll give it a try.
@begeekmyfriend Hello,
@logicxin The mandarin-new branch is for G&L (Griffin-Lim) and its evaluation results are satisfactory. WaveNet is also available, but I recommend G&L. This branch is only for a single GPU. If you want WaveNet, you should disable the predict_linear option, train Tacotron for about 120K steps, and then train WaveNet for about 1M steps. All of the training would take you about two weeks. But in WaveNet mode it takes about half an hour to synthesize a wav clip only several seconds long. That is why I chose G&L as the synthesizer.
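A sketch of the two pipeline choices described above; only predict_linear is an actual hparam name from this repo, the other variables are illustrative:

```python
# Choose between the two pipelines described above.
USE_WAVENET = False  # False: Tacotron linear outputs -> Griffin-Lim

predict_linear = not USE_WAVENET   # G&L inverts predicted linear spectrograms
tacotron_steps = 120_000           # stage 1: Tacotron
wavenet_steps = 1_000_000 if USE_WAVENET else 0  # stage 2: WaveNet vocoder
```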
@v-yunbin How did your THCHS-30 training turn out? When my second model (WaveNet) reached step 300K, the result was completely garbled sound.
@logicxin You need to check your own training corpus, e.g. whether the punctuation was marked correctly or not...
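A quick way to run such a check, as a minimal sketch assuming a pipe-separated metadata file whose last field is the transcript (the file name and format are assumptions):

```python
# Flag transcript lines that contain no punctuation at all.
PUNCT = set(',.?!')

with open('training_data/train.txt', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        text = line.rsplit('|', 1)[-1]       # last field: the transcript
        if not PUNCT & set(text):
            print(f'line {i}: no punctuation -> {text.strip()}')
```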
I am currently using THCHS-30 as the training corpus; could that be the cause? What if I trained on a pure male-voice corpus with your mandarin-new branch? One question: compared with the current master branch of Rayhane-mamah/Tacotron-2, what optimizations and adjustments does the mandarin-new branch make?
@logicxin My branch mainly focuses on Mandarin support; everything else is basically unchanged. You have already heard the training results. But I did not use THCHS-30: for one thing it is multi-speaker, and for another it has no punctuation.
Hello, when I run wavenet's train.py I get this error:
Traceback (most recent call last):
File "/home/queen/document/Experiment/Tacotron-2-master/wavenet_vocoder/train.py", line 296, in
No need for THCHS-30 any more.
wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
Hello. I used your hyperparameters and THCHS-30; Tacotron is at 200k steps. Now when I run synthesis, the audio I get is very bad. Maybe you can help me?
@begeekmyfriend Hello, after unpacking THCHS-30 I tried to run preprocessing and got the following error. How can I fix it? Thanks!
initializing preprocessing..
Selecting data folders..
0it [00:00, ?it/s]
Write 0 utterances, 0 mel frames, 0 audio timesteps, (0.00 hours)
Traceback (most recent call last):
File "preprocess.py", line 112, in
No more THCHS-30.
wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
@begeekmyfriend Hello, I would like to ask why my model can only synthesize about 14s of speech, after which no sound comes out. Your model seems able to synthesize a long passage of text.
printf "file %s\n" *.wav > list.txt ffmpeg -f concat -i list.txt -c copy demo.wav
printf "file %s\n" *.wav > list.txt ffmpeg -f concat -i list.txt -c copy demo.wav
Thanks for your reply; this script is very useful to me. 👍
@begeekmyfriend Could you please share your pretrained model based on your private dataset?
Hello, I trained a Chinese TTS model on your https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-griffin-lim branch, but I found a problem: when I load multiple models on the same GPU for concurrent synthesis, the synthesis time grows linearly with each additional model. For example, with one model loaded, synthesizing one sentence takes 0.7s; with two models synthesizing the same sentence concurrently, each takes about 1.4s. Do you see the same behavior? If so, it is really bad for real-time synthesis and production use. @begeekmyfriend
eval-30000.zip Here are the evaluation results of my training on a 12-hour Chinese Mandarin corpus. The voice sounds natural but still somewhat rough. The modification has been published on my own repo in the mandarin branch. Thanks a lot for this work! Can you write a blog post to show us how to train Mandarin TTS with Tacotron 2?
In the THCHS30 corpus, the train data contains 30 speakers. Is it possible to tell which wav file was spoken by which person? In the file name A2_2.wav, does A2 indicate the speaker_id?
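In case it helps, a minimal sketch that groups wav files by that prefix, assuming the <speaker_id>_<utterance_id>.wav naming convention:

```python
# Group THCHS-30 wav files by speaker, e.g. 'A2_2.wav' -> speaker 'A2'.
import os
from collections import defaultdict

def group_by_speaker(wav_dir):
    speakers = defaultdict(list)
    for name in os.listdir(wav_dir):
        if name.endswith('.wav'):
            speakers[name.split('_')[0]].append(name)
    return speakers

# e.g. {spk: len(files) for spk, files in group_by_speaker('data_thchs30/train').items()}
```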
Please use the Biaobei open Mandarin corpus. Do not mention THCHS30 any more.
I trained on the Biaobei dataset for 70K steps (batch_size=4), but why does the alignment plot still look like this?
Batch size has to be no less than 32.
@begeekmyfriend With batch_size set to 32 my GPU runs out of memory. Can I set it to 4? Is the plot above not aligned because there has not been enough training?
eval-112000.zip Terminal_train_log.zip Here are the latest evaluation results at 112K steps and the training log. I have to say the results are amazing!
amazing results!
What dataset is used? Does the data use rhythm (prosody) annotation information?