Plachtaa / VITS-fast-fine-tuning

This repo provides a pipeline for fine-tuning VITS, enabling fast speaker-adaptation TTS and many-to-many voice conversion.
Apache License 2.0

Hi author, if I add more Chinese and English datasets on top of your base model, will the resulting Chinese/English quality be better? #220

Closed NsLearning closed 11 months ago

NsLearning commented 1 year ago

Right now the Chinese/English accent doesn't sound great. I want to train a Chinese-English TTS — can I train directly on top of this base model?

Plachtaa commented 1 year ago

Yes, you can.

nekomiya-hinata commented 1 year ago

The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

NsLearning commented 1 year ago

> The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

How many Chinese and English clips did you use, respectively? How many steps per epoch, and after how many epochs did it sound good? I haven't tried it yet.

nekomiya-hinata commented 1 year ago

> The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

> How many Chinese and English clips did you use, respectively? How many steps per epoch, and after how many epochs did it sound good? I haven't tried it yet.

7,500 English clips and 13k Chinese clips, with no extra Japanese audio. batch_size is 16, roughly 1,395 steps per epoch. In my experience it already sounds quite good after about 60 epochs.
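(For reference, a rough sanity check on those numbers — my own back-of-the-envelope, not from this thread: steps per epoch is roughly the number of clips divided by the batch size, and the gap to the reported ~1,395 steps would be consistent with the auxiliary samples this pipeline draws from the pretraining set.)

```python
import math

en_clips, zh_clips = 7_500, 13_000
batch_size = 16

# Steps that the user's own data alone would account for.
own_steps = math.ceil((en_clips + zh_clips) / batch_size)
print(own_steps)                         # 1282

# The reported ~1395 steps/epoch would then imply roughly this many
# additional (auxiliary) clips in the training set -- an assumption.
print((1395 - own_steps) * batch_size)   # ~1800
```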

NsLearning commented 1 year ago

I fine-tuned with roughly 17k English + 20k Chinese clips, batch_size=32. After 69 epochs the results are as follows:

text -> [ZH]辅助训练数据是从预训练的大数据集抽样得到的,作用在于防止模型在标注不准确的数据上形成错误映射。[ZH][EN] To be honest, I have no idea what to say as examples. [EN]

(The Chinese sentence in the prompt says that the auxiliary training data is sampled from the large pretraining set to keep the model from learning wrong mappings on inaccurately labeled data.)

English reference audio: https://github.com/NsLearning/tts-work/blob/main/p226_003_mic1.flac
English synthesized sample: https://github.com/NsLearning/tts-work/blob/main/p226-77k.wav
Chinese reference audio: https://github.com/NsLearning/tts-work/blob/main/SSB00050353.wav
Chinese synthesized sample: https://github.com/NsLearning/tts-work/blob/main/SSB0005-77K.wav

So far, Chinese has improved a lot and the timbre is fairly close, but the English is still spoken too fast, with many words slurred or dropped, and the timbre doesn't match; presumably that's because the base model's (CJE) English isn't very good? To get a decent Chinese-English VITS, should I keep fine-tuning, or train a Chinese-English base model from scratch? Would appreciate your advice @Plachtaa
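(For anyone reproducing the batch_size=32 setting, a minimal sketch; it assumes the repo's fine-tuning config follows the usual VITS layout with a `train.batch_size` field, and the config path below is only a guess — check it against your own `finetune_speaker.json`.)

```python
import json

cfg_path = "configs/finetune_speaker.json"  # assumed location of the fine-tuning config

with open(cfg_path, encoding="utf-8") as f:
    cfg = json.load(f)

# Standard VITS-style configs keep the batch size under the "train" section.
cfg["train"]["batch_size"] = 32  # the value used in the experiment above

with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```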

NsLearning commented 1 year ago

> The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

> How many Chinese and English clips did you use, respectively? How many steps per epoch, and after how many epochs did it sound good? I haven't tried it yet.

> 7,500 English clips and 13k Chinese clips, with no extra Japanese audio. batch_size is 16, roughly 1,395 steps per epoch. In my experience it already sounds quite good after about 60 epochs.

How is your English quality? I feel my English output is still not great, and I used 17k English clips.

Plachtaa commented 1 year ago

> I fine-tuned with roughly 17k English + 20k Chinese clips, batch_size=32. After 69 epochs the results are as follows: […] To get a decent Chinese-English VITS, should I keep fine-tuning, or train a Chinese-English base model from scratch? Would appreciate your advice @Plachtaa

My suggestion is to avoid training too many languages in a single VITS model. One model per language gives the best results, because a small model has limited capacity.

NsLearning commented 1 year ago

> I fine-tuned with roughly 17k English + 20k Chinese clips, batch_size=32. After 69 epochs the results are as follows: […] To get a decent Chinese-English VITS, should I keep fine-tuning, or train a Chinese-English base model from scratch? Would appreciate your advice @Plachtaa

> My suggestion is to avoid training too many languages in a single VITS model. One model per language gives the best results, because a small model has limited capacity.

So if I do want a Chinese-English bilingual model, it would be best to retrain the base model from scratch, right? After all, Japanese takes up a share of the capacity.

NsLearning commented 1 year ago

Thanks for the great work! I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable; samples below:

https://github.com/NsLearning/tts-work/blob/main/Obama_EN_1.wav
https://github.com/NsLearning/tts-work/blob/main/Obama_EN_2.wav
https://github.com/NsLearning/tts-work/blob/main/Obama_ZH_1.wav
https://github.com/NsLearning/tts-work/blob/main/Obama_ZH_2.wav

Chinese TTS now handles Chinese text better, and the same goes for English TTS, so synthesizing both ZH and EN is no longer bad. Very grateful to the project author. I know there is still a lot of room for improvement; I'll try other methods next time.
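(To make the two-stage setup concrete, a minimal sketch of how the stage-2 training filelist could be assembled; it assumes this repo's `wav_path|speaker_name|[ZH]text[ZH]`-style annotation format, and all filenames here are hypothetical.)

```python
# Hypothetical sketch: combine the bilingual 9k-clip set and the target-voice
# clips with the stage-1 data before the second fine-tuning run.
stage1 = "filelists/stage1_zh_en_anno.txt"      # 17k EN + 20k ZH (hypothetical name)
stage2_extra = [
    "filelists/bilingual_9k_anno.txt",          # speakers covering both ZH and EN
    "filelists/target_voice_anno.txt",          # target-timbre clips
]

lines = []
for path in [stage1] + stage2_extra:
    with open(path, encoding="utf-8") as f:
        lines.extend(l.strip() for l in f if l.strip())

with open("filelists/stage2_train_anno.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
print(f"{len(lines)} training entries written")
```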

treya-lin commented 1 year ago

> Thanks for the great work! I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable […]

Hi, I'd like to ask: is your training pipeline like this? Stage 1: 17k English-only + 20k Chinese-only data, 200 epochs. Stage 2: 9k mixed Chinese-English data + [some of the stage-1 data, or additional Chinese-only / English-only speakers?], 100 epochs. It's mainly the stage-2 data I don't quite follow — could you describe its composition in more detail? Was the Obama data also added only in stage 2?

mikeyang01 commented 1 year ago

> I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable; samples below:

Hi, what data are you using? Is the 17k English data the VCTK corpus?

shirubei commented 11 months ago

> I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable; samples below:

> Hi, what data are you using? Is the 17k English data the VCTK corpus?

Same question here. Also, could you share the concrete steps? Thanks.

NsLearning commented 11 months ago

@shirubei @treya-lin @mikeyang01 Hi all. The Chinese corpus was AISHELL-3 and the English corpus was VCTK. For the bilingual speech I captured Microsoft's TTS output; you can use edge-tts for this, just mind the request rate. The bulk of the target-voice audio was generated with Tortoise TTS, which can clone a voice from just a few clips; the downside is that it's slow. I then fed all of this into the VITS fine-tuning. This approach is just my own tinkering, and honestly the results are pretty mediocre.
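(For the edge-tts part, a minimal sketch assuming the `edge-tts` Python package, `pip install edge-tts`; the voice names and the throttling interval are only examples, not what was actually used here.)

```python
import asyncio
import edge_tts

# Example bilingual sentences to synthesize as extra training material.
SENTENCES = [
    ("zh-CN-XiaoxiaoNeural", "辅助训练数据可以用在线TTS批量生成。"),
    ("en-US-AriaNeural", "Auxiliary training data can be generated with an online TTS."),
]

async def main() -> None:
    for i, (voice, text) in enumerate(SENTENCES):
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(f"edge_tts_{i:05d}.mp3")
        # Throttle requests to respect the service's rate limits,
        # as suggested above; the exact interval is a guess.
        await asyncio.sleep(1.0)

asyncio.run(main())
```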

shirubei commented 11 months ago

> Hi all. The Chinese corpus was AISHELL-3 and the English corpus was VCTK. For the bilingual speech I captured Microsoft's TTS output; you can use edge-tts for this, just mind the request rate. The bulk of the target-voice audio was generated with Tortoise TTS […]

Thanks very much for the reply. Two weeks ago I trained with standard Mandarin (the Biaobei corpus, about 10k clips) and standard English (LJ Speech, also a sizable amount of data), and afterwards both Chinese and English improved substantially. One more observation: certain Chinese words still never reach a satisfying pronunciation; if you listen closely they still sound a bit like a foreigner speaking, which I suspect comes down to the base model.