Plachtaa / VITS-fast-fine-tuning

This repo provides a pipeline for fine-tuning VITS, enabling fast speaker-adaptation TTS and many-to-many voice conversion.
Apache License 2.0

Hi author, if I add more Chinese and English datasets on top of your base model, will the resulting Chinese/English quality be better? #220

Closed NsLearning closed 11 months ago

NsLearning commented 1 year ago

Right now the Chinese/English accent doesn't sound great. I want to train a Chinese-English TTS — can I train directly on top of this base model?

Plachtaa commented 1 year ago

Yes, you can.

nekomiya-hinata commented 1 year ago

The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

NsLearning commented 1 year ago

> The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

How many Chinese and English clips did you use, respectively? How many steps per epoch, and after how many epochs did it sound good? I haven't tried it yet.

nekomiya-hinata commented 1 year ago

> The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

> How many Chinese and English clips did you use, respectively? How many steps per epoch, and after how many epochs did it sound good? I haven't tried it yet.

7,500 English clips and 13k Chinese clips, with no extra Japanese audio. batch_size is 16, roughly 1,395 steps per epoch. In my experience it already sounds quite good after about 60 epochs.
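(For reference, a rough sanity check on those numbers — my own back-of-the-envelope, not from this thread: steps per epoch is roughly the number of clips divided by the batch size, and the gap to the reported ~1,395 steps would be consistent with the auxiliary samples this pipeline draws from the pretraining set.)

```python
import math

en_clips, zh_clips = 7_500, 13_000
batch_size = 16

# Steps that the user's own data alone would account for.
own_steps = math.ceil((en_clips + zh_clips) / batch_size)
print(own_steps)                         # 1282

# The reported ~1395 steps/epoch would then imply roughly this many
# additional (auxiliary) clips in the training set -- an assumption.
print((1395 - own_steps) * batch_size)   # ~1800
```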

NsLearning commented 1 year ago

I fine-tuned with roughly 17k English + 20k Chinese clips, batch_size=32. After 69 epochs the results are as follows:

text -> [ZH]辅助训练数据是从预训练的大数据集抽样得到的,作用在于防止模型在标注不准确的数据上形成错误映射。[ZH][EN] To be honest, I have no idea what to say as examples. [EN]

(The Chinese sentence in the prompt says that the auxiliary training data is sampled from the large pretraining set to keep the model from learning wrong mappings on inaccurately labeled data.)

English reference audio: https://github.com/NsLearning/tts-work/blob/main/p226_003_mic1.flac
English synthesized sample: https://github.com/NsLearning/tts-work/blob/main/p226-77k.wav
Chinese reference audio: https://github.com/NsLearning/tts-work/blob/main/SSB00050353.wav
Chinese synthesized sample: https://github.com/NsLearning/tts-work/blob/main/SSB0005-77K.wav

So far, Chinese has improved a lot and the timbre is fairly close, but the English is still spoken too fast, with many words slurred or dropped, and the timbre doesn't match; presumably that's because the base model's (CJE) English isn't very good? To get a decent Chinese-English VITS, should I keep fine-tuning, or train a Chinese-English base model from scratch? Would appreciate your advice @Plachtaa
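(For anyone reproducing the batch_size=32 setting, a minimal sketch; it assumes the repo's fine-tuning config follows the usual VITS layout with a `train.batch_size` field, and the config path below is only a guess — check it against your own `finetune_speaker.json`.)

```python
import json

cfg_path = "configs/finetune_speaker.json"  # assumed location of the fine-tuning config

with open(cfg_path, encoding="utf-8") as f:
    cfg = json.load(f)

# Standard VITS-style configs keep the batch size under the "train" section.
cfg["train"]["batch_size"] = 32  # the value used in the experiment above

with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```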

NsLearning commented 1 year ago

> The experiment does show a noticeable improvement. Even with the CJE base model, the Japanese-accent artifacts in synthesized Chinese and the English pronunciation both improved a lot. The downside is that training also takes significantly longer (with 20k+ training clips in total, one epoch takes about 12 minutes on a 2080 Ti).

> How many Chinese and English clips did you use, respectively? How many steps per epoch, and after how many epochs did it sound good? I haven't tried it yet.

> 7,500 English clips and 13k Chinese clips, with no extra Japanese audio. batch_size is 16, roughly 1,395 steps per epoch. In my experience it already sounds quite good after about 60 epochs.

How is your English quality? I feel my English output is still not great, and I used 17k English clips.

Plachtaa commented 1 year ago

> I fine-tuned with roughly 17k English + 20k Chinese clips, batch_size=32. After 69 epochs the results are as follows: […] To get a decent Chinese-English VITS, should I keep fine-tuning, or train a Chinese-English base model from scratch? Would appreciate your advice @Plachtaa

My suggestion is to avoid training too many languages in a single VITS model. One model per language gives the best results, because a small model has limited capacity.

NsLearning commented 1 year ago

> I fine-tuned with roughly 17k English + 20k Chinese clips, batch_size=32. After 69 epochs the results are as follows: […] To get a decent Chinese-English VITS, should I keep fine-tuning, or train a Chinese-English base model from scratch? Would appreciate your advice @Plachtaa

> My suggestion is to avoid training too many languages in a single VITS model. One model per language gives the best results, because a small model has limited capacity.

So if I do want a Chinese-English bilingual model, it would be best to retrain the base model from scratch, right? After all, Japanese takes up a share of the capacity.

NsLearning commented 1 year ago

Thanks for the great work! I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable; samples below:

https://github.com/NsLearning/tts-work/blob/main/Obama_EN_1.wav
https://github.com/NsLearning/tts-work/blob/main/Obama_EN_2.wav
https://github.com/NsLearning/tts-work/blob/main/Obama_ZH_1.wav
https://github.com/NsLearning/tts-work/blob/main/Obama_ZH_2.wav

Chinese TTS now handles Chinese text better, and the same goes for English TTS, so synthesizing both ZH and EN is no longer bad. Very grateful to the project author. I know there is still a lot of room for improvement; I'll try other methods next time.
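(To make the two-stage setup concrete, a minimal sketch of how the stage-2 training filelist could be assembled; it assumes this repo's `wav_path|speaker_name|[ZH]text[ZH]`-style annotation format, and all filenames here are hypothetical.)

```python
# Hypothetical sketch: combine the bilingual 9k-clip set and the target-voice
# clips with the stage-1 data before the second fine-tuning run.
stage1 = "filelists/stage1_zh_en_anno.txt"      # 17k EN + 20k ZH (hypothetical name)
stage2_extra = [
    "filelists/bilingual_9k_anno.txt",          # speakers covering both ZH and EN
    "filelists/target_voice_anno.txt",          # target-timbre clips
]

lines = []
for path in [stage1] + stage2_extra:
    with open(path, encoding="utf-8") as f:
        lines.extend(l.strip() for l in f if l.strip())

with open("filelists/stage2_train_anno.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
print(f"{len(lines)} training entries written")
```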

treya-lin commented 1 year ago

> Thanks for the great work! I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable […]

Hi, I'd like to ask: is your training pipeline like this? Stage 1: 17k English-only + 20k Chinese-only data, 200 epochs. Stage 2: 9k mixed Chinese-English data + [some of the stage-1 data, or additional Chinese-only / English-only speakers?], 100 epochs. It's mainly the stage-2 data I don't quite follow — could you describe its composition in more detail? Was the Obama data also added only in stage 2?

mikeyang01 commented 1 year ago

> I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable; samples below:

Hi, what data are you using? Is the 17k English data the VCTK corpus?

shirubei commented 11 months ago

> I first fine-tuned with the 17k English and 20k Chinese datasets for nearly 200 epochs to improve Chinese-English pronunciation, then added another 9k-clip dataset of speakers who speak both Chinese and English, plus some target-voice audio. After another 100 epochs of training I think the result is acceptable; samples below:

> Hi, what data are you using? Is the 17k English data the VCTK corpus?

Same question here. Also, could you share the concrete steps? Thanks.

NsLearning commented 11 months ago

@shirubei @treya-lin @mikeyang01 Hi all. The Chinese corpus was AISHELL-3 and the English corpus was VCTK. For the bilingual speech I captured Microsoft's TTS output; you can use edge-tts for this, just mind the request rate. The bulk of the target-voice audio was generated with Tortoise TTS, which can clone a voice from just a few clips; the downside is that it's slow. I then fed all of this into the VITS fine-tuning. This approach is just my own tinkering, and honestly the results are pretty mediocre.
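(For the edge-tts part, a minimal sketch assuming the `edge-tts` Python package, `pip install edge-tts`; the voice names and the throttling interval are only examples, not what was actually used here.)

```python
import asyncio
import edge_tts

# Example bilingual sentences to synthesize as extra training material.
SENTENCES = [
    ("zh-CN-XiaoxiaoNeural", "辅助训练数据可以用在线TTS批量生成。"),
    ("en-US-AriaNeural", "Auxiliary training data can be generated with an online TTS."),
]

async def main() -> None:
    for i, (voice, text) in enumerate(SENTENCES):
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(f"edge_tts_{i:05d}.mp3")
        # Throttle requests to respect the service's rate limits,
        # as suggested above; the exact interval is a guess.
        await asyncio.sleep(1.0)

asyncio.run(main())
```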

shirubei commented 11 months ago

> Hi all. The Chinese corpus was AISHELL-3 and the English corpus was VCTK. For the bilingual speech I captured Microsoft's TTS output; you can use edge-tts for this, just mind the request rate. The bulk of the target-voice audio was generated with Tortoise TTS […]

Thanks very much for the reply. Two weeks ago I trained with standard Mandarin (the Biaobei corpus, about 10k clips) and standard English (LJ Speech, also a sizable amount of data), and afterwards both Chinese and English improved substantially. One more observation: certain Chinese words still never reach a satisfying pronunciation; if you listen closely they still sound a bit like a foreigner speaking, which I suspect comes down to the base model.