How do I fine-tune on my own data?

scriptboy1990 commented 1 year ago

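I followed your bert_vits_aishell3 branch and trained a multi-speaker base model on the aishell3 dataset. Now I want to fine-tune that base model on my own single-speaker dataset. How should I go about it?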

MaxMax2016 commented 1 year ago

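Change "n_speakers": 174 to "n_speakers": 0,

and change "gin_channels": 256 to "gin_channels": 0.

With these settings, the speaker-related parts are dropped when the multi-speaker pretrained model is loaded.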
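The two config changes above could be scripted roughly as follows. This is only a sketch: the helper names `set_key_everywhere` and `make_single_speaker_config` are made up for illustration, and because the thread does not show where the two keys sit inside the JSON file, the code searches for them recursively instead of assuming a layout.

```python
import json


def set_key_everywhere(node, key, value):
    """Recursively set `key` to `value` wherever it appears; return the hit count."""
    hits = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                hits += 1
            else:
                hits += set_key_everywhere(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            hits += set_key_everywhere(item, key, value)
    return hits


def make_single_speaker_config(src_path, dst_path):
    """Write a copy of the config with n_speakers and gin_channels set to 0."""
    with open(src_path, encoding="utf-8") as f:
        config = json.load(f)
    for key in ("n_speakers", "gin_channels"):
        # Fail loudly if the key is missing rather than silently writing an unchanged file.
        if set_key_everywhere(config, key, 0) == 0:
            raise KeyError(f"{key!r} not found in {src_path}")
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
    return config
```

Writing to a separate output path keeps the original multi-speaker config available for comparison.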

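The rest of training should be much the same as before. You need to add code to the training script that loads the pretrained model, as follows:

https://github.com/PlayVoice/vits_chinese/blob/bert_vits_aishell3/train.py#L124 Load the pretrained net_g and net_d like this:

utils.load_model("AISHELL3_G.pth", net_g)
utils.load_model("AISHELL3_D.pth", net_d)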
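To illustrate what "dropping the speaker-related parts" can mean in practice, here is a generic sketch of partial weight loading. It is not the repository's `utils.load_model`, whose internals are not shown in this thread. The function name `select_compatible` is hypothetical, and it only assumes that each value exposes a `.shape` attribute, as PyTorch tensors do.

```python
def select_compatible(pretrained, target):
    """Split pretrained entries into those the target model can accept and the rest.

    An entry is kept only if the target has a parameter with the same name
    and the same shape. Returns (kept, skipped_names).
    """
    kept, skipped = {}, []
    for name, value in pretrained.items():
        if name in target and tuple(value.shape) == tuple(target[name].shape):
            kept[name] = value
        else:
            # E.g. a speaker embedding that no longer exists when n_speakers is 0.
            skipped.append(name)
    return kept, skipped
```

With PyTorch, `target` would typically be `net_g.state_dict()`, and the kept entries could then be passed to `net_g.load_state_dict(kept, strict=False)`. Printing `skipped` is a cheap way to confirm that only speaker-related parameters were left out.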

scriptboy1990 commented 1 year ago

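Thanks a lot for the quick reply! My dataset contains only audio. To train on it, do I have to handle the front-end processing myself, or does this project include the full pipeline from raw audio to training data?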

MaxMax2016 commented 1 year ago

No, it does not. You can use https://github.com/Plachtaa/VITS-fast-fine-tuning or https://github.com/Fatfish588/Dataset_Generator_For_VITS to obtain annotations for your audio.

scriptboy1990 commented 1 year ago

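Hi, I followed some of the preprocessing steps from the fast fine-tuning link above:

python scripts/denoise_audio.py
python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/resample.py

This produced samples for a single speaker. I then converted the output train.txt and valid.txt to the same format as the files in your repo. That covers the data side.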

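On the code side, I followed the steps above, changed n_speakers and gin_channels, and ran python train.py -c configs/bert_vits.json -m bert_vits. First error: some phonemes are not in _symbol_to_id. For now I have removed the samples that trigger it. Second error: shown in the screenshot below. What causes it?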
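Removing samples with unknown phonemes, as described above, can be done before training starts instead of waiting for the error. The sketch below is a minimal illustration: `split_by_known_symbols` is a hypothetical helper, and parsing train.txt into (sample id, phoneme list) pairs is left out because the file format is not shown in this thread.

```python
def split_by_known_symbols(samples, known_symbols):
    """Separate samples into usable ones and ones containing unknown phonemes.

    samples: iterable of (sample_id, list_of_phonemes) pairs.
    known_symbols: the symbols the model knows, e.g. the keys of _symbol_to_id.
    Returns (usable_ids, rejected), where rejected maps each dropped sample id
    to the sorted list of phonemes that were missing.
    """
    known = set(known_symbols)
    usable, rejected = [], {}
    for sample_id, phonemes in samples:
        missing = sorted(set(phonemes) - known)
        if missing:
            rejected[sample_id] = missing
        else:
            usable.append(sample_id)
    return usable, rejected
```

The `rejected` report shows which phonemes are missing, which helps decide whether dropping the samples is acceptable or whether the annotation tool uses a different phoneme set that should be mapped instead.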

MaxMax2016 commented 1 year ago

Some clips are too short, shorter than the training segment length: 12800 / 256 = 50 frames, i.e. 0.8 seconds at 16 kHz.

In https://github.com/PlayVoice/vits_chinese/blob/bert_vits_aishell3/data_utils.py#L41

you can discard short clips based on the audio length, os.path.getsize(audiopath) // (2 * self.hop_length).
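A standalone sketch of that length check follows. It reuses the file-size formula quoted above and the numbers 12800 and 256 from this reply. The function names are hypothetical, and the estimate assumes 16-bit mono PCM files (2 bytes per sample); the small WAV header is counted as audio, so the result can be off by a fraction of a frame.

```python
import os


def estimated_frames(audiopath, hop_length=256, bytes_per_sample=2):
    """Estimate the number of spectrogram frames from the file size alone."""
    return os.path.getsize(audiopath) // (bytes_per_sample * hop_length)


def drop_short_clips(audiopaths, segment_size=12800, hop_length=256):
    """Keep only clips that are at least one training segment long."""
    min_frames = segment_size // hop_length  # 12800 // 256 = 50 frames
    return [p for p in audiopaths if estimated_frames(p, hop_length) >= min_frames]
```

Running this over the file list once, before training, avoids hitting the slicing error partway through an epoch.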

scriptboy1990 commented 1 year ago

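OK, I removed the clips shorter than 1 second and training now runs. One thing I am still unsure about: after training this way (say I fine-tune on three new speakers), will the fine-tuned model contain only the three new speakers, or the 174 aishell3 speakers plus the 3 new ones?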

MaxMax2016 commented 1 year ago

Fine-tuning makes the model forget the original speakers.

scriptboy1990 commented 1 year ago

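With the steps above, training runs and checkpoints are written normally. For inference I set i=0 (since there is only one speaker) and a wav file was generated. The file opens and has a duration, but it is completely silent. I tried changing n_speakers from 0 to 1, and changing gin_channels from 0 back to 256; inference fails with an error in both cases.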

What could be causing this? I listened to the training data and every clip has sound. A screenshot of train.log is below.

My guess is that the training audio is the problem. Opening an aishell3 file with data = wave.open(file_name, mode = 'rb') returns normally:

_wave_params(nchannels=1, sampwidth=2, framerate=16000, nframes=87071, comptype='NONE', compname='not compressed')

But opening one of my preprocessed files raises an exception:

Traceback (most recent call last):
  File "check.py", line 21, in <module>
    data = wave.open(file_name, mode = 'rb')
  File "/root/conda/envs/vits_chinese/lib/python3.8/wave.py", line 510, in open
    return Wave_read(f)
  File "/root/conda/envs/vits_chinese/lib/python3.8/wave.py", line 164, in __init__
    self.initfp(f)
  File "/root/conda/envs/vits_chinese/lib/python3.8/wave.py", line 144, in initfp
    self._read_fmt_chunk(chunk)
  File "/root/conda/envs/vits_chinese/lib/python3.8/wave.py", line 269, in _read_fmt_chunk
    raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3

scriptboy1990 commented 1 year ago

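The silent-output problem is solved. The cause was that the preprocessing in the neighboring fast fine-tuning (fft) repo writes floating-point PCM by default, and that has to be changed.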
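Format tag 3 in a WAV header denotes IEEE floating-point samples, while tag 1 denotes integer PCM. A dataset can be scanned for float files with a short check like the one below. This is a minimal sketch using only the standard library: `wav_format_tag` is a hypothetical helper that reads the tag from the file's fmt chunk, and it does not handle every WAV variant.

```python
import struct


def wav_format_tag(path):
    """Return the wFormatTag of a RIFF/WAVE file (1 = integer PCM, 3 = IEEE float)."""
    with open(path, "rb") as f:
        head = f.read(12)
        if len(head) < 12:
            raise ValueError(f"{path} is too short to be a WAV file")
        riff, _, wave_id = struct.unpack("<4sI4s", head)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError(f"{path} is not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError(f"no fmt chunk found in {path}")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                # The format tag is the first 16-bit field of the fmt chunk.
                return struct.unpack("<H", f.read(2))[0]
            # Skip other chunks; chunk data is padded to an even byte count.
            f.seek(chunk_size + (chunk_size & 1), 1)
```

Files that report tag 3 need to be re-saved as 16-bit PCM before training. One option, assuming the soundfile package is installed, is to read the audio and write it back with `subtype="PCM_16"`.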

PlayVoice / vits_chinese

How do I fine-tune on my own data? #147