💡 TTS few-shot finetune / voice cloning: issue roundup

PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.

https://paddlespeech.readthedocs.io

Apache License 2.0

💡 TTS few-shot finetune / voice cloning: issue roundup #2456

Open yt605155624 opened 2 years ago

yt605155624 commented 2 years ago

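If finetuning on 12 utterances gives poor results, the usual cause is that the dataset is too small. We recommend adding data, typically 300 ~ 600 utterances: the more data you have and the higher its quality, the better the synthesized speech. For quality, the recordings should have no reverberation and no background noise, and the speaker should be at a moderate distance from the microphone; the Databaker (标贝) data is a good reference for the expected quality.

The finetuned timbre depends on how similar the target speaker is to the original speaker: the more similar they are, the closer the finetuned timbre is to the target speaker. The finetuned audio quality depends on the original speaker's audio quality: if the original speaker's audio is poor, the finetuned result may be poor as well. In short, the finetune approach requires care both in collecting data and in choosing the original speaker.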
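A quick way to check a dataset against the size recommendation above is to count its recordings. This is only a sketch: it assumes the utterances are `.wav` files sitting directly in one directory, and the helper names `count_utterances` and `enough_data` are made up for illustration, not part of PaddleSpeech.

```python
from pathlib import Path

# Lower bound taken from the 300 ~ 600 utterance recommendation above.
RECOMMENDED_MIN = 300


def count_utterances(data_dir):
    """Count .wav files directly inside data_dir (assumed flat layout)."""
    return sum(1 for p in Path(data_dir).iterdir() if p.suffix.lower() == ".wav")


def enough_data(n_utterances, minimum=RECOMMENDED_MIN):
    """Return True if the utterance count reaches the recommended minimum."""
    return n_utterances >= minimum


if __name__ == "__main__":
    n = count_utterances(".")
    print(f"{n} utterances, enough for finetuning: {enough_data(n)}")
```

The count says nothing about quality; reverberation, noise and microphone distance still have to be checked by listening.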

For the principles behind few-shot finetuning, see "About training your own TTS model".

https://github.com/PaddlePaddle/PaddleSpeech/issues/2437
https://github.com/PaddlePaddle/PaddleSpeech/issues/2454
https://github.com/PaddlePaddle/PaddleSpeech/issues/2383
Preprocessing finished without problems, so why doesn't the training step run? -> The epoch setting is wrong, see: https://github.com/PaddlePaddle/PaddleSpeech/issues/2319#issuecomment-1231618015
https://github.com/PaddlePaddle/PaddleSpeech/issues/2442
https://github.com/PaddlePaddle/PaddleSpeech/issues/2471 -> install the develop version of paddlespeech
https://github.com/PaddlePaddle/PaddleSpeech/issues/2245
https://github.com/PaddlePaddle/PaddleSpeech/issues/2485 -> install the develop version of paddlespeech
https://github.com/PaddlePaddle/PaddleSpeech/issues/2583 -> the finetune approach is recommended
https://github.com/PaddlePaddle/PaddleSpeech/issues/2586
https://github.com/PaddlePaddle/PaddleSpeech/issues/2607
https://github.com/PaddlePaddle/PaddleSpeech/issues/2790
https://github.com/PaddlePaddle/PaddleSpeech/issues/2953

UserName-wang commented 2 years ago

./run.sh --stage 0 --stop-stage 5
check oov
get mfa result
sh: 1: mfa_align: Exec format error
generate durations.txt
extract feature
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] [Errno 111] Connection refused>
[nltk_data] Error loading cmudict: <urlopen error [Errno 111]
[nltk_data] Connection refused>
196 1
100%|███████████████████████████████████████████████████████████████████████████████████| 196/196 [00:00<00:00, 5146.26it/s]
Done
Traceback (most recent call last):
  File "local/extract_feature.py", line 346, in <module>
    extract_feature(
  File "local/extract_feature.py", line 266, in extract_feature
    normalize(speech_scaler, pitch_scaler, energy_scaler, vocab_phones,
  File "local/extract_feature.py", line 155, in normalize
    dataset = DataTable(
  File "/home/nx/study/python/Paddle24/PaddleSpeech/paddlespeech/t2s/datasets/data_table.py", line 45, in __init__
    assert len(data) > 0, "This dataset has no examples"
AssertionError: This dataset has no examples

The code in File "/home/nx/study/python/Paddle24/PaddleSpeech/paddlespeech/t2s/datasets/data_table.py", line 45:
    self.data = data
    assert len(data) > 0, "This dataset has no examples"

yt605155624 commented 2 years ago

@UserName-wang follow https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md to download nltk_data to your ${HOME}
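To confirm the download actually landed where NLTK looks, one can check for the two resources named in the `[nltk_data] Error loading ...` lines. The sketch below assumes the standard NLTK layout (`taggers/averaged_perceptron_tagger` and `corpora/cmudict` under `~/nltk_data`, either unpacked or as `.zip` archives); `missing_nltk_resources` is a hypothetical helper, not something PaddleSpeech provides.

```python
from pathlib import Path

# Resources named in the "[nltk_data] Error loading ..." lines of the log.
# The category directories follow the usual NLTK layout (an assumption here).
REQUIRED = {
    "taggers": "averaged_perceptron_tagger",
    "corpora": "cmudict",
}


def missing_nltk_resources(nltk_root):
    """Return the required resources that are absent under nltk_root."""
    root = Path(nltk_root)
    missing = []
    for category, name in REQUIRED.items():
        base = root / category / name
        # Accept either an unpacked directory or the downloaded .zip archive.
        if not base.exists() and not base.with_suffix(".zip").exists():
            missing.append(f"{category}/{name}")
    return missing


if __name__ == "__main__":
    print(missing_nltk_resources(Path.home() / "nltk_data"))
```

An empty list only means the files are present; it does not rule out other causes of `This dataset has no examples`, such as the `mfa_align: Exec format error` earlier in the same log.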

exceedzhang commented 1 year ago

I followed PaddleSpeech/examples/other/tts_finetune/tts3 to run few-shot training.

Running run_mix.sh gives the following error:
root@autodl-container-9db311a83c-4d0bf061:~/autodl-tmp/PaddleSpeech/examples/other/tts_finetune/tts3# ./run_mix.sh
check oov
get mfa result
align.py:60: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 12.0
/root/autodl-tmp/PaddleSpeech/examples/other/tts_finetune/tts3/tools/montreal-forced-aligner/lib/aligner/models.py:87: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Creating dictionary information...
Setting up training data...
Calculating MFCCs...
Calculating CMVN...
Number of speakers in corpus: 1, average number of utterances per speaker: 12.0
Done with setup.
100%|########################################################################################################| 2/2 [00:02<00:00, 1.01s/it]
Done! Everything took 6.651328802108765 seconds
generate durations.txt
Traceback (most recent call last):
  File "local/generate_duration.py", line 38, in <module>
    gen_duration_from_textgrid(mfa_dir, duration_file, fs, n_shift)
  File "/root/autodl-tmp/PaddleSpeech/utils/gen_duration_from_textgrid.py", line 76, in gen_duration_from_textgrid
    durations_dict[name] = (speaker, readtg(
  File "/root/autodl-tmp/PaddleSpeech/utils/gen_duration_from_textgrid.py", line 29, in readtg
    for interval in alignment.tierDict["phones"].entryList:
AttributeError: 'Textgrid' object has no attribute 'tierDict'

I am using Python 3.8.

zhouzyc commented 1 year ago

I am on Python 3.7.9 and hit the same problem, on Ubuntu 22. Has this been solved?

maize-j commented 1 year ago

Check whether your praatio version is 5.0.0.

yt605155624 commented 1 year ago

@zhouzyc @maize-j This is probably caused by a backward-incompatible praatio upgrade: https://github.com/timmahrt/praatIO/blob/main/UPGRADING.md#version-5-to-6-migration
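For anyone who cannot pin the praatio version, a small accessor that tolerates both APIs is one possible workaround. This is a sketch, not the fix applied in PaddleSpeech: `phone_intervals` is a hypothetical helper, and the 6.x branch (`getTier()` and `.entries`) reflects my reading of the upgrade guide linked above, so verify it against the installed version.

```python
def phone_intervals(alignment, tier_name="phones"):
    """Return the interval list of one tier on praatio 5.x or 6.x."""
    if hasattr(alignment, "tierDict"):
        # praatio 5.x: the access pattern used in the traceback above.
        return alignment.tierDict[tier_name].entryList
    # praatio 6.x (assumed from the upgrade guide): getTier() and .entries.
    return alignment.getTier(tier_name).entries
```

In `readtg`, the failing loop would then iterate over `phone_intervals(alignment)` instead of `alignment.tierDict["phones"].entryList`.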

maize-j commented 1 year ago

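Yes. A fresh install now pulls praatio 6.0.0 by default, and that release is not backward compatible, which is what triggers this error. Switching back to 5.0.0 fixes it.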
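Before downgrading (for example with `pip install praatio==5.0.0`), it can help to confirm which version is actually installed in the environment that runs the scripts. A minimal sketch using only the standard library; the helper names are illustrative, and treating every release below 6 as compatible is an assumption, since the thread only confirms 5.0.0.

```python
from importlib import metadata


def praatio_major(version_string):
    """Extract the major version number from a string such as '6.0.0'."""
    return int(version_string.split(".")[0])


def has_old_textgrid_api(version_string):
    """True if the version predates the 6.x API change (assumed cutoff)."""
    return praatio_major(version_string) < 6


if __name__ == "__main__":
    try:
        installed = metadata.version("praatio")
    except metadata.PackageNotFoundError:
        print("praatio is not installed")
    else:
        status = "ok" if has_old_textgrid_api(installed) else "too new for tierDict"
        print(f"praatio {installed}: {status}")
```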

yt605155624 commented 1 year ago

fixed by https://github.com/PaddlePaddle/PaddleSpeech/pull/2970

Rapheal-Madfrog commented 1 year ago

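In Docker, get_frontend has a step that downloads a 589 MB file, probably a BERT checkpoint. It is downloaded again every time I start the image, and I could not find the relevant code anywhere in the project. Where is this 589 MB file downloaded from, what is it for, and where should it go? I would like to download it locally and mount it into the container so it is not fetched every time.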

Rapheal-Madfrog commented 1 year ago

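Solved: mount the three folders under /root/ in the container, namely nltk_data, .paddlenlp and .paddlespeech. The 589 MB file is G2PWModel_1.1.zip. You cannot keep only the extracted G2PWModel_1.1/ directory and delete the zip; if the zip is deleted, it is downloaded again.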
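The mounts described above can be generated rather than typed by hand. The sketch below builds the `-v host:container` arguments for the three cache folders; the host directory `/data/paddle_cache` and the image name `paddlespeech-image` are placeholders, and `volume_args` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# The three cache folders under /root/ named in the comment above.
CACHE_DIRS = ["nltk_data", ".paddlenlp", ".paddlespeech"]


def volume_args(host_cache_root, container_home="/root"):
    """Build docker `-v host:container` arguments for each cache folder."""
    args = []
    for name in CACHE_DIRS:
        host_path = PurePosixPath(host_cache_root) / name
        args += ["-v", f"{host_path}:{container_home}/{name}"]
    return args


if __name__ == "__main__":
    # Both the host path and the image name are placeholders.
    cmd = ["docker", "run", "-it", *volume_args("/data/paddle_cache"), "paddlespeech-image"]
    print(" ".join(cmd))
```

After the first run populates the host folders, later containers reuse the downloaded files, including G2PWModel_1.1.zip, as long as the zip itself is left in place.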

joisonwk commented 1 year ago

./run.sh --stage 0 --stop-stage 5
check oov
get mfa result
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 688.0
Creating dictionary information...
Setting up corpus_data directory...
Generating base features (mfcc)...
Calculating CMVN...
Done with setup.
There were 1 segments/files not aligned. Please see ./mfa_result/unaligned.txt for more details on why alignment failed for these files.
Done! Everything took 53.481459617614746 seconds
generate durations.txt
extract feature
686 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 686/686 [00:00<00:00, 8198.77it/s]
Done
Traceback (most recent call last):
  File "local/extract_feature.py", line 346, in <module>
    extract_feature(
  File "local/extract_feature.py", line 266, in extract_feature
    normalize(speech_scaler, pitch_scaler, energy_scaler, vocab_phones,
  File "local/extract_feature.py", line 155, in normalize
    dataset = DataTable(
  File "/mnt/d/voice/PaddleSpeech/paddlespeech/t2s/datasets/data_table.py", line 47, in __init__
    assert len(data) > 0, "This dataset has no examples"
AssertionError: This dataset has no examples

(venv) ant@DESKTOP-MEKU9AN:/mnt/d/voice/PaddleSpeech/examples/other/tts_finetune/tts3$ ls ~/nltk_data/
corpora taggers

@UserName-wang follow https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md to download nltk_data to your ${HOME}

I have already downloaded nltk_data to my home directory, but I still get this error. What could be the cause?