PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
https://paddlespeech.readthedocs.io
Apache License 2.0

Adding my own audio data to aishell3 for training #2319

Closed sixyang closed 2 years ago

sixyang commented 2 years ago

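As the title says: for TTS, I want to add some extra data (same sample rate) to the aishell3 data for training. For the AM and the VOC, do I need anything besides the 'text content' and the 'audio data'? I saw in other issues that you can simply add a speaker_id to the aishell3 data. Could you roughly describe the remaining steps? Thanks a lot!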

sixyang commented 2 years ago

I found that documentation already exists: the AM finetune doc and the VOC finetune doc. Thanks a lot!

sixyang commented 2 years ago

I ran into a problem with mfa_align. The error is shown below:

root@container-49581189ae-dcc2b933:~/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3# ./run.sh 
/root/miniconda3/lib/python3.8/site-packages/librosa/core/constantq.py:1059: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
mfa_align input/csmsc_mini/newdir tools/aligner/simple.lexicon tools/aligner/aishell3_model.zip mfa_result
align.py:60: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 198.0
/root/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3/tools/montreal-forced-aligner/lib/aligner/models.py:87: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Creating dictionary information...
Setting up training data...
Calculating MFCCs...
Traceback (most recent call last):
  File "aligner/command_line/align.py", line 186, in <module>
  File "aligner/command_line/align.py", line 142, in validate_args
  File "aligner/command_line/align.py", line 94, in align_corpus
  File "aligner/aligner/pretrained.py", line 74, in __init__
  File "aligner/aligner/pretrained.py", line 122, in setup
  File "aligner/aligner/base.py", line 89, in setup
  File "aligner/corpus.py", line 979, in initialize_corpus
  File "aligner/corpus.py", line 852, in create_mfccs
  File "aligner/corpus.py", line 863, in _combine_feats
FileNotFoundError: [Errno 2] No such file or directory: '/root/Documents/MFA/newdir/train/mfcc/raw_mfcc.0.scp'
[54706] Failed to execute script align
159 20
100%|██████████████████████████████████████████████████████████████████████████| 159/159 [00:00<00:00, 18030.51it/s]
Done
Traceback (most recent call last):
  File "finetune.py", line 179, in <module>
    extract_feature(duration_file, config, input_dir, dump_dir,
  File "/root/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3/local/extract.py", line 251, in extract_feature
    normalize(speech_scaler, pitch_scaler, energy_scaler, vocab_phones,
  File "/root/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3/local/extract.py", line 141, in normalize
    dataset = DataTable(
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/datasets/data_table.py", line 45, in __init__
    assert len(data) > 0, "This dataset has no examples"
AssertionError: This dataset has no examples

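I traced the problem to the MFCC computation step and found a fix by searching online (see 14991).

In short: do not use the bundled libopenblas.so.0; install the latest package instead.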

yt605155624 commented 2 years ago

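"I found that documentation already exists: the AM finetune doc and the VOC finetune doc. Thanks a lot!"

The VOC finetune mentioned here does not mean fine-tuning with your own dataset. It refers to the fine-tuning described in the HiFi-GAN paper, which uses mels generated by the AM.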

sixyang commented 2 years ago

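"I found that documentation already exists: the AM finetune doc and the VOC finetune doc. Thanks a lot!"

"The VOC finetune mentioned here does not mean fine-tuning with your own dataset. It refers to the fine-tuning described in the HiFi-GAN paper, which uses mels generated by the AM."

So to fine-tune with my own data added, the AM doc alone is enough, right?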

sixyang commented 2 years ago

When I ran it with my own data added, I got the error below. What is the problem?

root@container-49581189ae-dcc2b933:~/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3# ./run.sh 
/root/miniconda3/lib/python3.8/site-packages/librosa/core/constantq.py:1059: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
align.py:60: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 16.0
/root/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3/tools/montreal-forced-aligner/lib/aligner/models.py:87: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Creating dictionary information...
Setting up training data...
Calculating MFCCs...
Calculating CMVN...
Number of speakers in corpus: 1, average number of utterances per speaker: 16.0
Done with setup.
100%|#######################################################################################################| 2/2 [00:02<00:00,  1.28s/it]
Done! Everything took 11.016074180603027 seconds
13 2
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00,  4.51it/s]
Done
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 601.37it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.51it/s]
Done
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 337.56it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.54it/s]
Done
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 409.64it/s]
rank: 0, pid: 325888, parent_pid: 325882
multiple speaker fastspeech2!
spk_num: 174
samplers done!
dataloaders done!
vocab_size: 306
W0828 18:51:54.657121 325888 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.5, Runtime API Version: 11.2
W0828 18:51:54.660507 325888 device_context.cc:465] device: 0, cuDNN Version: 8.1.
model done!
optimizer done!
Exception in main training loop: 
Traceback (most recent call last):
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/trainer.py", line 149, in run
    update()
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/updaters/standard_updater.py", line 107, in update
    batch = self.read_batch()
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/updaters/standard_updater.py", line 180, in read_batch
    batch = next(self.train_iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 722, in __next__
    six.reraise(*sys.exc_info())
  File "/root/miniconda3/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__
    data = self._reader.read_next_var_list()
Trainer extensions will try to handle the extension. Then all extensions will finalize.Traceback (most recent call last):
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/updaters/standard_updater.py", line 177, in read_batch
    batch = next(self.train_iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 722, in __next__
    six.reraise(*sys.exc_info())
  File "/root/miniconda3/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__
    data = self._reader.read_next_var_list()
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "finetune.py", line 194, in <module>
    train_sp(train_args, config)
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/exps/fastspeech2/train.py", line 165, in train_sp
    trainer.run()
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/trainer.py", line 198, in run
    six.reraise(*exc_info)
  File "/root/miniconda3/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/trainer.py", line 149, in run
    update()
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/updaters/standard_updater.py", line 107, in update
    batch = self.read_batch()
  File "/root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/training/updaters/standard_updater.py", line 180, in read_batch
    batch = next(self.train_iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 722, in __next__
    six.reraise(*sys.exc_info())
  File "/root/miniconda3/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/root/miniconda3/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__
    data = self._reader.read_next_var_list()
StopIteration

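I recorded my data myself with a microphone, then used ffmpeg to convert it from m4a to wav; the frame_rate is 44.1 kHz, the same as aishell3. Because uploads are limited to 25 MB, here is a Baidu Netdisk link (https://pan.baidu.com/s/1fW_zLvWFu4u61QK_j5dxfA extraction code: 64gu)

Before this, I ran the pipeline with your csmsc_mini data and it went through, but it fails with my own data. I deleted a few of the csmsc entries and reran, and it still ran fine, so I suspect it is not a caching problem. Thanks!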

lym0302 commented 2 years ago

Try making batch_size smaller, e.g. 4, because you only have a dozen or so utterances. The default batch_size is 64.

sixyang commented 2 years ago

"Try making batch_size smaller, e.g. 4, because you only have a dozen or so utterances. The default batch_size is 64."

It works! Thanks a lot!
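
A minimal sketch of why the original run could end in StopIteration, assuming the training sampler discards incomplete batches (a drop_last-style setting; this is an assumption about the training setup, not something the log shows). The helper name num_batches is hypothetical, and 13 is taken from the first progress bar in the log above, which appears to be the training split.

```python
def num_batches(num_examples: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of batches that one pass over the data yields."""
    full, remainder = divmod(num_examples, batch_size)
    if drop_last or remainder == 0:
        return full
    # Keep the final, smaller batch when incomplete batches are not dropped.
    return full + 1


if __name__ == "__main__":
    # 13 training utterances (assumed from the log) with the default batch_size of 64.
    print(num_batches(13, 64))  # 0: the iterator is exhausted on the first next()
    # The same data with batch_size reduced to 4.
    print(num_batches(13, 4))   # 3
```

Under that assumption, any batch_size larger than the number of training utterances leaves the loader with nothing to yield, which would match the fix suggested above.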

sixyang commented 2 years ago

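A question: I clearly set stop_stage fairly high, but the training stage does not run at all. Why is that? No matter what I set stop_stage to, it makes no difference. The model being loaded is snapshot_iter_96800.pdz, which I had already trained a little myself; the one you originally provided is snapshot_iter_96400.pdz.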

root@container-49581189ae-dcc2b933:~/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3# ./run.sh 
/root/miniconda3/lib/python3.8/site-packages/librosa/core/constantq.py:1059: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
align.py:60: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 83.0
/root/autodl-tmp/PaddleSpeech-develop/examples/other/tts_finetune/tts3/tools/montreal-forced-aligner/lib/aligner/models.py:87: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Creating dictionary information...
Setting up training data...
Calculating MFCCs...
Calculating CMVN...
Number of speakers in corpus: 1, average number of utterances per speaker: 83.0
Done with setup.
100%|#####################################################################################################| 2/2 [00:04<00:00,  2.13s/it]
Done! Everything took 22.48857855796814 seconds
67 9
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:08<00:00,  7.87it/s]
Done
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:00<00:00, 579.46it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00,  9.17it/s]
Done
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 185.14it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.90it/s]
Done
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 519.07it/s]
rank: 0, pid: 116695, parent_pid: 116692
multiple speaker fastspeech2!
spk_num: 174
samplers done!
dataloaders done!
vocab_size: 306
W0830 16:50:32.274386 116695 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.5, Runtime API Version: 11.2
W0830 16:50:32.277689 116695 device_context.cc:465] device: 0, cuDNN Version: 8.1.
model done!
optimizer done!
in hifigan syn_e2e
/root/miniconda3/lib/python3.8/site-packages/librosa/core/constantq.py:1059: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
========Args========
am: fastspeech2_aishell3
am_ckpt: ./exp/default/checkpoints/snapshot_iter_97800.pdz
am_config: ./pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/default.yaml
am_stat: ./pretrained_models/fastspeech2_aishell3_ckpt_1.1.0/speech_stats.npy
inference_dir: null
lang: zh
ngpu: 1
output_dir: ./test_e2e
phones_dict: ./dump/phone_id_map.txt
speaker_dict: ./dump/speaker_id_map.txt
spk_id: 0
text: /root/autodl-tmp/PaddleSpeech-develop/paddlespeech/t2s/exps/fastspeech2/../sentences.txt
tones_dict: null
voc: hifigan_aishell3
voc_ckpt: pretrained_models/hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz
voc_config: pretrained_models/hifigan_aishell3_ckpt_0.2.0/default.yaml
voc_stat: pretrained_models/hifigan_aishell3_ckpt_0.2.0/feats_stats.npy

sixyang commented 2 years ago

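Found the problem: the epoch parameter defaults to 100, and raising it fixes this. I am still curious, though. In principle this should simply continue training from the pretrained model, so stage and stop_stage should be what control the training process, yet here I additionally have to raise epoch.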

yt605155624 commented 2 years ago

The reason is that the epoch count stored in your snapshot_iter_96800.pdz has already met the requirement (it is greater than the epoch in default.yaml + args.epoch), so the program considers training finished https://github.com/PaddlePaddle/PaddleSpeech/blob/e147b96cf08df04f079105377d2348933dec5f0b/examples/other/tts_finetune/tts3/finetune.py#L150

You can paddle.load() snapshot_iter_96800.pdz and snapshot_iter_96400.pdz and look at the value of 'ckpt'
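
A rough sketch of the stopping rule described above, assuming the effective limit is the epoch value from default.yaml plus the epoch argument, and that training stops once the checkpoint's stored epoch has reached that limit. The function name and all numbers are hypothetical; this is not the code in finetune.py.

```python
def remaining_epochs(ckpt_epoch: int, yaml_max_epoch: int, extra_epochs: int) -> int:
    """Epochs left to run when resuming from a checkpoint.

    Assumption: the limit is yaml_max_epoch + extra_epochs, and nothing
    runs once the checkpoint's epoch has already reached that limit.
    """
    limit = yaml_max_epoch + extra_epochs
    return max(0, limit - ckpt_epoch)


if __name__ == "__main__":
    # Made-up numbers: a checkpoint already at the limit trains for 0 epochs.
    print(remaining_epochs(ckpt_epoch=1100, yaml_max_epoch=1000, extra_epochs=100))  # 0
    # Raising the epoch argument leaves room for more training.
    print(remaining_epochs(ckpt_epoch=1100, yaml_max_epoch=1000, extra_epochs=200))  # 100
```

Seen this way, stage and stop_stage decide which steps of run.sh execute, while the epoch argument decides whether the training step has anything left to do.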

sixyang commented 2 years ago

"The reason is that the epoch count stored in your snapshot_iter_96800.pdz has already met the requirement (it is greater than the epoch in default.yaml + args.epoch), so the program considers training finished

https://github.com/PaddlePaddle/PaddleSpeech/blob/e147b96cf08df04f079105377d2348933dec5f0b/examples/other/tts_finetune/tts3/finetune.py#L150

You can paddle.load() snapshot_iter_96800.pdz and snapshot_iter_96400.pdz and look at the value of 'ckpt'"

OK, thanks!