KevinWang676 / Bark-Voice-Cloning

Bark Voice Cloning and Voice Cloning for Chinese Speech
MIT License

Problem when running training on Colab #19

Closed waltcow closed 1 year ago

waltcow commented 1 year ago

The auto-labeling step completed without any problems:

2023-07-20 02:11:33,456 - modelscope - INFO - Use user-specified model revision: v1.0.5
---  New folder /content/output_training_data/paragraph/prosody...  ---
---  OK  ---
---  New folder /content/output_training_data/sp_interval...  ---
---  OK  ---
---  New folder /content/output_training_data/wav...  ---
---  OK  ---
--- Remove /content/output_training_data/log folder!  ---
---  New folder /content/output_training_data/log...  ---
---  OK  ---
2023-07-20 02:12:03
wav_preprocess start...
---  new folder...  ---
---  OK  ---
100%|██████████| 1/1 [00:00<00:00, 23.94it/s]wav cut by vad start...

100%|██████████| 1/1 [00:00<00:00,  2.59it/s]
100%|██████████| 1/1 [00:00<00:00,  2.00it/s]
Text to label start...
100%|██████████| 1/1 [00:00<00:00,  1.91it/s]
pre-break recording in paragraph by vad.
Generate phone interval by asr align.
---  New folder /content/output_training_data/align...  ---
---  OK  ---
prosody_dir=/content/output_training_data/paragraph/prosody
run_asr_align step 2
speak_script=/content/output_training_data/align/script.txt
job_num=1 process_num=4 fbank_config=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2/fbank.conf, data_dir=/content/output_training_data/align/gen/data, fbank_dir=/content/output_training_data/align/gen/fbank
run make_fbank with num=1 config_path=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2/fbank.conf
data_path=/content/output_training_data/align/gen/data fbank_path=/content/output_training_data/align/gen/fbank
[{'id': 'test_0_0', 'wav': '/content/output_training_data/wav_cut/test_0_0.wav'}]
100%|██████████| 1/1 [00:01<00:00,  1.90s/it]DONE compute fbank and copy feats
DONE!
job_num=1 process_num=4 data_dir=/content/output_training_data/align/gen/data lm_dir=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/lang am_dir=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2, fbank_dir=/content/output_training_data/align/gen/fbank, align_dir=/content/output_training_data/align/gen/align
[{'id': 'test_0_0', 'ark': '/content/output_training_data/align/gen/fbank/raw_fbank_data.test_0_0.ark', 'scp': '/content/output_training_data/align/gen/fbank/raw_fbank_data.test_0_0.scp'}]

Feature preprocessing start...
100%|██████████| 1/1 [00:05<00:00,  5.20s/it]Waveform aligning start...

100%|██████████| 1/1 [00:01<00:00,  1.62s/it]do_align done!
---  new folder...  ---
---  OK  ---
test_0_0.ali
Trim silence wav with align info and modify wav files....

100%|██████████| 1/1 [00:00<00:00, 80.69it/s]Convert align info to interval files....
---  There is this folder!  ---
test_0_0.ali
Modify sil to sp in interval....
modify interval er phone.
--- Remove /content/output_training_data/interval folder!  ---
---  New folder /content/output_training_data/interval...  ---
---  OK  ---
qualification review.
prosody sillence detect.
--- Remove /content/output_training_data/prosody folder!  ---
---  New folder /content/output_training_data/prosody...  ---
---  OK  ---

average silence duration: 0.3249999999999996
100%|██████████| 2/2 [00:00<00:00, 3506.94it/s]Write prosody file
0 "mismatch" sentences

Auto labeling info: stage 1 | develop mode 0 | gender:female | score 10.000000 | retcode 0
labeling report:
stage 1 | develop mode 0 | gender female | score 10.000000 | retcode 0
qulification report:
credit score: 10.000000
qualified score: 3.000000
normalized snr: 35.000000
abandon utt snr threshold: 10.000000
snr score ration: 0.500000
interval score ration: 0.500000
data qulificaion report:

Training then failed with an error:


2023-07-20 02:13:16,273 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:17,519 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:18,124 - modelscope - INFO - Set workdir to ./pretrain_work_dir/
2023-07-20 02:13:18,171 - modelscope - INFO - load ./output_training_data/
2023-07-20 02:13:18,561 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:37,195 - modelscope - INFO - am_config=./pretrain_work_dir/orig_model/basemodel_16k/sambert/config.yaml voc_config=./pretrain_work_dir/orig_model/basemodel_16k/hifigan/config.yaml
2023-07-20 02:13:37,197 - modelscope - INFO - audio_config=./pretrain_work_dir/orig_model/basemodel_16k/audio_config_se_16k.yaml
2023-07-20 02:13:37,198 - modelscope - INFO - am_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/sambert/ckpt/checkpoint_2400000.pth')])
2023-07-20 02:13:37,200 - modelscope - INFO - voc_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth')])
2023-07-20 02:13:37,203 - modelscope - INFO - se_path=./pretrain_work_dir/orig_model/se.npy se_model_path=./pretrain_work_dir/orig_model/basemodel_16k/speaker_embedding/se.onnx
2023-07-20 02:13:37,204 - modelscope - INFO - mvn_path=./pretrain_work_dir/orig_model/mvn.npy
100%|██████████| 2/2 [00:00<00:00, 2823.50it/s]TextScriptConvertor.process:
Save script to: ./pretrain_work_dir/data/Script.xml
TextScriptConvertor.process:
Save metafile to: ./pretrain_work_dir/data/raw_metafile.txt
[AudioProcessor] Initialize AudioProcessor.
[AudioProcessor] config params:
[AudioProcessor] wav_normalize: True
[AudioProcessor] trim_silence: True
[AudioProcessor] trim_silence_threshold_db: 60
[AudioProcessor] preemphasize: False
[AudioProcessor] sampling_rate: 16000
[AudioProcessor] hop_length: 200
[AudioProcessor] win_length: 1000
[AudioProcessor] n_fft: 2048
[AudioProcessor] n_mels: 80
[AudioProcessor] fmin: 0.0
[AudioProcessor] fmax: 8000.0
[AudioProcessor] phone_level_feature: True
[AudioProcessor] se_feature: True
[AudioProcessor] norm_type: mean_std
[AudioProcessor] max_norm: 1.0
[AudioProcessor] symmetric: False
[AudioProcessor] min_level_db: -100.0
[AudioProcessor] ref_level_db: 20
[AudioProcessor] num_workers: 16
[AudioProcessor] Amplitude normalization started
Volume statistic proceeding...

100%|██████████| 1/1 [00:00<00:00,  1.70it/s]
Average amplitude RMS : 0.126146
Volume statistic done.
Volume normalization proceeding...
100%|██████████| 1/1 [00:00<00:00, 530.12it/s]Volume normalization done.
[AudioProcessor] Amplitude normalization finished
[AudioProcessor] Duration generation started

  0%|          | 0/1 [00:00<?, ?it/s][AudioProcessor] Duration align with mel is proceeding...
100%|██████████| 1/1 [00:01<00:00,  1.14s/it]
[AudioProcessor] Duration generate finished
[AudioProcessor] Trim silence with interval started
[AudioProcessor] Start to load pcm from ./pretrain_work_dir/data/wav
100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
  0%|          | 0/1 [00:01<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 815.70it/s][AudioProcessor] Trim silence finished
[AudioProcessor] Melspec extraction started

100%|██████████| 1/1 [00:01<00:00,  1.57s/it]
[AudioProcessor] Melspec extraction finished
Melspec statistic proceeding...
100%|██████████| 1/1 [00:00<00:00, 3236.35it/s]
100%|██████████| 1/1 [00:00<00:00, 363.39it/s]Melspec statistic done
[AudioProcessor] melspec mean and std saved to:
./pretrain_work_dir/data/mel/mel_mean.txt,
./pretrain_work_dir/data/mel/mel_std.txt
[AudioProcessor] Melspec mean std norm is proceeding...
[AudioProcessor] Melspec normalization finished
[AudioProcessor] Normed Melspec saved to ./pretrain_work_dir/data/mel
[AudioProcessor] Pitch extraction started

  0%|          | 0/1 [00:00<?, ?it/s][AudioProcessor] Pitch align with mel is proceeding...
100%|██████████| 1/1 [00:01<00:00,  1.69s/it]
[AudioProcessor] Pitch normalization is proceeding...
100%|██████████| 1/1 [00:00<00:00, 4128.25it/s]
100%|██████████| 1/1 [00:00<00:00, 3721.65it/s][AudioProcessor] f0 mean and std saved to:
./pretrain_work_dir/data/f0/f0_mean.txt,
./pretrain_work_dir/data/f0/f0_std.txt
[AudioProcessor] Pitch mean std norm is proceeding...
[AudioProcessor] Pitch turn to phone-level is proceeding...

100%|██████████| 1/1 [00:01<00:00,  1.55s/it]
[AudioProcessor] Pitch normalization finished
[AudioProcessor] Normed f0 saved to ./pretrain_work_dir/data/f0
[AudioProcessor] Pitch extraction finished
[AudioProcessor] Energy extraction started
100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
100%|██████████| 1/1 [00:00<00:00, 252.64it/s]
100%|██████████| 1/1 [00:00<00:00, 3682.44it/s][AudioProcessor] energy mean and std saved to:
./pretrain_work_dir/data/energy/energy_mean.txt,
./pretrain_work_dir/data/energy/energy_std.txt
[AudioProcessor] Energy mean std norm is proceeding...

100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
[AudioProcessor] Energy normalization finished
[AudioProcessor] Normed Energy saved to ./pretrain_work_dir/data/energy
[AudioProcessor] Energy extraction finished
[AudioProcessor] All features extracted successfully!
Processing audio done.
[SpeakerEmbeddingProcessor] Speaker embedding extractor started
[SpeakerEmbeddingProcessor] se model loading error!!!
[SpeakerEmbeddingProcessor] please update your se model to ensure that the version is greater than or equal to 1.0.5
[SpeakerEmbeddingProcessor] try load it as se.model
[SpeakerEmbeddingProcessor] Speaker embedding extracted successfully!
Processing speaker embedding done.
Processing done.
Voc metafile generated.
AM metafile generated.
2023-07-20 02:14:06,035 - modelscope - INFO - Start training....
2023-07-20 02:14:06,040 - modelscope - INFO - Start SAMBERT training...
2023-07-20 02:14:06,042 - modelscope - INFO - TRAIN SAMBERT....
2023-07-20 02:14:06,059 - modelscope - INFO - TRAINING steps: 2400202
2023-07-20 02:14:06,069 - modelscope - INFO - audio_config = {'fmax': 8000.0, 'fmin': 0.0, 'hop_length': 200, 'max_norm': 1.0, 'min_level_db': -100.0, 'n_fft': 2048, 'n_mels': 80, 'norm_type': 'mean_std', 'num_workers': 16, 'phone_level_feature': True, 'preemphasize': False, 'ref_level_db': 20, 'sampling_rate': 16000, 'symmetric': False, 'trim_silence': True, 'trim_silence_threshold_db': 60, 'wav_normalize': True, 'win_length': 1000}
2023-07-20 02:14:06,070 - modelscope - INFO - Loss = {'MelReconLoss': {'enable': True, 'params': {'loss_type': 'mae'}}, 'ProsodyReconLoss': {'enable': True, 'params': {'loss_type': 'mae'}}}
2023-07-20 02:14:06,072 - modelscope - INFO - Model = {'KanTtsSAMBERT': {'optimizer': {'params': {'betas': [0.9, 0.98], 'eps': 1e-09, 'lr': 0.001, 'weight_decay': 0.0}, 'type': 'Adam'}, 'params': {'MAS': False, 'NSF': True, 'SE': True, 'decoder_attention_dropout': 0.1, 'decoder_dropout': 0.1, 'decoder_ffn_inner_dim': 1024, 'decoder_num_heads': 8, 'decoder_num_layers': 12, 'decoder_num_units': 128, 'decoder_prenet_units': [256, 256], 'decoder_relu_dropout': 0.1, 'dur_pred_lstm_units': 128, 'dur_pred_prenet_units': [128, 128], 'embedding_dim': 512, 'emotion_units': 32, 'encoder_attention_dropout': 0.1, 'encoder_dropout': 0.1, 'encoder_ffn_inner_dim': 1024, 'encoder_num_heads': 8, 'encoder_num_layers': 8, 'encoder_num_units': 128, 'encoder_projection_units': 32, 'encoder_relu_dropout': 0.1, 'max_len': 800, 'nsf_f0_global_maximum': 730.0, 'nsf_f0_global_minimum': 30.0, 'nsf_norm_type': 'global', 'num_mels': 82, 'outputs_per_step': 3, 'postnet_dropout': 0.1, 'postnet_ffn_inner_dim': 512, 'postnet_filter_size': 41, 'postnet_fsmn_num_layers': 4, 'postnet_lstm_units': 128, 'postnet_num_memory_units': 256, 'postnet_shift': 17, 'predictor_dropout': 0.1, 'predictor_ffn_inner_dim': 256, 'predictor_filter_size': 41, 'predictor_fsmn_num_layers': 3, 'predictor_lstm_units': 128, 'predictor_num_memory_units': 128, 'predictor_shift': 0, 'speaker_units': 192}, 'scheduler': {'params': {'warmup_steps': 4000}, 'type': 'NoamLR'}}}
2023-07-20 02:14:06,074 - modelscope - INFO - allow_cache = False
2023-07-20 02:14:06,084 - modelscope - INFO - batch_size = 32
2023-07-20 02:14:06,085 - modelscope - INFO - create_time = 2023-07-20 02:14:06
2023-07-20 02:14:06,087 - modelscope - INFO - eval_interval_steps = 10000000000000000
2023-07-20 02:14:06,090 - modelscope - INFO - git_revision_hash = d16755444c9baf23348213211a5ed9035458ecf0
2023-07-20 02:14:06,093 - modelscope - INFO - grad_norm = 1.0
2023-07-20 02:14:06,096 - modelscope - INFO - linguistic_unit = {'cleaners': 'english_cleaners', 'lfeat_type_list': 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category', 'speaker_list': 'F7'}
2023-07-20 02:14:06,098 - modelscope - INFO - log_interval_steps = 50
2023-07-20 02:14:06,099 - modelscope - INFO - model_type = sambert
2023-07-20 02:14:06,100 - modelscope - INFO - num_save_intermediate_results = 4
2023-07-20 02:14:06,101 - modelscope - INFO - num_workers = 4
2023-07-20 02:14:06,102 - modelscope - INFO - pin_memory = False
2023-07-20 02:14:06,105 - modelscope - INFO - remove_short_samples = False
2023-07-20 02:14:06,111 - modelscope - INFO - save_interval_steps = 200
2023-07-20 02:14:06,113 - modelscope - INFO - train_max_steps = 2400202
2023-07-20 02:14:06,115 - modelscope - INFO - train_steps = 202
2023-07-20 02:14:06,119 - modelscope - INFO - log_interval = 10
2023-07-20 02:14:06,121 - modelscope - INFO - modelscope_version = 1.7.1
Loading metafile...
0it [00:00, ?it/s]Loading metafile...

100%|██████████| 1/1 [00:00<00:00, 9198.04it/s]
2023-07-20 02:14:06,139 - modelscope - INFO - The number of training files = 0.
2023-07-20 02:14:06,141 - modelscope - INFO - The number of validation files = 1.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-15-0089498a7012>](https://localhost:8080/#) in <cell line: 33>()
     31                         default_args=kwargs)
     32 
---> 33 trainer.train()

KevinWang676 commented 1 year ago

I haven't run into this before. Try running it again with the latest Colab notebook; it was updated yesterday.

waltcow commented 1 year ago

OK, I'll try again.

waltcow commented 1 year ago

(screenshot attached)

I'm still hitting the same problem. @KevinWang676

KevinWang676 commented 1 year ago

In that case, I'd suggest running it in the Alibaba Cloud Notebook environment instead; the workflow is the same.

waltcow commented 1 year ago

I found the problem: the sample audio was too short, so training could not run.
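
For anyone who lands here with the same traceback: the training log above reports `The number of training files = 0.` and `The number of validation files = 1.`, which fits a recording so short that it produced only one cut utterance. Below is a minimal sketch of a pre-flight check that could be run before `trainer.train()`. It uses only the Python standard library. The function names, the 60-second threshold, and the two-file minimum are assumptions made for illustration; the thread does not state how much audio is actually required.

```python
import wave
from pathlib import Path

# Hypothetical threshold: the thread does not say how much audio is enough.
MIN_TOTAL_SECONDS = 60.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def summarize_wav_dir(wav_dir):
    """Return (file_count, total_seconds) for the .wav files directly in wav_dir."""
    durations = [wav_duration_seconds(p) for p in sorted(Path(wav_dir).glob("*.wav"))]
    return len(durations), sum(durations)


def looks_too_short(wav_dir, min_total_seconds=MIN_TOTAL_SECONDS, min_files=2):
    """Heuristic: flag a dataset with too few utterances or too little audio.

    min_files=2 is an assumption based on the log above, where a single
    utterance ended up in the validation set and left nothing to train on.
    """
    count, total = summarize_wav_dir(wav_dir)
    return count < min_files or total < min_total_seconds


# Example usage, with the cut-audio directory that appears in the auto-label log:
#   count, total = summarize_wav_dir("/content/output_training_data/wav_cut")
#   print(f"{count} utterance(s), {total:.1f} s of audio")
#   if looks_too_short("/content/output_training_data/wav_cut"):
#       print("Recording is probably too short; record more audio and re-run auto-labeling.")
```

If the check flags the data, recording a longer sample and re-running the auto-labeling step is the fix reported in this thread.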