Open waltcow opened 1 year ago
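The environment is probably not fully set up. The auto-label step is very strict about dependency versions, for example those of numpy and typeguard. I suggest running it on Colab or an Alibaba Cloud notebook first.
Could you provide local installation instructions?
I haven't run it locally yet; I'll look into the local environment setup later.
I struggled with this for a whole afternoon and it feels too hard. @KevinWang676 https://modelscope.cn/models/damo/speech_ptts_autolabel_16k/summary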
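Since the reply above points at dependency versions, a quick first step is to print what is actually installed locally and compare it with the Colab environment. The following is only a minimal sketch: the thread does not state the required versions, so none are checked here, and the distribution name `tts-autolabel` is an assumption inferred from the `tts_autolabel` module in the traceback.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # "tts-autolabel" is an assumed distribution name; adjust it if pip lists it differently.
    for name, version in installed_versions(
        ["numpy", "typeguard", "modelscope", "tts-autolabel"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```

Running the same script in Colab and locally gives two lists that can be compared line by line.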
2023-07-20 08:38:16,233 - modelscope - INFO - Use user-specified model revision: v1.0.5
2023-07-20:08:38:16, INFO [api.py:463] Use user-specified model revision: v1.0.5
--- Remove /home/mai/Bark-Voice-Cloning/output_training_data/paragraph/prosody folder! ---
--- New folder /home/mai/Bark-Voice-Cloning/output_training_data/paragraph/prosody... ---
--- OK ---
--- Remove /home/mai/Bark-Voice-Cloning/output_training_data/sp_interval folder! ---
--- New folder /home/mai/Bark-Voice-Cloning/output_training_data/sp_interval... ---
--- OK ---
--- New folder /home/mai/Bark-Voice-Cloning/output_training_data/wav... ---
--- OK ---
--- Remove /home/mai/Bark-Voice-Cloning/output_training_data/log folder! ---
--- New folder /home/mai/Bark-Voice-Cloning/output_training_data/log... ---
--- OK ---
2023-07-20 08:38:23
wav_preprocess start...
--- There is this folder! ---
0%| | 0/16 [00:00<?, ?it/s]sox WARN rate: rate clipped 1 samples; decrease volume?
sox WARN dither: dither clipped 1 samples; decrease volume?
100%|██████████| 16/16 [00:00<00:00, 139.72it/s]
wav cut by vad start...
12%|█▎ | 2/16 [00:00<00:01, 10.62it/s]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[8], line 4
1 input_wav = "./test_wavs/"
2 output_data = "./output_training_data/"
----> 4 ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.5")
File ~/.local/lib/python3.9/site-packages/modelscope/tools/speech_tts_autolabel.py:77, in run_auto_label(input_wav, work_dir, para_ids, resource_model_id, resource_revision, gender, stage, process_num, develop_mode, has_para, enable_enh)
63 model_resource = _download_and_unzip_resousrce(resource_model_id,
64 resource_revision)
65 auto_labeling = AutoLabeling(
66 os.path.abspath(input_wav),
67 model_resource,
(...)
75 process_num,
76 enable_enh=enable_enh)
---> 77 ret_code, report = auto_labeling.run()
78 return ret_code, report
File ~/.local/lib/python3.9/site-packages/tts_autolabel/auto_label.py:765, in AutoLabeling.run(self)
762 self.wav_preprocess()
764 ## cut wav by vad
--> 765 self.wav_cut_by_vad()
767 # get prosody
768 audio_path = glob.glob(os.path.join(self.cut_wav_dir, '*.wav'))
File ~/.local/lib/python3.9/site-packages/tts_autolabel/auto_label.py:371, in AutoLabeling.wav_cut_by_vad(self)
369 shutil.rmtree(self.cut_wav_dir)
370 os.makedirs(self.cut_wav_dir, exist_ok=True)
--> 371 vad_cut(self.resample_wav_dir, self.cut_wav_dir, self.resource_dir)
File ~/.local/lib/python3.9/site-packages/tts_autolabel/audiocut/vad.py:73, in vad_cut(input_wav_dir, output_wav_dir, resource_dir, cut_threshold, start_sil_threshold, end_sil_threshold, max_dur_threshold, min_dur_threshold)
69 min_samples_threshold = int(min_dur_threshold * sample_rate)
71 wavid = os.path.basename(audio_in).split('.')[0]
---> 73 segments_result = vad_pipeline(audio_in=waveform)
74 segments_text = segments_result[0]
76 if len(segments_text) == 0:
File ~/.local/lib/python3.9/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py:94, in Fsmn_vad.__call__(self, audio_in, **kwargs)
92 end_idx = min(waveform_nums, beg_idx + self.batch_size)
93 waveform = waveform_list[beg_idx:end_idx]
---> 94 feats, feats_len = self.extract_feat(waveform)
95 waveform = np.array(waveform)
96 param_dict = kwargs.get('param_dict', dict())
File ~/.local/lib/python3.9/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py:154, in Fsmn_vad.extract_feat(self, waveform_list)
152 for waveform in waveform_list:
153 speech, _ = self.frontend.fbank(waveform)
--> 154 feat, feat_len = self.frontend.lfr_cmvn(speech)
155 feats.append(feat)
156 feats_len.append(feat_len)
File ~/.local/lib/python3.9/site-packages/tts_autolabel/audio2phone/funasr_onnx/utils/frontend.py:89, in WavFrontend.lfr_cmvn(self, feat)
87 def lfr_cmvn(self, feat: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
88 if self.lfr_m != 1 or self.lfr_n != 1:
---> 89 feat = self.apply_lfr(feat, self.lfr_m, self.lfr_n)
91 if self.cmvn_file:
92 feat = self.apply_cmvn(feat)
File ~/.local/lib/python3.9/site-packages/tts_autolabel/audio2phone/funasr_onnx/utils/frontend.py:103, in WavFrontend.apply_lfr(inputs, lfr_m, lfr_n)
101 T = inputs.shape[0]
102 T_lfr = int(np.ceil(T / lfr_n))
--> 103 left_padding = np.tile(inputs[0], ((lfr_m - 1) // 2, 1))
104 inputs = np.vstack((left_padding, inputs))
105 T = T + (lfr_m - 1) // 2
IndexError: index 0 is out of bounds for axis 0 with size 0
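The final frame fails on `inputs[0]` with "size 0", which means the feature array passed to `apply_lfr` had no frames. One possible cause (not confirmed anywhere in this thread) is that one of the 16 input recordings is empty or extremely short, so the filterbank step produces nothing. Under that assumption, a minimal pre-check is to scan the input folder with the standard-library `wave` module before calling `run_auto_label`. The 0.1 s threshold and the helper name `find_suspect_wavs` are hypothetical, chosen only for illustration.

```python
import wave
from pathlib import Path


def find_suspect_wavs(wav_dir, min_seconds=0.1):
    """Return (path, reason) pairs for WAV files that are unreadable, empty, or very short.

    min_seconds is an illustrative threshold, not a value taken from tts_autolabel.
    """
    suspects = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                frames = wav.getnframes()
                rate = wav.getframerate()
        except (wave.Error, EOFError) as exc:
            # Non-PCM or truncated files cannot be parsed by the wave module.
            suspects.append((path, f"unreadable: {exc}"))
            continue
        duration = frames / rate if rate else 0.0
        if duration < min_seconds:
            suspects.append((path, f"too short: {duration:.3f}s"))
    return suspects


if __name__ == "__main__":
    # "./test_wavs/" is the input folder used in the notebook cell above.
    for path, reason in find_suspect_wavs("./test_wavs/"):
        print(f"{path}: {reason}")
```

If this prints nothing, the input files are at least non-empty PCM WAVs, and the environment mismatch mentioned earlier becomes the more likely explanation. Note that the failure occurs on the resampled copies, so checking the originals is only an approximation.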
Did the original poster solve this? I'm also trying a local deployment, but I'm stuck at this auto label step.
Trying to run Voice_Cloning_for_Chinese_Speech.ipynb locally