Closed Tomcattwo closed 3 years ago
OK, I did a bit more tracing. Based on the above error, in synthesizer\synthesize.py, line 69, I changed the line from:
69 collate_fn=lambda batch: collate_synthesizer(batch, r)
to:
69 collate_fn=lambda batch: collate_synthesizer(batch, r, hparams)
and ran vocoder_preprocess.py using the command line:
python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
This cleared the collate_synthesizer error, but the preprocess still failed to run. Here is the output I received:
(VoiceClone) C:\Utilities\SV2TTS>python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
Arguments:
datasets_root: datasets_root
model_dir: synthesizer/saved_models/V13M_LS_pretrained
hparams:
no_trim: False
cpu: False
{'allow_clipping_in_normalization': True,
'clip_mels_length': True,
'fmax': 7600,
'fmin': 55,
'griffin_lim_iters': 60,
'hop_size': 200,
'max_abs_value': 4.0,
'max_mel_frames': 900,
'min_level_db': -100,
'n_fft': 800,
'num_mels': 80,
'power': 1.5,
'preemphasis': 0.97,
'preemphasize': True,
'ref_level_db': 20,
'rescale': True,
'rescaling_max': 0.9,
'sample_rate': 16000,
'signal_normalization': True,
'silence_min_duration_split': 0.4,
'speaker_embedding_size': 256,
'symmetric_mels': True,
'synthesis_batch_size': 16,
'trim_silence': True,
'tts_cleaner_names': ['english_cleaners'],
'tts_clip_grad_norm': 1.0,
'tts_decoder_dims': 128,
'tts_dropout': 0.5,
'tts_embed_dims': 512,
'tts_encoder_K': 5,
'tts_encoder_dims': 256,
'tts_eval_interval': 500,
'tts_eval_num_samples': 1,
'tts_lstm_dims': 1024,
'tts_num_highways': 4,
'tts_postnet_K': 5,
'tts_postnet_dims': 512,
'tts_schedule': [(2, 0.001, 20000, 12),
(2, 0.0005, 40000, 12),
(2, 0.0002, 80000, 12),
(2, 0.0001, 160000, 12),
(2, 3e-05, 320000, 12),
(2, 1e-05, 640000, 12)],
'tts_stop_threshold': -3.4,
'use_lws': False,
'utterance_min_duration': 1.6,
'win_size': 800}
Synthesizer using device: cuda
Trainable Parameters: 30.870M
Loading weights at synthesizer\saved_models\V13M_LS_pretrained\V13M_LS_pretrained.pt
Tacotron weights loaded from step 297000
Using inputs from:
datasets_root\SV2TTS\synthesizer\train.txt
datasets_root\SV2TTS\synthesizer\mels
datasets_root\SV2TTS\synthesizer\embeds
Found 325 samples
0%| | 0/21 [00:00<?, ?it/s]C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ..\c10/core/TensorImpl.h:1156.)
return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)
0%| | 0/21 [00:03<?, ?it/s]
Traceback (most recent call last):
File "vocoder_preprocess.py", line 58, in <module>
run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 87, in run_synthesis
_, mels_out, _ = model(texts, mels, embeds)
ValueError: too many values to unpack (expected 3)
Here are the relevant lines from synthesize.py:
83 # Parallelize model onto GPUS using workaround due to python bug
84 if device.type == "cuda" and torch.cuda.device_count() > 1:
85     _, mels_out, _ = data_parallel_workaround(model, texts, mels, embeds)
86 else:
87     _, mels_out, _ = model(texts, mels, embeds)
Not sure where to go with this one...I am using a GPU, CUDA 11.1, num_workers=0 (because of Win10 pickle error).
Could it be that the mels_out assignment should really be to the data_parallel_workaround rather than to model(texts, mels, embeds)?
Regards,
TC2
Per the earlier comment by blue-fish, line 87 should read: _, mels_out, _, _ = model(texts, mels, embeds)
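For anyone hitting the same ValueError, here is a minimal sketch of why the three-name unpack fails once the call returns four values. The `fake_model` function is a hypothetical stand-in, not the real Tacotron forward pass, and its return values are placeholders.

```python
def fake_model(texts, mels, embeds):
    """Hypothetical stand-in that returns four values, as the traceback implies."""
    return "first", "second", "third", "fourth"


# Three names on the left, four values on the right: this raises ValueError.
try:
    _, mels_out, _ = fake_model(None, None, None)
except ValueError as err:
    error_message = str(err)

# Matching the number of names to the number of returned values fixes it.
_, mels_out, _, _ = fake_model(None, None, None)

# A starred name also works and tolerates extra trailing values.
_, mels_out_tolerant, *_ = fake_model(None, None, None)
```

The starred form is only a convenience; it hides the mismatch if the return signature changes again, so the explicit four-name unpack is arguably easier to debug.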
Regarding the latest problem, please see: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/729#issuecomment-816901953
If you don't mind, please submit a pull request containing the modifications needed to make the vocoder preprocess code work.
Thanks @netman789 and @blue-fish. I will try the #729 solution and test. If everything runs properly, I will then submit pull requests to change train.py (in synthesizer and vocoder) to fix the pickle errors on Win10, to fix synthesize.py for the print(hparams_debug_string()) and collate_synthesizer issues, and to add the #729 fix. Another potential issue: ...\embed\train.py also has num_workers = 8 in line 24. Should this also receive the Win10 pickle workaround fix? If so, I will add a pull request for that fix as well. Appreciate the help. R/, TC2
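As a rough illustration of the num_workers workaround mentioned in this thread: on Windows, DataLoader worker processes are started with "spawn", so the dataset and collate function must be picklable, and a lambda is not. The sketch below simply falls back to zero workers on Windows. The helper name `pick_num_workers` is hypothetical, and the actual workaround referenced in this thread may be implemented differently.

```python
import platform


def pick_num_workers(requested, system=None):
    """Return 0 on Windows (load data in the main process), else the requested count.

    `system` is only a parameter so the behaviour can be checked without
    running on a given OS; by default the current platform is detected.
    """
    system = system or platform.system()
    return 0 if system == "Windows" else requested


# Hypothetical usage: pass the result as num_workers when building a DataLoader.
workers_on_windows = pick_num_workers(8, system="Windows")
workers_on_linux = pick_num_workers(8, system="Linux")
```

With num_workers=0 nothing has to be pickled, at the cost of slower data loading.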
TC2, if the vocoder_preprocess runs successfully now, I would be interested to know. I have reached an impasse with a different problem. I am running a slightly different dataset and am getting this error:
initializing synthesizer/synthesize
Arguments:
datasets_root: C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets
model_dir: synthesizer/saved_models/pretrained/
hparams:
no_trim: False
cpu: False
{'allow_clipping_in_normalization': True, 'allow_pickle': True, 'clip_mels_length': True, 'fmax': 7600, 'fmin': 55, 'griffin_lim_iters': 60, 'hop_size': 200, 'max_abs_value': 4.0, 'max_mel_frames': 900, 'min_level_db': -100, 'n_fft': 800, 'num_mels': 80, 'power': 1.5, 'preemphasis': 0.97, 'preemphasize': True, 'ref_level_db': 20, 'rescale': True, 'rescaling_max': 0.9, 'sample_rate': 16000, 'signal_normalization': True, 'silence_min_duration_split': 0.4, 'speaker_embedding_size': 256, 'symmetric_mels': True, 'synthesis_batch_size': 16, 'trim_silence': True, 'tts_cleaner_names': ['english_cleaners'], 'tts_clip_grad_norm': 1.0, 'tts_decoder_dims': 128, 'tts_dropout': 0.5, 'tts_embed_dims': 512, 'tts_encoder_K': 5, 'tts_encoder_dims': 256, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_lstm_dims': 1024, 'tts_num_highways': 4, 'tts_postnet_K': 5, 'tts_postnet_dims': 512, 'tts_schedule': [(1, 0.001, 20000, 12), (2, 0.0005, 40000, 12), (2, 0.0002, 80000, 12), (2, 0.0001, 160000, 12), (2, 3e-05, 320000, 12), (2, 1e-05, 640000, 12)], 'tts_stop_threshold': -3.4, 'use_lws': False, 'utterance_min_duration': 1.6, 'win_size': 800} Synthesizer using device: cuda Trainable Parameters: 30.870M
Loading weights at synthesizer\saved_models\pretrained\pretrained.pt
Tacotron weights loaded from step 295000
Using inputs from:
C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets\SV2TTS\synthesizer\train.txt
C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets\SV2TTS\synthesizer\mels
C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets\SV2TTS\synthesizer\embeds
Found 25164 samples
Length of dataloader is: 1573
0%| | 0/1573 [00:57<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\tsquare\source\repos\TomTRTVC\vocoder_preprocess.py", line 65, in
@netman789, my first thought for your issue was hparams, but your hparams look to be the same as mine. Then I noticed that the very first line after your command line was: "initializing synthesizer/synthesize"
My run (see below) does not say that... mine goes straight to Arguments.
I don't know why it would initialize synthesizer/synthesize. Check your folder structure, maybe? As you postulate, it seems like you may be running the vocoder preprocess with a synthesizer model that is incompatible, hence the mat error?
I just ran vocoder_preprocess.py after inserting the #729 solution in synthesize.py. It ran up to 38% complete, then halted with a CUDA out of memory error. Here is the output:
(VoiceClone) C:\Utilities\SV2TTS>python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
Arguments:
datasets_root: datasets_root
model_dir: synthesizer/saved_models/V13M_LS_pretrained
hparams:
no_trim: False
cpu: False
{'allow_clipping_in_normalization': True,
'clip_mels_length': True,
'fmax': 7600,
'fmin': 55,
'griffin_lim_iters': 60,
'hop_size': 200,
'max_abs_value': 4.0,
'max_mel_frames': 900,
'min_level_db': -100,
'n_fft': 800,
'num_mels': 80,
'power': 1.5,
'preemphasis': 0.97,
'preemphasize': True,
'ref_level_db': 20,
'rescale': True,
'rescaling_max': 0.9,
'sample_rate': 16000,
'signal_normalization': True,
'silence_min_duration_split': 0.4,
'speaker_embedding_size': 256,
'symmetric_mels': True,
'synthesis_batch_size': 16,
'trim_silence': True,
'tts_cleaner_names': ['english_cleaners'],
'tts_clip_grad_norm': 1.0,
'tts_decoder_dims': 128,
'tts_dropout': 0.5,
'tts_embed_dims': 512,
'tts_encoder_K': 5,
'tts_encoder_dims': 256,
'tts_eval_interval': 500,
'tts_eval_num_samples': 1,
'tts_lstm_dims': 1024,
'tts_num_highways': 4,
'tts_postnet_K': 5,
'tts_postnet_dims': 512,
'tts_schedule': [(2, 0.001, 20000, 12),
(2, 0.0005, 40000, 12),
(2, 0.0002, 80000, 12),
(2, 0.0001, 160000, 12),
(2, 3e-05, 320000, 12),
(2, 1e-05, 640000, 12)],
'tts_stop_threshold': -3.4,
'use_lws': False,
'utterance_min_duration': 1.6,
'win_size': 800}
Synthesizer using device: cuda
Trainable Parameters: 30.870M
Loading weights at synthesizer\saved_models\V13M_LS_pretrained\V13M_LS_pretrained.pt
Tacotron weights loaded from step 297000
Using inputs from:
datasets_root\SV2TTS\synthesizer\train.txt
datasets_root\SV2TTS\synthesizer\mels
datasets_root\SV2TTS\synthesizer\embeds
Found 325 samples
0%| | 0/21 [00:00<?, ?it/s]C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ..\c10/core/TensorImpl.h:1156.)
return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)
38%|█████████████████████████████████████████████████████████▉ | 8/21 [00:06<00:10, 1.20it/s]
Traceback (most recent call last):
File "vocoder_preprocess.py", line 58, in <module>
run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 87, in run_synthesis
_, mels_out, _, _ = model(texts, mels, embeds)
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Utilities\SV2TTS\synthesizer\models\tacotron.py", line 406, in forward
postnet_out = self.postnet(mel_outputs)
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Utilities\SV2TTS\synthesizer\models\tacotron.py", line 161, in forward
x, _ = self.rnn(x)
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\rnn.py", line 838, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 122.00 MiB (GPU 0; 8.00 GiB total capacity; 5.94 GiB already allocated; 0 bytes free; 6.11 GiB reserved in total by PyTorch)
Then I tried again with the --cpu argument. The arguments printout showed cpu: True, but the line after hparams still stated "Synthesizer using device: cuda", and it failed again with a CUDA out of memory error at 38%.
But it did run... Time to hit the sack. R/ TC2
It seems the command line option is not successfully forcing CPU use. Try changing this line to:
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
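A minimal sketch of how that setting could be tied to the --cpu flag, assuming it runs before anything initializes CUDA (the variable is read when CUDA starts up, so setting it afterwards has no effect). The function name `force_cpu_if_requested` is hypothetical and is not part of the repository.

```python
import os


def force_cpu_if_requested(cpu_flag):
    """Hide all CUDA devices from this process when cpu_flag is True.

    Must be called before the first CUDA call in the process.
    Returns the resulting value of CUDA_VISIBLE_DEVICES (or None if unset).
    """
    if cpu_flag:
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return os.environ.get("CUDA_VISIBLE_DEVICES")


# Hypothetical usage right after argument parsing, e.g. with args.cpu:
visible = force_cpu_if_requested(True)
```

With no visible devices, a check such as torch.cuda.is_available() should report False, so device selection falls back to the CPU.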
For a fixed model size, the only way I know of to get around OOM is to cut the sample size.
@blue-fish Thanks. I put in the fix you suggested and vocoder_preprocess.py ran properly on the CPU. I will put in the pull requests.
Next I will try vocoder_train.py
@netman789 Thanks. Reducing sample size (to 1/3 of the total samples) was my "Plan B", then run the preprocessor 3 times (once for each batch of samples) and combine the output results manually. R/, TC2
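For reference, the "Plan B" split could be sketched like this: divide the sample list into three nearly equal parts and preprocess each part separately. The sample names are made up for illustration, and how each chunk would be fed to the preprocessor is left out because it depends on the dataset layout.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Hypothetical sample list with the same count as the 325 samples in this run.
lines = [f"sample_{i}" for i in range(325)]
chunks = split_into_chunks(lines, 3)
sizes = [len(chunk) for chunk in chunks]
```

The chunks are contiguous and cover every sample exactly once, so combining the three outputs afterwards should not duplicate or drop anything.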
I was able to train the vocoder on top of the pretrained WaveRNN vocoder. It took about 25 min, starting at step 1159000 on the pretrained WaveRNN file. Loss started at 2.8245, running at about 1.2 steps/sec on CUDA, with batch size 100, LR 0.0001, sequence length 1000, and 4 steps per epoch. It rapidly converged to a loss of ~2.53-2.54, and I am not seeing much improvement beyond that. It stopped on its own at epoch 349 with loss = 2.5131.
Pull request #838 submitted for all of the above fixes. This issue is ready to be closed.
Hello @blue-fish and all, I am running the demo_toolbox on Win10, under Anaconda3 (run as administrator), env: VoiceClone, using an NVIDIA GeForce RTX 2070 Super on an EVGA 08G-P4-3172-KR card (8GB GDDR6), with Python 3.7 and PyTorch for Win10/CUDA 11.1, with all other requirements met. The toolbox GUI (demo_toolbox.py) works fine on this setup.
My project is to use the toolbox to clone 15 voices from a computer simulation, one voice at a time, so that I can add additional voice material (.wav files) in those voices back into the sim. I am using the Single Voice method described in Issue #437. I have been able to preprocess my datasets (see #832) and single-voice train them onto the LibriSpeech 295K pretrained synthesizer with good results.
During this experiment, I tried to conduct Vocoder training on dataset V13M (see #832 ), as described in the README.TXT file from the zip file provided by @blue-fish in #437
I used the command line:
python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
It could not find datasets_root\SV2TTS\vocoder\mels_gta
So I created datasets_root\SV2TTS\vocoder\mels_gta, copied all the mels from datasets_root\SV2TTS\synthesizer\mels into datasets_root\SV2TTS\vocoder\mels_gta, and ran it again.
I ran into the following issues:
1) While attempting to run vocoder_preprocess.py on the single-voice trained synthesizer and dataset V13M, I ran into the Win10 "pickle" issue in ...\vocoder\train.py. This issue was identical to the pickle error I encountered when doing synthesizer training on the dataset. I solved it in exactly the same way, by recoding ...vocoder\train.py to use the workaround provided here: blue-fish@89a9964. This corrected the pickle issue for vocoder_preprocess.py.
2) Next I encountered an error in vocoder_preprocess.py: "hparams_debug_string() takes 0 positional arguments but one was given"
a) vocoder_preprocess.py imports hparams from synthesizer.hparams
b) synthesizer.hparams defines hparams_debug_string() as "def hparams_debug_string():" in the second-to-last line
c) synthesize.py (which is where the error occurs) includes in line 17: "print(hparams_debug_string(hparams))"
By changing this line to "print(hparams_debug_string())", I was able to clear the error, but I think this may have then caused the next issue.
3) When I ran vocoder_preprocess.py again, I received the following:
`(VoiceClone) C:\Utilities\SV2TTS>python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
Arguments:
datasets_root: datasets_root
model_dir: synthesizer/saved_models/V13M_LS_pretrained
hparams:
no_trim: False
cpu: False
{'allow_clipping_in_normalization': True, 'clip_mels_length': True, 'fmax': 7600, 'fmin': 55, 'griffin_lim_iters': 60, 'hop_size': 200, 'max_abs_value': 4.0, 'max_mel_frames': 900, 'min_level_db': -100, 'n_fft': 800, 'num_mels': 80, 'power': 1.5, 'preemphasis': 0.97, 'preemphasize': True, 'ref_level_db': 20, 'rescale': True, 'rescaling_max': 0.9, 'sample_rate': 16000, 'signal_normalization': True, 'silence_min_duration_split': 0.4, 'speaker_embedding_size': 256, 'symmetric_mels': True, 'synthesis_batch_size': 16, 'trim_silence': True, 'tts_cleaner_names': ['english_cleaners'], 'tts_clip_grad_norm': 1.0, 'tts_decoder_dims': 128, 'tts_dropout': 0.5, 'tts_embed_dims': 512, 'tts_encoder_K': 5, 'tts_encoder_dims': 256, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_lstm_dims': 1024, 'tts_num_highways': 4, 'tts_postnet_K': 5, 'tts_postnet_dims': 512, 'tts_schedule': [(2, 0.001, 20000, 12), (2, 0.0005, 40000, 12), (2, 0.0002, 80000, 12), (2, 0.0001, 160000, 12), (2, 3e-05, 320000, 12), (2, 1e-05, 640000, 12)], 'tts_stop_threshold': -3.4, 'use_lws': False, 'utterance_min_duration': 1.6, 'win_size': 800} Synthesizer using device: cuda Trainable Parameters: 30.870M
Loading weights at synthesizer\saved_models\V13M_LS_pretrained\V13M_LS_pretrained.pt
Tacotron weights loaded from step 297000
Using inputs from:
datasets_root\SV2TTS\synthesizer\train.txt
datasets_root\SV2TTS\synthesizer\mels
datasets_root\SV2TTS\synthesizer\embeds
Found 325 samples
0%| | 0/21 [00:00<?, ?it/s]
Traceback (most recent call last):
File "vocoder_preprocess.py", line 58, in <module>
run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 78, in run_synthesis
for i, (texts, mels, embeds, idx) in tqdm(enumerate(data_loader), total=len(data_loader)):
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\tqdm\std.py", line 1185, in __iter__
for obj in iterable:
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
data = self._next_data()
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
return self.collate_fn(data)
File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 69, in <lambda>
collate_fn=lambda batch: collate_synthesizer(batch, r),
TypeError: collate_synthesizer() missing 1 required positional argument: 'hparams'`
At this point I could not trace the code back any further, but it looks like the hparams are not getting properly sent to vocoder.train.py
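The TypeError in the traceback above can be reproduced in isolation. This is a minimal sketch, assuming a collate function with the signature (batch, r, hparams). The body of `collate_synthesizer` here is a hypothetical stand-in that just pads lists, and `pad_value` is an invented field, not a real hparam.

```python
from types import SimpleNamespace


def collate_synthesizer(batch, r, hparams):
    """Hypothetical stand-in: pad every item to a common length that is a multiple of r."""
    max_len = max(len(item) for item in batch)
    if max_len % r != 0:
        max_len += r - max_len % r
    return [list(item) + [hparams.pad_value] * (max_len - len(item)) for item in batch]


hparams = SimpleNamespace(pad_value=0)  # invented field, for illustration only
r = 2

# The lambda from the traceback: hparams is never passed along.
broken_collate = lambda batch: collate_synthesizer(batch, r)
# The lambda closes over hparams and forwards it.
fixed_collate = lambda batch: collate_synthesizer(batch, r, hparams)

batch = [[1, 2, 3], [4]]
try:
    broken_collate(batch)
except TypeError as err:
    error_message = str(err)

padded = fixed_collate(batch)
```

The data loader only ever calls collate_fn with the batch, so any extra arguments have to be captured by the lambda (or by functools.partial) at the point where the loader is built.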
If you need any other information, I will try to provide it. Please let me know.
Regards, Tomcattwo