CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Vocoder Preprocessing Failure #833

Closed Tomcattwo closed 3 years ago

Tomcattwo commented 3 years ago

Hello @blue-fish and all, I am running the demo toolbox on Win10 under Anaconda3 (run as administrator), env: VoiceClone, with an NVIDIA GeForce RTX 2070 Super on an EVGA 08G-P4-3172-KR card (8 GB GDDR6), Python 3.7, and the Win10/CUDA 11.1 build of PyTorch, with all other requirements met. The toolbox GUI (demo_toolbox.py) works fine on this setup.

My project is to use the toolbox to clone 15 voices from a computer simulation (so I can add additional voice material, i.e. .wav files in those voices, back into the sim), one voice at a time, using the single-voice method described in issue #437. I have been able to preprocess my datasets (see #832) and single-voice train them on top of the LibriSpeech 295k pretrained synthesizer with good results.

During this experiment, I tried to conduct vocoder training on dataset V13M (see #832), as described in the README.TXT file from the zip file provided by @blue-fish in #437.

I used the command line:

python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained

It could not find datasets_root\SV2TTS\vocoder\mels_gta.

So I created datasets_root\SV2TTS\vocoder\mels_gta, copied all the mels from datasets_root\SV2TTS\synthesizer\mels into it, and ran it again.

I ran into the following issues:

1) While attempting to run vocoder_preprocess.py on the single-voice-trained synthesizer and dataset V13M, I ran into the Win10 "pickle" issue in ...\vocoder\train.py. This issue was identical to the pickle error I encountered when doing synthesizer training on the dataset. I solved it in exactly the same way, by modifying ...\vocoder\train.py to use the workaround provided here: blue-fish@89a9964. This cleared the pickle issue for vocoder_preprocess.py.
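For reference, a minimal runnable sketch of the kind of change that workaround makes (the dataset and batch size here are stand-ins, not the repo's actual code):

```python
import platform
import torch
from torch.utils.data import DataLoader, TensorDataset

# On Windows, DataLoader workers are spawned rather than forked, so the dataset
# and any lambda collate_fn must be picklable; num_workers=0 sidesteps that by
# loading batches in the main process.
num_workers = 0 if platform.system() == "Windows" else 2

dataset = TensorDataset(torch.arange(32).float())  # stand-in for the repo's dataset
loader = DataLoader(dataset,
                    batch_size=4,
                    shuffle=True,
                    num_workers=num_workers,  # 0 on Win10 avoids the pickle error
                    pin_memory=True)

for (batch,) in loader:
    pass  # batches now load without spawning worker processes
```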

2) Next I encountered an error in vocoder_preprocess.py: "hparams_debug_string() takes 0 positional arguments but 1 was given". a) vocoder_preprocess.py imports hparams from synthesizer.hparams; b) synthesizer.hparams defines the function with no arguments, as "def hparams_debug_string():", in its second-to-last line; c) synthesize.py (which is where the error occurs) calls "print(hparams_debug_string(hparams))" at line 17. By changing this line to "print(hparams_debug_string())", I was able to clear the error, but I think this may have then caused the next issue.
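For context, a rough sketch of the mismatch; the body of hparams_debug_string() below is a simplified stand-in, only the zero-argument signature is taken from synthesizer/hparams.py as described above:

```python
# Simplified stand-in for the module-level hparams object in synthesizer/hparams.py
hparams = {"sample_rate": 16000, "num_mels": 80}

def hparams_debug_string():
    # Takes no arguments; formats the module-level hparams object itself.
    return str(hparams)

print(hparams_debug_string())          # OK
# print(hparams_debug_string(hparams)) # TypeError: takes 0 positional arguments but 1 was given
```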

3) When I ran vocoder_preprocess.py again, I received the following:

(VoiceClone) C:\Utilities\SV2TTS>python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
Arguments:
    datasets_root:   datasets_root
    model_dir:       synthesizer/saved_models/V13M_LS_pretrained
    hparams:
    no_trim:         False
    cpu:             False

{'allow_clipping_in_normalization': True, 'clip_mels_length': True, 'fmax': 7600, 'fmin': 55, 'griffin_lim_iters': 60, 'hop_size': 200, 'max_abs_value': 4.0, 'max_mel_frames': 900, 'min_level_db': -100, 'n_fft': 800, 'num_mels': 80, 'power': 1.5, 'preemphasis': 0.97, 'preemphasize': True, 'ref_level_db': 20, 'rescale': True, 'rescaling_max': 0.9, 'sample_rate': 16000, 'signal_normalization': True, 'silence_min_duration_split': 0.4, 'speaker_embedding_size': 256, 'symmetric_mels': True, 'synthesis_batch_size': 16, 'trim_silence': True, 'tts_cleaner_names': ['english_cleaners'], 'tts_clip_grad_norm': 1.0, 'tts_decoder_dims': 128, 'tts_dropout': 0.5, 'tts_embed_dims': 512, 'tts_encoder_K': 5, 'tts_encoder_dims': 256, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_lstm_dims': 1024, 'tts_num_highways': 4, 'tts_postnet_K': 5, 'tts_postnet_dims': 512, 'tts_schedule': [(2, 0.001, 20000, 12), (2, 0.0005, 40000, 12), (2, 0.0002, 80000, 12), (2, 0.0001, 160000, 12), (2, 3e-05, 320000, 12), (2, 1e-05, 640000, 12)], 'tts_stop_threshold': -3.4, 'use_lws': False, 'utterance_min_duration': 1.6, 'win_size': 800}
Synthesizer using device: cuda
Trainable Parameters: 30.870M

Loading weights at synthesizer\saved_models\V13M_LS_pretrained\V13M_LS_pretrained.pt
Tacotron weights loaded from step 297000
Using inputs from:
        datasets_root\SV2TTS\synthesizer\train.txt
        datasets_root\SV2TTS\synthesizer\mels
        datasets_root\SV2TTS\synthesizer\embeds
Found 325 samples
  0%|                                        | 0/21 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "vocoder_preprocess.py", line 58, in <module>
    run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
  File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 78, in run_synthesis
    for i, (texts, mels, embeds, idx) in tqdm(enumerate(data_loader), total=len(data_loader)):
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\tqdm\std.py", line 1185, in __iter__
    for obj in iterable:
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 69, in <lambda>
    collate_fn=lambda batch: collate_synthesizer(batch, r),
TypeError: collate_synthesizer() missing 1 required positional argument: 'hparams'

At this point I could not trace the code back any further, but it looks like the hparams are not being passed through to collate_synthesizer() in synthesizer\synthesize.py, which vocoder_preprocess.py calls.

If you need any other information, I will try to provide it. Please let me know.

Regards, Tomcattwo

Tomcattwo commented 3 years ago

OK, I did a bit more tracing. Based on the above error, in synthesizer\synthesize.py, line 69, I changed the line from:

69 collate_fn=lambda batch: collate_synthesizer(batch, r)

to:

69 collate_fn=lambda batch: collate_synthesizer(batch, r, hparams)

and ran vocoder_preprocess.py using the command line:

python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained

This cleared the collate_synthesizer error, but still failed to run the preprocess. Here is the output I received:

(VoiceClone) C:\Utilities\SV2TTS>python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
Arguments:
    datasets_root:   datasets_root
    model_dir:       synthesizer/saved_models/V13M_LS_pretrained
    hparams:
    no_trim:         False
    cpu:             False

{'allow_clipping_in_normalization': True,
 'clip_mels_length': True,
 'fmax': 7600,
 'fmin': 55,
 'griffin_lim_iters': 60,
 'hop_size': 200,
 'max_abs_value': 4.0,
 'max_mel_frames': 900,
 'min_level_db': -100,
 'n_fft': 800,
 'num_mels': 80,
 'power': 1.5,
 'preemphasis': 0.97,
 'preemphasize': True,
 'ref_level_db': 20,
 'rescale': True,
 'rescaling_max': 0.9,
 'sample_rate': 16000,
 'signal_normalization': True,
 'silence_min_duration_split': 0.4,
 'speaker_embedding_size': 256,
 'symmetric_mels': True,
 'synthesis_batch_size': 16,
 'trim_silence': True,
 'tts_cleaner_names': ['english_cleaners'],
 'tts_clip_grad_norm': 1.0,
 'tts_decoder_dims': 128,
 'tts_dropout': 0.5,
 'tts_embed_dims': 512,
 'tts_encoder_K': 5,
 'tts_encoder_dims': 256,
 'tts_eval_interval': 500,
 'tts_eval_num_samples': 1,
 'tts_lstm_dims': 1024,
 'tts_num_highways': 4,
 'tts_postnet_K': 5,
 'tts_postnet_dims': 512,
 'tts_schedule': [(2, 0.001, 20000, 12),
                  (2, 0.0005, 40000, 12),
                  (2, 0.0002, 80000, 12),
                  (2, 0.0001, 160000, 12),
                  (2, 3e-05, 320000, 12),
                  (2, 1e-05, 640000, 12)],
 'tts_stop_threshold': -3.4,
 'use_lws': False,
 'utterance_min_duration': 1.6,
 'win_size': 800}
Synthesizer using device: cuda
Trainable Parameters: 30.870M

Loading weights at synthesizer\saved_models\V13M_LS_pretrained\V13M_LS_pretrained.pt
Tacotron weights loaded from step 297000
Using inputs from:
        datasets_root\SV2TTS\synthesizer\train.txt
        datasets_root\SV2TTS\synthesizer\mels
        datasets_root\SV2TTS\synthesizer\embeds
Found 325 samples
  0%|                                                                                           | 0/21 [00:00<?, ?it/s]C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
  return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)
  0%|                                                                                           | 0/21 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "vocoder_preprocess.py", line 58, in <module>
    run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
  File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 87, in run_synthesis
    _, mels_out, _ = model(texts, mels, embeds)
ValueError: too many values to unpack (expected 3)

Here are the relevant lines from synthesize.py:

83  # Parallelize model onto GPUS using workaround due to python bug
84  if device.type == "cuda" and torch.cuda.device_count() > 1:
85      _, mels_out, _ = data_parallel_workaround(model, texts, mels, embeds)
86  else:
87      _, mels_out, _ = model(texts, mels, embeds)

Not sure where to go with this one... I am using a GPU, CUDA 11.1, and num_workers=0 (because of the Win10 pickle error). Could it be that the mels_out assignment should really come from data_parallel_workaround rather than from model(texts, mels, embeds)?
Regards, TC2

netman789 commented 3 years ago

Per an earlier comment by blue-fish, line 87 should read: _, mels_out, _, _ = model(texts, mels, embeds)
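For clarity, a sketch of what lines 84-87 of synthesizer/synthesize.py would look like with that change, assuming (as the later log confirms) that the model's forward pass returns four values; whether the data_parallel_workaround branch needs the same extra placeholder is an assumption based on it wrapping the same model call:

```python
# Parallelize model onto GPUs using workaround due to python bug
if device.type == "cuda" and torch.cuda.device_count() > 1:
    _, mels_out, _, _ = data_parallel_workaround(model, texts, mels, embeds)
else:
    _, mels_out, _, _ = model(texts, mels, embeds)  # forward returns four values
```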

ghost commented 3 years ago

Regarding the latest problem, please see: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/729#issuecomment-816901953

If you don't mind, please submit a pull request containing the modifications needed to make the vocoder preprocess code work.

Tomcattwo commented 3 years ago

Thanks @netman789 and @blue-fish. I will try the #729 solution and test. If everything runs properly, I will then submit pull requests to change train.py (in synthesizer and vocoder) to fix the pickle errors on Win10, a pull request to fix synthesize.py for the print(hparams_debug_string()) and collate_synthesizer issues, and a pull request for the #729 fix as well.

Another potential issue: ...\embed\train.py also has num_workers = 8 in line 24. Should this also receive the Win10 pickle workaround fix? If so, I will add a pull request for that fix also.

Appreciate the help. R/, TC2

netman789 commented 3 years ago

TC2, if the vocoder_preprocess runs successfully now, I would be interested to know. I have reached an impasse with a different problem. I am running a slightly different dataset and am getting this error:

initializing synthesizer/synthesize
Arguments:
    datasets_root:   C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets
    model_dir:       synthesizer/saved_models/pretrained/
    hparams:
    no_trim:         False
    cpu:             False

{'allow_clipping_in_normalization': True, 'allow_pickle': True, 'clip_mels_length': True, 'fmax': 7600, 'fmin': 55, 'griffin_lim_iters': 60, 'hop_size': 200, 'max_abs_value': 4.0, 'max_mel_frames': 900, 'min_level_db': -100, 'n_fft': 800, 'num_mels': 80, 'power': 1.5, 'preemphasis': 0.97, 'preemphasize': True, 'ref_level_db': 20, 'rescale': True, 'rescaling_max': 0.9, 'sample_rate': 16000, 'signal_normalization': True, 'silence_min_duration_split': 0.4, 'speaker_embedding_size': 256, 'symmetric_mels': True, 'synthesis_batch_size': 16, 'trim_silence': True, 'tts_cleaner_names': ['english_cleaners'], 'tts_clip_grad_norm': 1.0, 'tts_decoder_dims': 128, 'tts_dropout': 0.5, 'tts_embed_dims': 512, 'tts_encoder_K': 5, 'tts_encoder_dims': 256, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_lstm_dims': 1024, 'tts_num_highways': 4, 'tts_postnet_K': 5, 'tts_postnet_dims': 512, 'tts_schedule': [(1, 0.001, 20000, 12), (2, 0.0005, 40000, 12), (2, 0.0002, 80000, 12), (2, 0.0001, 160000, 12), (2, 3e-05, 320000, 12), (2, 1e-05, 640000, 12)], 'tts_stop_threshold': -3.4, 'use_lws': False, 'utterance_min_duration': 1.6, 'win_size': 800}
Synthesizer using device: cuda
Trainable Parameters: 30.870M

Loading weights at synthesizer\saved_models\pretrained\pretrained.pt
Tacotron weights loaded from step 295000
Using inputs from:
        C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets\SV2TTS\synthesizer\train.txt
        C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets\SV2TTS\synthesizer\mels
        C:\Users\tsquare\source\repos\RealTimeVoiceClone-blufsh447\toolbox\datasets\SV2TTS\synthesizer\embeds
Found 25164 samples
Length of dataloader is: 1573
  0%|                                        | 0/1573 [00:57<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\tsquare\source\repos\TomTRTVC\vocoder_preprocess.py", line 65, in <module>
    run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
  File "C:\Users\tsquare\source\repos\TomTRTVC\synthesizer\synthesize.py", line 89, in run_synthesis
    _, mels_out, _, _ = model(texts, mels, embeds)  # added addl. per blue-fish
  File "C:\Users\tsquare\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\tsquare\source\repos\TomTRTVC\synthesizer\models\tacotron.py", line 390, in forward
    encoder_seq_proj = self.encoder_proj(encoder_seq)
  File "C:\Users\tsquare\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\tsquare\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\tsquare\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\functional.py", line 1612, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [2640 x 1024], m2: [512 x 128] at C:/w/b/windows/pytorch/aten/src\THC/generic/THCTensorMathBlas.cu:283
Press any key to continue . . .

In the past, this matmul error meant that I was trying to run incompatible models (a synthesizer with a mismatched encoder, or a vocoder with a mismatched synthesizer). But in this case, I am using the pretrained encoder and synthesizer. My suspicion is that the 2nd factor of m1 should be 512, which represents a concatenation of speaker_embedding_size with the encoder output. Instead, somehow the speaker embedding becomes 768 and the concatenation results in 1024. Any ideas?
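The shapes in that error can be reproduced in isolation. A minimal sketch: the layer sizes come from the traceback above, everything else is illustrative:

```python
import torch
import torch.nn.functional as F

# The failing call is effectively a [2640 x 1024] input hitting a Linear layer
# whose weight is [128 x 512] (in_features=512), i.e. encoder_proj expects
# 512-dimensional vectors but receives 1024-dimensional ones.
x = torch.randn(2640, 1024)     # flattened encoder output concatenated with an oversized embedding
weight = torch.randn(128, 512)  # encoder_proj weight: out_features=128, in_features=512

try:
    F.linear(x, weight)
except RuntimeError as err:
    print(err)  # shape mismatch, analogous to the size mismatch in the traceback
```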

Tomcattwo commented 3 years ago

@netman789, my first thought for your issue was hparams, but your hparams look to be the same as mine. Then I noticed that the very first line after your command line was: "initializing synthesizer/synthesize".

My run (see below) does not say that... mine goes straight to Arguments.

I don't know why it would initialize synthesizer/synthesize. Check your folder structure, maybe? As you postulate, it seems like you may be running with an incompatible model, hence the matmul error?

I just ran vocoder_preprocess.py after inserting the #729 solution in synthesize.py. It ran... up to 38% complete, then I got a CUDA out-of-memory halt. Here is the output:

(VoiceClone) C:\Utilities\SV2TTS>python vocoder_preprocess.py datasets_root --model_dir synthesizer/saved_models/V13M_LS_pretrained
Arguments:
    datasets_root:   datasets_root
    model_dir:       synthesizer/saved_models/V13M_LS_pretrained
    hparams:
    no_trim:         False
    cpu:             False

{'allow_clipping_in_normalization': True,
 'clip_mels_length': True,
 'fmax': 7600,
 'fmin': 55,
 'griffin_lim_iters': 60,
 'hop_size': 200,
 'max_abs_value': 4.0,
 'max_mel_frames': 900,
 'min_level_db': -100,
 'n_fft': 800,
 'num_mels': 80,
 'power': 1.5,
 'preemphasis': 0.97,
 'preemphasize': True,
 'ref_level_db': 20,
 'rescale': True,
 'rescaling_max': 0.9,
 'sample_rate': 16000,
 'signal_normalization': True,
 'silence_min_duration_split': 0.4,
 'speaker_embedding_size': 256,
 'symmetric_mels': True,
 'synthesis_batch_size': 16,
 'trim_silence': True,
 'tts_cleaner_names': ['english_cleaners'],
 'tts_clip_grad_norm': 1.0,
 'tts_decoder_dims': 128,
 'tts_dropout': 0.5,
 'tts_embed_dims': 512,
 'tts_encoder_K': 5,
 'tts_encoder_dims': 256,
 'tts_eval_interval': 500,
 'tts_eval_num_samples': 1,
 'tts_lstm_dims': 1024,
 'tts_num_highways': 4,
 'tts_postnet_K': 5,
 'tts_postnet_dims': 512,
 'tts_schedule': [(2, 0.001, 20000, 12),
                  (2, 0.0005, 40000, 12),
                  (2, 0.0002, 80000, 12),
                  (2, 0.0001, 160000, 12),
                  (2, 3e-05, 320000, 12),
                  (2, 1e-05, 640000, 12)],
 'tts_stop_threshold': -3.4,
 'use_lws': False,
 'utterance_min_duration': 1.6,
 'win_size': 800}
Synthesizer using device: cuda
Trainable Parameters: 30.870M

Loading weights at synthesizer\saved_models\V13M_LS_pretrained\V13M_LS_pretrained.pt
Tacotron weights loaded from step 297000
Using inputs from:
        datasets_root\SV2TTS\synthesizer\train.txt
        datasets_root\SV2TTS\synthesizer\mels
        datasets_root\SV2TTS\synthesizer\embeds
Found 325 samples
  0%|                                                                                                                                                                | 0/21 [00:00<?, ?it/s]C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
  return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)
 38%|█████████████████████████████████████████████████████████▉                                                                                              | 8/21 [00:06<00:10,  1.20it/s]
Traceback (most recent call last):
  File "vocoder_preprocess.py", line 58, in <module>
    run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
  File "C:\Utilities\SV2TTS\synthesizer\synthesize.py", line 87, in run_synthesis
    _, mels_out, _, _ = model(texts, mels, embeds)
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Utilities\SV2TTS\synthesizer\models\tacotron.py", line 406, in forward
    postnet_out = self.postnet(mel_outputs)
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Utilities\SV2TTS\synthesizer\models\tacotron.py", line 161, in forward
    x, _ = self.rnn(x)
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Colt_\.conda\envs\VoiceClone\lib\site-packages\torch\nn\modules\rnn.py", line 838, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 122.00 MiB (GPU 0; 8.00 GiB total capacity; 5.94 GiB already allocated; 0 bytes free; 6.11 GiB reserved in total by PyTorch)

Then I tried again with the --cpu argument. The arguments showed cpu: True, but the output after hparams still stated "Synthesizer using device: cuda", and it failed again with a CUDA out-of-memory error at 38%.

But it did run... Time to hit the sack. R/ TC2

ghost commented 3 years ago

Then I tried again with the --cpu argument. Code said cpu = true, but the code after hparams stated: "Synthesizer using device: cuda", and it failed again on a CUDA out of memory error at 38%.

It seems the command line option is not successfully forcing CPU use. Try changing this line to:

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
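As an illustration of why the placement matters (a sketch only; where exactly vocoder_preprocess.py sets this is per the linked comment, not shown here), the variable has to be set before PyTorch first initializes CUDA, otherwise the synthesizer still reports cuda as its device:

```python
import os

# Hide all GPUs so torch falls back to the CPU; this must happen before
# torch.cuda is first queried/initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import torch  # imported after the environment variable is set

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Synthesizer using device:", device)  # expected: cpu
```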
netman789 commented 3 years ago

For a fixed model size, the only way I know of to get around OOM is to cut the sample size.

Tomcattwo commented 3 years ago

@blue-fish Thanks. I put in the fix you suggested and vocoder_preprocess.py worked properly on the CPU. I will put in the pull requests.

Next I will try vocoder_train.py

@netman789 Thanks. Reducing the sample size (to 1/3 of the total samples) was my "Plan B": run the preprocessor 3 times (once for each batch of samples) and combine the output results manually (a rough sketch of this follows). R/, TC2
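A sketch of that Plan B, assuming the dataset layout shown in the logs above; the chunk count, file names, and swap-in approach are all illustrative:

```python
from pathlib import Path

# Split SV2TTS/synthesizer/train.txt into 3 parts so vocoder_preprocess.py can
# be run on a smaller batch of samples at a time.
syn_dir = Path("datasets_root/SV2TTS/synthesizer")
lines = (syn_dir / "train.txt").read_text(encoding="utf-8").splitlines(keepends=True)

n_parts = 3
part_size = -(-len(lines) // n_parts)  # ceiling division

for i in range(n_parts):
    part = lines[i * part_size:(i + 1) * part_size]
    out_path = syn_dir / f"train_part{i + 1}.txt"
    out_path.write_text("".join(part), encoding="utf-8")
    print(f"Wrote {len(part)} entries to {out_path}")

# Before each run, copy one part over train.txt, run vocoder_preprocess.py,
# then move the generated mels_gta output aside before processing the next part.
```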

Tomcattwo commented 3 years ago

I was able to train the vocoder on top of the pretrained WaveRNN vocoder. It took about 25 minutes, starting at step 1159000 of the pretrained WaveRNN file. Loss started at 2.8245, running at about 1.2 steps/sec on CUDA, with batch size 100, LR 0.0001, sequence length 1000, and 4 steps per epoch. It rapidly converged to a loss of ~2.53-2.54 and I am not seeing much improvement. It stopped on its own at epoch 349 with loss = 2.5131.

Tomcattwo commented 3 years ago

Pull request #838 submitted for all of the above fixes. This issue is ready to be closed.