Closed RAYTRAC3R closed 3 years ago

For the past week or two, I've been training in Google Colab using the experimental branch, and it's gone well. I do have to make a few changes to the code for it to function in Colab.

However, I tried to do some more training today, and I've run into an error that I can't figure out. It happened when I ran the training script with my own dataset, using the last checkpoint I had.

I'm training at a 44100 sampling rate, with hop size, window size, etc. adjusted accordingly. I had to adjust n_speakers and decoder_rnn_dim, and turn off the second decoder, so that my old checkpoints would be compatible.
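(A rough sketch of what "adjusted accordingly" usually means: scaling the STFT settings with the sampling rate so the mel frame rate stays the same. The names and values below follow common Tacotron2-style defaults and are only illustrative; CookieTTS's actual hparams may use different names or values.)

```python
# Illustrative only: typical Tacotron2-style audio hparams scaled from the
# common 22050 Hz defaults up to 44100 Hz. Not taken from CookieTTS's hparams.py.
hparams_22k = dict(sampling_rate=22050, filter_length=1024,
                   hop_length=256, win_length=1024, mel_fmax=8000.0)

hparams_44k = dict(sampling_rate=44100, filter_length=2048,
                   hop_length=512, win_length=2048, mel_fmax=16000.0)

# Keeping hop/win as the same fraction of the sampling rate preserves the
# frame rate (~86 mel frames per second here), so the decoder sees
# similar-length inputs after the change.
assert hparams_44k["hop_length"] / hparams_44k["sampling_rate"] == \
       hparams_22k["hop_length"] / hparams_22k["sampling_rate"]
```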
Yeah... that's a new one. I can't see enough information here to identify the cause.
I'm still getting this error. I think it's connected to FP16, based on this instance of someone getting a similar error. https://github.com/pytorch/pytorch/issues/47138
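(For anyone hitting the same thing, here's a rough sketch of the difference between casting everything to half and using PyTorch's native mixed precision, `torch.cuda.amp`, available since 1.6. The toy model, data, and loss are placeholders, not CookieTTS code, and this isn't necessarily how train.py wires up FP16.)

```python
import torch
import torch.nn as nn

# Toy stand-in for a model with a sigmoid/BCE-style gate output -- not CookieTTS code.
model = nn.Linear(80, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Option A (what "FP16" often means): cast everything to half.
# Some ops lack Half kernels on older builds, and sigmoid/BCE-style ops can
# overflow or error out when forced to run entirely in FP16.
# model = model.half()

# Option B: mixed precision via autocast + GradScaler. autocast picks a
# per-op dtype and keeps numerically sensitive ops in float32.
scaler = torch.cuda.amp.GradScaler()
for _ in range(10):
    mel = torch.randn(16, 80, device="cuda")
    target = torch.rand(16, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        logits = model(mel)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```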
I tried turning off FP16 and got a little further, but then I ran into a whole separate error:
File "train.py", line 933, in <module>
train(args, args.rank, args.group_name, hparams)
File "train.py", line 749, in train
optimizer.step()
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 119, in step
group['eps']
File "/usr/local/lib/python3.6/dist-packages/torch/optim/functional.py", line 86, in adam
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (1313) must match the size of tensor b (2) at non-singleton dimension 1
Epoch:: 73% 1092/1500 [01:17<00:28, 14.08epoch/s]
Iter: : 0% 0/28 [01:17<?, ?iter/s]
/content/cookietts/CookieTTS/utils/torchmoji/model_def.py:193: UserWarning: This overload of nonzero is deprecated:
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
input_lengths = torch.LongTensor([torch.max(input_seqs[i, :].data.nonzero()) + 1 for i in range(input_seqs.size()[0])])
/content/cookietts/CookieTTS/utils/torchmoji/model_def.py:193: UserWarning: This overload of nonzero is deprecated:
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
input_lengths = torch.LongTensor([torch.max(input_seqs[i, :].data.nonzero()) + 1 for i in range(input_seqs.size()[0])])
/content/cookietts/CookieTTS/utils/torchmoji/model_def.py:193: UserWarning: This overload of nonzero is deprecated:
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
input_lengths = torch.LongTensor([torch.max(input_seqs[i, :].data.nonzero()) + 1 for i in range(input_seqs.size()[0])])
```
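(Side note on that repeated deprecation warning: it's harmless, but passing `as_tuple` explicitly makes it go away. A small self-contained sketch of the same length computation, with `input_seqs` as a made-up padded batch rather than the real torchmoji input:)

```python
import torch

# Stand-in for a batch of zero-padded token id sequences, shape [B, T].
input_seqs = torch.tensor([[5, 3, 9, 0, 0],
                           [7, 2, 0, 0, 0]])

# Same computation as the warned-about line (index of last nonzero + 1 per row),
# but with as_tuple spelled out so the deprecation warning doesn't fire.
input_lengths = torch.LongTensor([
    torch.max(torch.nonzero(input_seqs[i, :], as_tuple=False)).item() + 1
    for i in range(input_seqs.size(0))
])
print(input_lengths)  # tensor([3, 2])
```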
You'll have to use `--warm_start_force` to deal with the second error; it looks like the optimizer has changed from your checkpoint's version for some reason.
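(For context on why that helps: the size mismatch happens inside Adam's `exp_avg` update, which points to saved optimizer state whose shapes no longer match the current parameters. A warm start typically keeps only the compatible weights and rebuilds the optimizer from scratch. A minimal sketch, assuming a checkpoint dict with `state_dict` and `optimizer` keys; the real `--warm_start_force` may do more than this.)

```python
import torch

def warm_start(model, checkpoint_path, lr=1e-3):
    """Load only the weights that still match the current model, then build a
    fresh Adam so no stale exp_avg/exp_avg_sq buffers are carried over.
    Illustrative only -- key names are assumptions, not CookieTTS's layout."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    saved = ckpt.get("state_dict", ckpt)
    current = model.state_dict()
    # Keep only entries whose name and shape both still match.
    compatible = {k: v for k, v in saved.items()
                  if k in current and v.shape == current[k].shape}
    current.update(compatible)
    model.load_state_dict(current)
    # Deliberately ignore any saved optimizer state: its Adam buffers were
    # built for the old parameter shapes, which is what triggers the error.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    return model, optimizer
```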
Got some more training done by warm-starting the model. Now, whenever I try to do inference with the resulting model, the ngrok page loads okay, but when I try to generate something it crashes with this:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "app.py", line 88, in texttospeech
    tts_outdict = t2s.infer(**tts_dict)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/content/cookietts/CookieTTS/_5_infer/t2s_server/text2speech.py", line 526, in infer
    outputs = self.tacotron.inference(sequence, text_lengths.repeat_interleave(batch_size_per_text, dim=0), tacotron_speaker_ids, style_input)
  File "/content/cookietts/CookieTTS/_2_ttm/tacotron2_tm/model.py", line 1086, in inference
    res_embed, zr, r_mu, r_logvar = self.res_enc(gt_mel, rand_sampling=False)# -> [B, embed]
NameError: name 'gt_mel' is not defined
```
try again? (new commit should've updated this)
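(For anyone reading along, that `NameError` is the usual shape of this bug: a VAE-style reference encoder that expects a ground-truth mel, which doesn't exist at inference time. A purely illustrative guard, assuming the encoder returns `(embed, z, mu, logvar)` as in the traceback; this is not necessarily what the new commit does.)

```python
import torch

def residual_embedding(res_enc, gt_mel=None, batch_size=1, embed_dim=16, device="cpu"):
    """Encode a reference mel when one is available (training / style transfer),
    otherwise fall back to a prior-mean latent. Everything beyond the
    res_enc(gt_mel, rand_sampling=False) call seen in the traceback is assumed."""
    if gt_mel is not None:
        res_embed, zr, r_mu, r_logvar = res_enc(gt_mel, rand_sampling=False)
        return res_embed
    # No ground-truth mel at inference time, so don't reference gt_mel at all;
    # use zeros (the prior mean) for the residual embedding instead.
    return torch.zeros(batch_size, embed_dim, device=device)
```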
I'm still having the original sigmoid issue, but it's easily fixed by turning off fp16, and every other issue I've mentioned here has been fixed!
Closing issue.

The problem has been patched, and `tacotron2_tm` should be compatible with PyTorch 1.7 and Nvidia/Apex AMP.
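(For reference, this is the generic Nvidia/Apex AMP pattern that compatibility claim refers to, sketched with a toy model rather than the actual training script.)

```python
import torch
import torch.nn as nn
from apex import amp  # requires https://github.com/NVIDIA/apex

model = nn.Linear(80, 80).cuda()  # toy model, not tacotron2_tm
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# O1 = patch most ops to run in FP16 while keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(16, 80, device="cuda")
loss = model(x).pow(2).mean()

# Scale the loss so small FP16 gradients don't underflow to zero.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```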