I'll review next week.
I have tried this PR on a VITS training and I run into this error:

```
['/home/usuaris/veu/gerard.muniesa/repositories/TTS_080/TTS/TTS/bin/train_tts.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_11_05-131455', '--use_ddp=true', '-gpus', '0,1', '--config_path', '/home/usuaris/scratch/gerard.muniesa/TTS/config_1e2_1e2.json', '--rank=0']
['/home/usuaris/veu/gerard.muniesa/repositories/TTS_080/TTS/TTS/bin/train_tts.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_11_05-131455', '--use_ddp=true', '-gpus', '0,1', '--config_path', '/home/usuaris/scratch/gerard.muniesa/TTS/config_1e2_1e2.json', '--rank=1']
thismodule: vctk_old
thismodule: vctk_old
fatal: not a git repository (or any parent up to mount point /home/usuaris)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /home/usuaris)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> Training Environment:
| > Current device: 0
| > Num. of GPUs: 2
| > Num. of CPUs: 40
| > Num. of Torch Threads: 4
| > Torch seed: 54321
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
> Model has 86476204 parameters
 > EPOCH: 0/1000
--> /home/usuaris/scratch/gerard.muniesa/TTS/multispeaker_vits_ca_1e2_1e2-November-05-2022_01+15PM-0000000
 > TRAINING (2022-11-05 13:15:41)
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448224956/work/aten/src/ATen/native/SpectralOps.cpp:664.)
normalized, onesided, return_complex)
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py:994: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
grad_norm = torch.nn.utils.clip_grad_norm_(self.master_params(optimizer), grad_clip)
! Run is kept in /home/usuaris/scratch/gerard.muniesa/TTS/multispeaker_vits_ca_1e2_1e2-November-05-2022_01+15PM-0000000
Traceback (most recent call last):
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1500, in fit
self._fit()
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1485, in _fit
self.train_epoch()
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1264, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1145, in train_step
num_optimizers=len(self.optimizer),
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 988, in _optimize
scaler.scale(loss_dict["loss"]).backward()
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448224956/work/aten/src/ATen/native/SpectralOps.cpp:664.)
normalized, onesided, return_complex)
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py:994: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
grad_norm = torch.nn.utils.clip_grad_norm_(self.master_params(optimizer), grad_clip)
Traceback (most recent call last):
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1500, in fit
self._fit()
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1485, in _fit
self.train_epoch()
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1264, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1145, in train_step
num_optimizers=len(self.optimizer),
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 988, in _optimize
scaler.scale(loss_dict["loss"]).backward()
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
slurmstepd: error: *** JOB 2356593 ON veuc11 CANCELLED AT 2022-11-05T13:17:35 ***
```
I have the newest versions of Coqui TTS and the trainer.
I do not get this error when using the main branch.
I attach the config.json: config.txt
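For what it's worth, as far as I understand this RuntimeError means the loss tensor was built from outputs that have no gradient path (e.g. computed under `torch.no_grad()` or reused/detached). A minimal, standalone illustration (not Coqui TTS code), just to show what the error refers to:

```python
import torch

lin = torch.nn.Linear(4, 1)
x = torch.randn(2, 4)

# An output produced under no_grad (or detached/cached) has no grad_fn,
# so calling backward() on a loss built from it raises the same error.
with torch.no_grad():
    cached = lin(x)

loss = cached.sum()
try:
    loss.backward()
except RuntimeError as err:
    print(err)  # element 0 of tensors does not require grad and does not have a grad_fn

# Re-running the forward pass with gradients enabled gives a valid graph.
lin(x).sum().backward()
```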
@Edresson, can you check this, maybe?
@GerrySant @erogol As I said in the PR description: "For the fix to work, we need to guarantee that the train_step for each optimizer is independent. It means that we need to run the generator twice and we can't cache its output. We need to update the Coqui TTS GAN models to meet this requirement after the merge." For this reason, just applying this PR's changes will not work. @GerrySant, if you want it to work, you need to call the generator twice: once for the discriminator loss and once for the generator loss. Otherwise there will be no grad for the generator weight update (and it will raise the error that you have noticed).
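A rough sketch of the requirement, assuming a generic PyTorch two-optimizer train_step with placeholder modules and losses (none of the names below are the actual Coqui TTS / trainer API):

```python
import torch
import torch.nn.functional as F

# Placeholder modules standing in for the VITS generator and discriminator.
generator = torch.nn.Linear(8, 8)
discriminator = torch.nn.Linear(8, 1)

def train_step(batch, optimizer_idx):
    """Hypothetical two-optimizer train_step: the generator forward pass is
    re-run for each optimizer instead of being cached across them."""
    real, inputs = batch

    if optimizer_idx == 0:
        # Discriminator step: the fake sample is produced without grad,
        # so this loss only updates the discriminator.
        with torch.no_grad():
            fake = generator(inputs)
        real_logits = discriminator(real)
        fake_logits = discriminator(fake)
        return (
            F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
        )

    # Generator step: the generator is run AGAIN with grad enabled. Reusing the
    # cached `fake` from the discriminator step would have no grad_fn and would
    # raise "element 0 of tensors does not require grad and does not have a grad_fn".
    fake = generator(inputs)
    fake_logits = discriminator(fake)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

batch = (torch.randn(4, 8), torch.randn(4, 8))
train_step(batch, 0).backward()  # discriminator loss backpropagates
train_step(batch, 1).backward()  # generator loss now reaches the generator weights
```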
@Edresson how about running the generator step first? Would that help?
I think that we can't do that because order matters. First we need to compute the loss for the discriminator and update its weights, and then use the discriminator with the updated weights to compute the loss of the generator. In the past, our HiFi-GAN training achieved worse results than the original implementation, and the bug was the order of the optimizers. If my memory is not failing, it was the same issue with the BWE model; I fixed it for HiFi-GAN and @WeberJulian replicated the fix for the BWE model.
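A condensed sketch of the ordering being described, with illustrative modules, losses, and optimizers rather than the real training loop:

```python
import torch

# Illustrative modules and optimizers; not the actual Coqui TTS training loop.
generator = torch.nn.Linear(8, 8)
discriminator = torch.nn.Linear(8, 1)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

def training_iteration(real, inputs):
    # 1) Discriminator first: compute its loss (LSGAN-style here) and step it.
    fake = generator(inputs).detach()
    d_loss = ((discriminator(real) - 1) ** 2).mean() + (discriminator(fake) ** 2).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator second: its loss is computed against the *updated* discriminator.
    fake = generator(inputs)
    g_loss = ((discriminator(fake) - 1) ** 2).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

training_iteration(torch.randn(4, 8), torch.randn(4, 8))
```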
Just closing this for the sake of #89
It fixes #60.
For the fix to work, we need to guarantee that the train_step for each optimizer is independent. It means that we need to run the generator twice and we can't cache its output. We need to update the Coqui TTS GAN models to meet this requirement after the merge.
Samples
TTS model without this fix:
https://user-images.githubusercontent.com/28763586/197186627-7c9af6ff-55df-476b-91a8-21eda6755702.mp4
TTS model with this fix:
https://user-images.githubusercontent.com/28763586/197186698-9bbca229-5c29-4c44-9a93-6e0eee91ec8f.mp4