coqui-ai / Trainer

🐸 - A general purpose model trainer, as flexible as it gets

Prevent dangling gradients in multiple-optimizer setup #76

Closed · Edresson closed this 1 year ago

Edresson commented 2 years ago

It fixes #60.

For the fix to work, we need to guarantee that the train_step for each optimizer is independent. This means we have to run the generator twice and cannot cache its output. We need to update the Coqui TTS GAN models to meet this requirement after the merge.
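For clarity, here is a minimal sketch of what an independent per-optimizer `train_step` could look like. This is hypothetical code, not the actual Coqui TTS model: `ToyGAN`, `criterion.disc_loss`, `criterion.gen_loss`, and the batch keys are placeholders. The point is that the generator forward pass runs in both branches instead of its output being cached across optimizer steps.

```python
import torch


class ToyGAN(torch.nn.Module):
    """Hypothetical GAN wrapper illustrating the independence requirement."""

    def __init__(self, generator, discriminator):
        super().__init__()
        self.generator = generator
        self.discriminator = discriminator

    def train_step(self, batch, criterion, optimizer_idx):
        if optimizer_idx == 0:
            # Discriminator step: detach the fake sample so only D receives gradients.
            fake = self.generator(batch["input"]).detach()
            d_real = self.discriminator(batch["real"])
            d_fake = self.discriminator(fake)
            return {"loss": criterion.disc_loss(d_real, d_fake)}
        if optimizer_idx == 1:
            # Generator step: run the generator AGAIN so this step has its own graph.
            # Reusing a cached/detached output here would leave the loss without a grad_fn.
            fake = self.generator(batch["input"])
            d_fake = self.discriminator(fake)
            return {"loss": criterion.gen_loss(d_fake)}
        raise ValueError(f"unexpected optimizer_idx: {optimizer_idx}")
```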

Samples

TTS model without this fix:

https://user-images.githubusercontent.com/28763586/197186627-7c9af6ff-55df-476b-91a8-21eda6755702.mp4

TTS model with this fix:

https://user-images.githubusercontent.com/28763586/197186698-9bbca229-5c29-4c44-9a93-6e0eee91ec8f.mp4

erogol commented 2 years ago

I'll review next week.

GerrySant commented 1 year ago

I have tried this PR on a VITS training and I run into this error:


```
['/home/usuaris/veu/gerard.muniesa/repositories/TTS_080/TTS/TTS/bin/train_tts.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_11_05-131455', '--use_ddp=true', '-gpus', '0,1', '--config_path', '/home/usuaris/scratch/gerard.muniesa/TTS/config_1e2_1e2.json', '--rank=0']
['/home/usuaris/veu/gerard.muniesa/repositories/TTS_080/TTS/TTS/bin/train_tts.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_11_05-131455', '--use_ddp=true', '-gpus', '0,1', '--config_path', '/home/usuaris/scratch/gerard.muniesa/TTS/config_1e2_1e2.json', '--rank=1']
thismodule: vctk_old
thismodule: vctk_old
fatal: not a git repository (or any parent up to mount point /home/usuaris)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /home/usuaris)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 2
 | > Num. of CPUs: 40
 | > Num. of Torch Threads: 4
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False

 > Model has 86476204 parameters

 > EPOCH: 0/1000
 --> /home/usuaris/scratch/gerard.muniesa/TTS/multispeaker_vits_ca_1e2_1e2-November-05-2022_01+15PM-0000000

 > TRAINING (2022-11-05 13:15:41) 
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448224956/work/aten/src/ATen/native/SpectralOps.cpp:664.)
  normalized, onesided, return_complex)
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py:994: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  grad_norm = torch.nn.utils.clip_grad_norm_(self.master_params(optimizer), grad_clip)
 ! Run is kept in /home/usuaris/scratch/gerard.muniesa/TTS/multispeaker_vits_ca_1e2_1e2-November-05-2022_01+15PM-0000000
Traceback (most recent call last):
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1500, in fit
    self._fit()
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1485, in _fit
    self.train_epoch()
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1264, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1145, in train_step
    num_optimizers=len(self.optimizer),
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 988, in _optimize
    scaler.scale(loss_dict["loss"]).backward()
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448224956/work/aten/src/ATen/native/SpectralOps.cpp:664.)
  normalized, onesided, return_complex)
/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py:994: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  grad_norm = torch.nn.utils.clip_grad_norm_(self.master_params(optimizer), grad_clip)
Traceback (most recent call last):
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1500, in fit
    self._fit()
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1485, in _fit
    self.train_epoch()
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1264, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 1145, in train_step
    num_optimizers=len(self.optimizer),
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/trainer-0.0.16-py3.7.egg/trainer/trainer.py", line 988, in _optimize
    scaler.scale(loss_dict["loss"]).backward()
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/usuaris/veu/gerard.muniesa/conda/envs/TTS_080_test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
slurmstepd: error: *** JOB 2356593 ON veuc11 CANCELLED AT 2022-11-05T13:17:35 ***
```

I have the newest version of Coqui TTS and trainer.

I do not get this error when using the main branch.

I attach the config.json (as config.txt).

erogol commented 1 year ago

@Edresson can you check this, maybe?

Edresson commented 1 year ago

> I have tried this PR on a VITS training and I run into this error: `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` [...]

@GerrySant @erogol As I said in the PR description: "For the fix to work, we need to guarantee that the train_step for each optimizer is independent. This means we have to run the generator twice and cannot cache its output. We need to update the Coqui TTS GAN models to meet this requirement after the merge." For this reason, using only this PR's changes will not work. @GerrySant, if you want it to work, you need to call the generator twice, once for the discriminator loss and once for the generator loss. Otherwise there will be no gradients for the generator weight update (and it will raise the error you noticed).
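For context on the traceback above: that RuntimeError means `loss_dict["loss"]` reaches `backward()` with `requires_grad=False`, i.e. nothing in the loss graph connects back to trainable parameters, which, per the explanation above, is what happens when the generator step reuses a cached/detached generator output. A minimal, standalone reproduction of just the error (hypothetical, not Coqui TTS code):

```python
import torch

# A loss tensor with no autograd history behaves exactly like the failing generator loss:
loss = torch.tensor(1.0)  # requires_grad=False, no grad_fn
loss.backward()           # raises: RuntimeError: element 0 of tensors does not require
                          # grad and does not have a grad_fn
```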

erogol commented 1 year ago

@Edresson how about running the generator step first? Would that help?

Edresson commented 1 year ago

> @Edresson how about running the generator step first? Would that help?

I don't think we can do that, because the order matters. First we need to compute the discriminator loss and update the discriminator's weights, and then use the discriminator with the updated weights to compute the generator loss. In the past, our HiFi-GAN training achieved worse results than the original implementation, and the bug turned out to be the order of the optimizers. If my memory serves, it was the same issue with the BWE model; I fixed it for HiFi-GAN and @WeberJulian replicated the fix for the BWE model.
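A hedged sketch of that ordering (a hypothetical helper, not the Trainer's actual loop; `model`, `batch`, `criterion`, and the two optimizers are assumed to exist): the discriminator optimizer runs first, so the generator loss in the second pass is computed against the already-updated discriminator weights.

```python
def run_gan_step(model, batch, criterion, disc_optimizer, gen_optimizer):
    """Hypothetical two-optimizer step: discriminator first, then generator."""
    # Index 0 is the discriminator, index 1 is the generator; the order matters because
    # the generator loss should see the discriminator weights updated in this same step.
    for optimizer_idx, optimizer in enumerate([disc_optimizer, gen_optimizer]):
        optimizer.zero_grad()
        outputs = model.train_step(batch, criterion, optimizer_idx)
        outputs["loss"].backward()
        optimizer.step()
```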

erogol commented 1 year ago

Just closing this in favor of #89.