NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Finetuning HiFiGAN gets stuck at a random step #6607

Closed OedoSoldier closed 1 year ago

OedoSoldier commented 1 year ago

Describe the bug

I'm trying to fine-tune HiFiGAN with this library under WSL2. I can start the training process, but it gets stuck at a random step.

Steps/Code to reproduce bug

Command used to launch training:

LD_LIBRARY_PATH=/home/oedosoldier/anaconda3/envs/fs/lib:$LD_LIBRARY_PATH HYDRA_FULL_ERROR=1 PYTHONPATH=./jiaran//codes/NeMo CUDA_VISIBLE_DEVICES=0 \
  nohup python ./jiaran//codes/NeMo/examples/tts/hifigan_finetune.py \
  train_dataset=./jiaran//metas/nemo/train_manifest_mel.json \
  validation_datasets=./jiaran//metas/nemo/val_manifest_mel.json \
  exp_manager.exp_dir=./jiaran//results \
  model/train_ds=train_ds_finetune \
  model/validation_ds=val_ds_finetune \
  trainer.strategy=null \
  name=hifigan \
  trainer.check_val_every_n_epoch=1 \
  model.train_ds.dataloader_params.batch_size=8 \
  model.validation_ds.dataloader_params.batch_size=8 \
  model.train_ds.dataloader_params.num_workers=4 \
  model.validation_ds.dataloader_params.num_workers=4 \
  +init_from_pretrained_model=tts_zh_hifigan_sfspeech \
  --config-name hifigan.yaml > log.out 2>&1 &

Output:

nohup: ignoring input
[NeMo W 2023-05-09 20:54:32 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-05-09 20:54:34 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.fastpitch_ssl.FastPitchModel_SSL'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.ssl_tts.SSLDisentangler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.vits.VitsModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'hifigan.yaml': Defaults list is missing `_self_`. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
      warnings.warn(msg, UserWarning)

[NeMo W 2023-05-09 20:54:37 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2023-05-09 20:54:37 exp_manager:716] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2023-05-09 20:54:37 exp_manager:568] There was no checkpoint folder at checkpoint_dir :jiaran/results/hifigan/checkpoints. Training from scratch.
[NeMo I 2023-05-09 20:54:37 exp_manager:374] Experiments will be logged at jiaran/results/hifigan
[NeMo I 2023-05-09 20:54:37 exp_manager:797] TensorboardLogger has been set up
[NeMo W 2023-05-09 20:54:37 exp_manager:893] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 2500000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2023-05-09 20:54:37 dataset:1045] Loading dataset from ./jiaran//metas/nemo/train_manifest_mel.json.
3191it [00:00, 206435.27it/s]
[NeMo I 2023-05-09 20:54:37 dataset:1069] Loaded dataset with 3191 files.
[NeMo I 2023-05-09 20:54:37 dataset:1071] Dataset contains 6.11 hours.
[NeMo I 2023-05-09 20:54:37 dataset:377] Pruned 0 files. Final dataset contains 3191 files
[NeMo I 2023-05-09 20:54:37 dataset:379] Pruned 0.00 hours. Final dataset contains 6.11 hours.
[NeMo I 2023-05-09 20:54:37 dataset:1045] Loading dataset from ./jiaran//metas/nemo/val_manifest_mel.json.
32it [00:00, 54317.17it/s]
[NeMo I 2023-05-09 20:54:37 dataset:1069] Loaded dataset with 32 files.
[NeMo I 2023-05-09 20:54:37 dataset:1071] Dataset contains 0.06 hours.
[NeMo I 2023-05-09 20:54:37 dataset:377] Pruned 0 files. Final dataset contains 32 files
[NeMo I 2023-05-09 20:54:37 dataset:379] Pruned 0.00 hours. Final dataset contains 0.06 hours.
[NeMo I 2023-05-09 20:54:37 features:291] PADDING: 0
[NeMo I 2023-05-09 20:54:37 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:54:37 features:291] PADDING: 0
[NeMo I 2023-05-09 20:54:37 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:54:38 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_zh_fastpitch_hifigan_sfspeech/versions/1.15.0/files/tts_zh_hifigan_sfspeech.nemo to /home/oedosoldier/.cache/torch/NeMo/NeMo_1.18.0rc0/tts_zh_hifigan_sfspeech/ed4a2b913f208e59e3f5f96705394784/tts_zh_hifigan_sfspeech.nemo
[NeMo I 2023-05-09 20:55:01 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-05-09 20:55:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    dataset:
      _target_: nemo.collections.tts.torch.data.VocoderDataset
      manifest_filepath: /pred/train_manifest_mel.json
      sample_rate: 22050
      n_segments: 8192
      max_duration: null
      min_duration: 0.75
      load_precomputed_mel: true
      hop_length: 256
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 88
      num_workers: 5
      pin_memory: true

[NeMo W 2023-05-09 20:55:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    dataset:
      _target_: nemo.collections.tts.torch.data.VocoderDataset
      manifest_filepath: /pred/val_manifest_mel.json
      sample_rate: 22050
      n_segments: 66048
      max_duration: null
      min_duration: 3
      load_precomputed_mel: true
      hop_length: 256
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 88
      num_workers: 5
      pin_memory: true

[NeMo I 2023-05-09 20:55:02 features:291] PADDING: 0
[NeMo I 2023-05-09 20:55:02 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:55:02 features:291] PADDING: 0
[NeMo I 2023-05-09 20:55:02 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:55:02 save_restore_connector:249] Model HifiGanModel was successfully restored from /home/oedosoldier/.cache/torch/NeMo/NeMo_1.18.0rc0/tts_zh_hifigan_sfspeech/ed4a2b913f208e59e3f5f96705394784/tts_zh_hifigan_sfspeech.nemo.
[NeMo I 2023-05-09 20:55:02 modelPT:1269] Model checkpoint restored from pretrained checkpoint with name : `tts_zh_hifigan_sfspeech`
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                       | Type                     | Params
------------------------------------------------------------------------
0 | audio_to_melspec_precessor | FilterbankFeatures       | 0
1 | trg_melspec_fn             | FilterbankFeatures       | 0
2 | generator                  | Generator                | 13.9 M
3 | mpd                        | MultiPeriodDiscriminator | 41.1 M
4 | msd                        | MultiScaleDiscriminator  | 29.6 M
5 | feature_loss               | FeatureMatchingLoss      | 0
6 | discriminator_loss         | DiscriminatorLoss        | 0
7 | generator_loss             | GeneratorLoss            | 0
------------------------------------------------------------------------
84.7 M    Trainable params
0         Non-trainable params
84.7 M    Total params
169.321   Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s][NeMo W 2023-05-09 20:55:03 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/torch/functional.py:641: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
      return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

Training: 0it [00:00, ?it/s][NeMo I 2023-05-09 20:55:03 preemption:56] Preemption requires torch distributed to be initialized, disabling preemption
Epoch 0:   0%|          | 0/403 [00:00<?, ?it/s] [NeMo W 2023-05-09 20:55:04 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

[NeMo W 2023-05-09 20:55:04 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(

Epoch 0:   0%|          | 2/403 [00:00<03:13,  2.07it/s, v_num=38, g_l1_loss=0.785]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ stuck here

Expected behavior

Training should continue past these steps instead of hanging indefinitely.

Environment overview (please complete the following information)

Environment details

Additional context

I tried to figure out at which step it gets stuck; it turns out the hang happens inside manual_backward().

In hifigan.py there are two manual_backward() calls, and training randomly hangs at one of them (a diagnostic sketch for locating the hang follows the snippet):

        # Train discriminator
        optim_d.zero_grad()
        mpd_score_real, mpd_score_gen, _, _ = self.mpd(y=audio, y_hat=audio_pred.detach())
        loss_disc_mpd, _, _ = self.discriminator_loss(
            disc_real_outputs=mpd_score_real, disc_generated_outputs=mpd_score_gen
        )
        msd_score_real, msd_score_gen, _, _ = self.msd(y=audio, y_hat=audio_pred.detach())
        loss_disc_msd, _, _ = self.discriminator_loss(
            disc_real_outputs=msd_score_real, disc_generated_outputs=msd_score_gen
        )
        loss_d = loss_disc_msd + loss_disc_mpd
        self.manual_backward(loss_d)  # <- randomly hangs here
        optim_d.step()

        # Train generator
        optim_g.zero_grad()
        loss_mel = F.l1_loss(audio_pred_mel, audio_trg_mel)
        _, mpd_score_gen, fmap_mpd_real, fmap_mpd_gen = self.mpd(y=audio, y_hat=audio_pred)
        _, msd_score_gen, fmap_msd_real, fmap_msd_gen = self.msd(y=audio, y_hat=audio_pred)
        loss_fm_mpd = self.feature_loss(fmap_r=fmap_mpd_real, fmap_g=fmap_mpd_gen)
        loss_fm_msd = self.feature_loss(fmap_r=fmap_msd_real, fmap_g=fmap_msd_gen)
        loss_gen_mpd, _ = self.generator_loss(disc_outputs=mpd_score_gen)
        loss_gen_msd, _ = self.generator_loss(disc_outputs=msd_score_gen)
        loss_g = loss_gen_msd + loss_gen_mpd + loss_fm_msd + loss_fm_mpd + loss_mel * self.l1_factor
        self.manual_backward(loss_g)  # <- randomly hangs here
        optim_g.step()
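
For anyone trying to narrow this down further: one way to confirm exactly where the process is blocked is to install a traceback watchdog before training starts. The sketch below is not part of NeMo; it only uses the Python standard library (faulthandler, signal) and assumes it is added near the top of examples/tts/hifigan_finetune.py (or a wrapper script). It dumps the stack of every thread either on demand (kill -USR1 <training pid>) or automatically after a timeout, which should show whether the blocked manual_backward() is waiting on a CUDA kernel (the ComplexHalf STFT warning under 16-bit AMP in the log above is one suspect) or on a DataLoader worker.

    # Hypothetical diagnostic snippet (not part of NeMo): place it near the top of
    # examples/tts/hifigan_finetune.py, before trainer.fit() is reached.
    import faulthandler
    import signal
    import sys

    # Dump all thread stacks when the process receives SIGUSR1,
    # e.g. from another terminal:  kill -USR1 <training pid>
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

    # Also dump stacks automatically every 10 minutes; if training is healthy
    # the dumps are harmless noise, and if it hangs the last dump shows where.
    # stderr is redirected to log.out by the launch command above, so the
    # dumps will land there.
    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

If py-spy happens to be installed, `py-spy dump --pid <training pid>` gives similar information without modifying the script. Either way, knowing which thread is blocked makes it easier to decide whether trainer.precision, num_workers, or something WSL2-specific is the right knob to change.
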
github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.