NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Finetuning HiFiGAN gets stuck at a random step #6607

Closed OedoSoldier closed 1 year ago

OedoSoldier commented 1 year ago

Describe the bug

I'm trying to fine-tune HiFiGAN with this library under WSL2. I can start the training process, but it gets stuck at a random step.

Steps/Code to reproduce bug

Command used to launch training:

LD_LIBRARY_PATH=/home/oedosoldier/anaconda3/envs/fs/lib:$LD_LIBRARY_PATH HYDRA_FULL_ERROR=1 PYTHONPATH=./jiaran//codes/NeMo CUDA_VISIBLE_DEVICES=0 \
  nohup python ./jiaran//codes/NeMo/examples/tts/hifigan_finetune.py \
  train_dataset=./jiaran//metas/nemo/train_manifest_mel.json \
  validation_datasets=./jiaran//metas/nemo/val_manifest_mel.json \
  exp_manager.exp_dir=./jiaran//results \
  model/train_ds=train_ds_finetune \
  model/validation_ds=val_ds_finetune \
  trainer.strategy=null \
  name=hifigan \
  trainer.check_val_every_n_epoch=1 \
  model.train_ds.dataloader_params.batch_size=8 \
  model.validation_ds.dataloader_params.batch_size=8 \
  model.train_ds.dataloader_params.num_workers=4 \
  model.validation_ds.dataloader_params.num_workers=4 \
  +init_from_pretrained_model=tts_zh_hifigan_sfspeech \
  --config-name hifigan.yaml > log.out 2>&1 &

Output:

nohup: ignoring input
[NeMo W 2023-05-09 20:54:32 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-05-09 20:54:34 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.fastpitch_ssl.FastPitchModel_SSL'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.ssl_tts.SSLDisentangler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 experimental:27] Module <class 'nemo.collections.tts.models.vits.VitsModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-09 20:54:36 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'hifigan.yaml': Defaults list is missing `_self_`. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
      warnings.warn(msg, UserWarning)

[NeMo W 2023-05-09 20:54:37 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2023-05-09 20:54:37 exp_manager:716] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2023-05-09 20:54:37 exp_manager:568] There was no checkpoint folder at checkpoint_dir :jiaran/results/hifigan/checkpoints. Training from scratch.
[NeMo I 2023-05-09 20:54:37 exp_manager:374] Experiments will be logged at jiaran/results/hifigan
[NeMo I 2023-05-09 20:54:37 exp_manager:797] TensorboardLogger has been set up
[NeMo W 2023-05-09 20:54:37 exp_manager:893] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 2500000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2023-05-09 20:54:37 dataset:1045] Loading dataset from ./jiaran//metas/nemo/train_manifest_mel.json.
3191it [00:00, 206435.27it/s]
[NeMo I 2023-05-09 20:54:37 dataset:1069] Loaded dataset with 3191 files.
[NeMo I 2023-05-09 20:54:37 dataset:1071] Dataset contains 6.11 hours.
[NeMo I 2023-05-09 20:54:37 dataset:377] Pruned 0 files. Final dataset contains 3191 files
[NeMo I 2023-05-09 20:54:37 dataset:379] Pruned 0.00 hours. Final dataset contains 6.11 hours.
[NeMo I 2023-05-09 20:54:37 dataset:1045] Loading dataset from ./jiaran//metas/nemo/val_manifest_mel.json.
32it [00:00, 54317.17it/s]
[NeMo I 2023-05-09 20:54:37 dataset:1069] Loaded dataset with 32 files.
[NeMo I 2023-05-09 20:54:37 dataset:1071] Dataset contains 0.06 hours.
[NeMo I 2023-05-09 20:54:37 dataset:377] Pruned 0 files. Final dataset contains 32 files
[NeMo I 2023-05-09 20:54:37 dataset:379] Pruned 0.00 hours. Final dataset contains 0.06 hours.
[NeMo I 2023-05-09 20:54:37 features:291] PADDING: 0
[NeMo I 2023-05-09 20:54:37 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:54:37 features:291] PADDING: 0
[NeMo I 2023-05-09 20:54:37 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:54:38 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_zh_fastpitch_hifigan_sfspeech/versions/1.15.0/files/tts_zh_hifigan_sfspeech.nemo to /home/oedosoldier/.cache/torch/NeMo/NeMo_1.18.0rc0/tts_zh_hifigan_sfspeech/ed4a2b913f208e59e3f5f96705394784/tts_zh_hifigan_sfspeech.nemo
[NeMo I 2023-05-09 20:55:01 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-05-09 20:55:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    dataset:
      _target_: nemo.collections.tts.torch.data.VocoderDataset
      manifest_filepath: /pred/train_manifest_mel.json
      sample_rate: 22050
      n_segments: 8192
      max_duration: null
      min_duration: 0.75
      load_precomputed_mel: true
      hop_length: 256
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 88
      num_workers: 5
      pin_memory: true

[NeMo W 2023-05-09 20:55:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    dataset:
      _target_: nemo.collections.tts.torch.data.VocoderDataset
      manifest_filepath: /pred/val_manifest_mel.json
      sample_rate: 22050
      n_segments: 66048
      max_duration: null
      min_duration: 3
      load_precomputed_mel: true
      hop_length: 256
    dataloader_params:
      drop_last: false
      shuffle: false
      batch_size: 88
      num_workers: 5
      pin_memory: true

[NeMo I 2023-05-09 20:55:02 features:291] PADDING: 0
[NeMo I 2023-05-09 20:55:02 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:55:02 features:291] PADDING: 0
[NeMo I 2023-05-09 20:55:02 features:299] STFT using exact pad
[NeMo I 2023-05-09 20:55:02 save_restore_connector:249] Model HifiGanModel was successfully restored from /home/oedosoldier/.cache/torch/NeMo/NeMo_1.18.0rc0/tts_zh_hifigan_sfspeech/ed4a2b913f208e59e3f5f96705394784/tts_zh_hifigan_sfspeech.nemo.
[NeMo I 2023-05-09 20:55:02 modelPT:1269] Model checkpoint restored from pretrained checkpoint with name : `tts_zh_hifigan_sfspeech`
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                       | Type                     | Params
------------------------------------------------------------------------
0 | audio_to_melspec_precessor | FilterbankFeatures       | 0
1 | trg_melspec_fn             | FilterbankFeatures       | 0
2 | generator                  | Generator                | 13.9 M
3 | mpd                        | MultiPeriodDiscriminator | 41.1 M
4 | msd                        | MultiScaleDiscriminator  | 29.6 M
5 | feature_loss               | FeatureMatchingLoss      | 0
6 | discriminator_loss         | DiscriminatorLoss        | 0
7 | generator_loss             | GeneratorLoss            | 0
------------------------------------------------------------------------
84.7 M    Trainable params
0         Non-trainable params
84.7 M    Total params
169.321   Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s][NeMo W 2023-05-09 20:55:03 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/torch/functional.py:641: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
      return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

Training: 0it [00:00, ?it/s][NeMo I 2023-05-09 20:55:03 preemption:56] Preemption requires torch distributed to be initialized, disabling preemption
Epoch 0:   0%|          | 0/403 [00:00<?, ?it/s] [NeMo W 2023-05-09 20:55:04 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

[NeMo W 2023-05-09 20:55:04 nemo_logging:349] /home/oedosoldier/anaconda3/envs/fs/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(

Epoch 0:   0%|          | 2/403 [00:00<03:13,  2.07it/s, v_num=38, g_l1_loss=0.785]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ stuck here

Expected behavior

Training should continue past these steps instead of hanging indefinitely.

Environment overview (please complete the following information)

Environment details

Additional context

I tried to figure out at which step it gets stuck; it turns out the hang happens inside manual_backward().

In hifigan.py there are two manual_backward() calls, and training randomly hangs at one of them (a diagnostic sketch for locating the hang follows the snippet):

        # Train discriminator
        optim_d.zero_grad()
        mpd_score_real, mpd_score_gen, _, _ = self.mpd(y=audio, y_hat=audio_pred.detach())
        loss_disc_mpd, _, _ = self.discriminator_loss(
            disc_real_outputs=mpd_score_real, disc_generated_outputs=mpd_score_gen
        )
        msd_score_real, msd_score_gen, _, _ = self.msd(y=audio, y_hat=audio_pred.detach())
        loss_disc_msd, _, _ = self.discriminator_loss(
            disc_real_outputs=msd_score_real, disc_generated_outputs=msd_score_gen
        )
        loss_d = loss_disc_msd + loss_disc_mpd
        self.manual_backward(loss_d)  # <- randomly hangs here
        optim_d.step()

        # Train generator
        optim_g.zero_grad()
        loss_mel = F.l1_loss(audio_pred_mel, audio_trg_mel)
        _, mpd_score_gen, fmap_mpd_real, fmap_mpd_gen = self.mpd(y=audio, y_hat=audio_pred)
        _, msd_score_gen, fmap_msd_real, fmap_msd_gen = self.msd(y=audio, y_hat=audio_pred)
        loss_fm_mpd = self.feature_loss(fmap_r=fmap_mpd_real, fmap_g=fmap_mpd_gen)
        loss_fm_msd = self.feature_loss(fmap_r=fmap_msd_real, fmap_g=fmap_msd_gen)
        loss_gen_mpd, _ = self.generator_loss(disc_outputs=mpd_score_gen)
        loss_gen_msd, _ = self.generator_loss(disc_outputs=msd_score_gen)
        loss_g = loss_gen_msd + loss_gen_mpd + loss_fm_msd + loss_fm_mpd + loss_mel * self.l1_factor
        self.manual_backward(loss_g)  # <- randomly hangs here
        optim_g.step()
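
For anyone trying to narrow this down further: one way to confirm exactly where the process is blocked is to install a traceback watchdog before training starts. The sketch below is not part of NeMo; it only uses the Python standard library (faulthandler, signal) and assumes it is added near the top of examples/tts/hifigan_finetune.py (or a wrapper script). It dumps the stack of every thread either on demand (kill -USR1 <training pid>) or automatically after a timeout, which should show whether the blocked manual_backward() is waiting on a CUDA kernel (the ComplexHalf STFT warning under 16-bit AMP in the log above is one suspect) or on a DataLoader worker.

    # Hypothetical diagnostic snippet (not part of NeMo): place it near the top of
    # examples/tts/hifigan_finetune.py, before trainer.fit() is reached.
    import faulthandler
    import signal
    import sys

    # Dump all thread stacks when the process receives SIGUSR1,
    # e.g. from another terminal:  kill -USR1 <training pid>
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

    # Also dump stacks automatically every 10 minutes; if training is healthy
    # the dumps are harmless noise, and if it hangs the last dump shows where.
    # stderr is redirected to log.out by the launch command above, so the
    # dumps will land there.
    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

If py-spy happens to be installed, `py-spy dump --pid <training pid>` gives similar information without modifying the script. Either way, knowing which thread is blocked makes it easier to decide whether trainer.precision, num_workers, or something WSL2-specific is the right knob to change.
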
github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.