coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

VITS training fails on multi-GPU #1639

Closed: planetrocke closed this issue 2 years ago

planetrocke commented 2 years ago

Describe the bug

I successfully trained this dataset with VITS on a single GPU. When I attempted to train it on multiple GPUs, it froze at STEP 0. The error is:

'DistributedDataParallel' object has no attribute 'test_log'

To Reproduce

Run the following command:

python3 -m trainer.distribute --gpus=0,1 --script recipes/ljspeech/vits_tts/train_vits.py

Expected behavior

The model should train for the configured number of epochs using both GPUs.

Logs

['recipes/ljspeech/vits_tts/train_vits.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_09-175020', '--use_ddp=true', '--rank=0']
['recipes/ljspeech/vits_tts/train_vits.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_09-175020', '--use_ddp=true', '--rank=1']
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:False
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 | > Found 615 files in /root/TTS/recipes/ljspeech/LJSpeech-1.1
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:False
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 | > Found 615 files in /root/TTS/recipes/ljspeech/LJSpeech-1.1
 > Using CUDA: True
 > Number of GPUs: 2

 > Model has 83059180 parameters

 > EPOCH: 0/1000
 --> /root/TTS/recipes/ljspeech/vits_tts/vits_ljspeech-June-09-2022_05+50PM-c44e39d9

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: en-us
        | > phoneme backend: espeak

| > Number of instances : 609
> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: en-us
        | > phoneme backend: espeak
| > Number of instances : 609
 | > Preprocessing samples
 | > Preprocessing samples
 | > Max text length: 91
 | > Max text length: 91
 | > Min text length: 19
 | > Min text length: 19
 | > Avg text length: 38.89655172413793
 | 
 | > Avg text length: 38.89655172413793
 | 
 | > Max audio length: 298525.0
 | > Min audio length: 84509.0
 | > Max audio length: 298525.0
 | > Avg audio length: 137949.52545155992
 | > Num. instances discarded samples: 0
 | > Batch group size: 160.
 | > Min audio length: 84509.0
 | > Avg audio length: 137949.52545155992
 | > Num. instances discarded samples: 0
 | > Batch group size: 160.

 > TRAINING (2022-06-09 17:50:27) 
/root/miniconda3/envs/vits/lib/python3.10/site-packages/torch/functional.py:695: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755897462/work/aten/src/ATen/native/SpectralOps.cpp:798.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/root/miniconda3/envs/vits/lib/python3.10/site-packages/torch/functional.py:695: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755897462/work/aten/src/ATen/native/SpectralOps.cpp:798.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

   --> STEP: 0/10 -- GLOBAL_STEP: 0
     | > loss_disc: 6.00694  (6.00694)
     | > loss_disc_real_0: 0.99218  (0.99218)
     | > loss_disc_real_1: 1.03287  (1.03287)
     | > loss_disc_real_2: 0.98367  (0.98367)
     | > loss_disc_real_3: 1.03703  (1.03703)
     | > loss_disc_real_4: 0.98296  (0.98296)
     | > loss_disc_real_5: 0.97734  (0.97734)
     | > amp_scaler: 32768.00000  (32768.00000)
     | > loss_0: 6.00694  (6.00694)
     | > grad_norm_0: 0.00000  (0.00000)
     | > loss_gen: 6.00606  (6.00606)
     | > loss_kl: 196.91325  (196.91325)
     | > loss_feat: 0.48936  (0.48936)
     | > loss_mel: 100.56306  (100.56306)
     | > loss_duration: 1.93103  (1.93103)
     | > loss_1: 305.90277  (305.90277)
     | > grad_norm_1: 0.00000  (0.00000)
     | > current_lr_0: 0.00020 
     | > current_lr_1: 0.00020 
     | > step_time: 1.21090  (1.21092)
     | > loader_time: 0.88590  (0.88586)

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: en-us
        | > phoneme backend: espeak
| > Number of instances : 6

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: en-us
        | > phoneme backend: espeak
| > Number of instances : 6
 | > Preprocessing samples
 | > Max text length: 59
 | > Preprocessing samples
 | > Min text length: 24
 | > Max text length: 59
 | > Min text length: 24
 | > Avg text length: 37.333333333333336
 | 
 | > Max audio length: 160285.0
 | > Min audio length: 93213.0
 | > Avg audio length: 125639.66666666667
 | > Num. instances discarded samples: 0
 | > Avg text length: 37.333333333333336
 | > Batch group size: 0.
 | 
 | > Max audio length: 160285.0
 | > Min audio length: 93213.0
 | > Avg audio length: 125639.66666666667
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.

 > EVALUATION 

   --> STEP: 0
     | > loss_disc: 3.15863  (3.15863)
     | > loss_disc_real_0: 0.31609  (0.31609)
     | > loss_disc_real_1: 0.34492  (0.34492)
     | > loss_disc_real_2: 0.34172  (0.34172)
     | > loss_disc_real_3: 0.43289  (0.43289)
     | > loss_disc_real_4: 0.37139  (0.37139)
     | > loss_disc_real_5: 0.39517  (0.39517)
     | > loss_0: 3.15863  (3.15863)
     | > loss_gen: 2.24637  (2.24637)
     | > loss_kl: 38.07505  (38.07505)
     | > loss_feat: 0.38158  (0.38158)
     | > loss_mel: 102.72671  (102.72671)
     | > loss_duration: 2.08467  (2.08467)
     | > loss_1: 145.51439  (145.51439)

 | > Synthesizing test sentences.

/root/TTS/TTS/tts/models/vits.py:1394: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matricesor `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at  /opt/conda/conda-bld/pytorch_1646755897462/work/aten/src/ATen/native/TensorShape.cpp:2318.)
  test_figures["{}-alignment".format(idx)] = plot_alignment(alignment.T, output_fig=False)
 ! Run is removed from /root/TTS/recipes/ljspeech/vits_tts/vits_ljspeech-June-09-2022_05+50PM-c44e39d9
Traceback (most recent call last):
  File "/root/miniconda3/envs/vits/lib/python3.10/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/root/miniconda3/envs/vits/lib/python3.10/site-packages/trainer/trainer.py", line 1480, in _fit
    self.test_run()
  File "/root/miniconda3/envs/vits/lib/python3.10/site-packages/trainer/trainer.py", line 1416, in test_run
    self.model.test_log(test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done)
  File "/root/miniconda3/envs/vits/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'test_log'

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.4",
        "version": "#39-Ubuntu SMP Wed Jun 1 19:16:45 UTC 2022"
    }
}

Additional context

No response

planetrocke commented 2 years ago

I think this may be more relevant to the trainer repo; if so, please let me know. It seems as if the distributed (DDP) pieces of VITS training need some work?
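For context, the AttributeError is standard PyTorch behaviour: DistributedDataParallel (and DataParallel) wrap the model, and attribute lookup stops at the wrapper, so custom methods such as test_log are only reachable through the wrapped model at .module. A minimal sketch with a toy model (the class and method names are illustrative, not TTS code); DataParallel is used here only so the snippet runs without a process group, but DDP behaves the same way:

import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, x):
        return self.layer(x)

    def test_log(self):
        # custom method, not part of nn.Module
        return "logged"

model = ToyModel()
wrapped = nn.DataParallel(model)  # DistributedDataParallel hides attributes the same way

print(hasattr(wrapped, "test_log"))   # False: the wrapper does not forward custom attributes
print(wrapped.module.test_log())      # "logged": call through the wrapped model instead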

planetrocke commented 2 years ago

Update: I was able to get training to run on 2 GPUs by commenting out lines 1185-1189 pertaining to test_log. However, the GPUs behaved oddly: each epoch took about 50% longer, and the first card appeared to do the work while the second card sat pegged at nearly 100% utilization without making progress.

Any thoughts on this? I definitely would like multi-GPU to work with VITS.
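A less invasive workaround than deleting lines is the usual DDP unwrap pattern: call custom methods through the underlying module. A hedged sketch of what such a call site could look like (the commented names mirror the traceback above; the actual fix in coqui-ai/Trainer may be implemented differently):

import torch.nn as nn

def unwrap_model(model):
    # DDP and DataParallel keep the original model under .module
    if isinstance(model, (nn.parallel.DistributedDataParallel, nn.DataParallel)):
        return model.module
    return model

# instead of: self.model.test_log(test_outputs, self.dashboard_logger, ...)
# call:       unwrap_model(self.model).test_log(test_outputs, self.dashboard_logger, ...)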

Wikidepia commented 2 years ago

It's fixed in the latest version of Coqui Trainer. Try installing https://github.com/coqui-ai/Trainer with pip3 install -U git+https://github.com/coqui-ai/Trainer.

planetrocke commented 2 years ago

Oh sweet. Should I be using trainer by itself instead of as part of the TTS release?

Wikidepia commented 2 years ago

Yep, you need to uninstall trainer with pip and install the new one from GitHub.
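A sketch of the full swap, assuming the distribution is named trainer as mentioned above (the exact version reported will vary):

pip3 uninstall trainer                                      # remove the released package
pip3 install -U git+https://github.com/coqui-ai/Trainer     # install straight from GitHub
pip3 show trainer                                           # confirm which version is now installed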

planetrocke commented 2 years ago

Thank you both. Multi-GPU works now; however, I cannot figure out why the box keeps crashing. I have dual 3090s and thought they might be drawing too much power, so I set a forced power limit of 280 W and am now trying 250 W. I've also read that any PSU below a platinum rating can't handle the transient power spikes. I know I can't run at max power since the PSU is 1200 W, but the draw isn't coming close to that. Any input on this from 30-series folks would be awesome. Thanks again.
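For anyone checking the power side, nvidia-smi can set and report per-GPU power limits; the 250 W value below just mirrors the number above and is not a recommendation:

sudo nvidia-smi -pl 250       # cap each GPU at 250 W (does not persist across reboots)
nvidia-smi -q -d POWER        # report current draw plus default and enforced limits
nvidia-smi --query-gpu=power.draw,power.limit --format=csv -l 1   # poll draw vs. limit once per second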

planetrocke commented 2 years ago

OK, so even at 250 W it fails (as in, the system crashes with no logs, etc.). I'm hoping someone can provide input.

erogol commented 2 years ago

I guess the core problem is solved by reinstalling, so I'm closing this issue. Feel free to continue on the Discussions page.