dan-wells / fastpitch

NVIDIA's FastPitch, extracted from the DeepLearningExamples repository
BSD 3-Clause "New" or "Revised" License

Error with GPU-distributed training using --use-mas flag. #2

Closed: dan-ya closed this issue 7 months ago

dan-ya commented 7 months ago

I got another error when trying to train the model in the original FastPitch way, with --use-mas enabled, using 4 GPUs. It seems that it crashed at the end of the 4th epoch. I have no idea what the reason could be, as everything was fine for several epochs.

The error message is:

RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/PUHTI_TYKKY_8KAM643/miniconda/envs/env1/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/PUHTI_TYKKY_8KAM643/miniconda/envs/env1/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "fastpitch/data_function.py", line 493, in __call__
    dur_padded[i, :dur.size(0), :dur.size(1)] = dur
RuntimeError: The expanded size of the tensor (113) must match the existing size (114) at non-singleton dimension 1.  Target sizes: [599, 113].  Tensor sizes: [599, 114]

The last output lines were:

[2024-04-02 11:24:25] epoch    4 | iter 2286/2291 | loss  4.63 | mel  1.45 | dur  0.36 | pitch  0.40 | align  3.10 | took 0.29 s
[2024-04-02 11:24:25] epoch    4 | iter 2287/2291 | loss  4.37 | mel  1.24 | dur  0.30 | pitch  0.38 | align  3.06 | took 0.27 s
[2024-04-02 11:24:25] epoch    4 | iter 2288/2291 | loss  5.26 | mel  1.47 | dur  0.37 | pitch  0.51 | align  3.70 | took 0.33 s
[2024-04-02 11:24:26] epoch    4 | iter 2289/2291 | loss  5.24 | mel  1.43 | dur  0.34 | pitch  0.65 | align  3.71 | took 0.38 s
[2024-04-02 11:24:26] epoch    4 | iter 2290/2291 | loss  5.04 | mel  1.66 | dur  0.36 | pitch  0.46 | align  3.30 | took 0.34 s
[2024-04-02 11:24:27] epoch    4 | iter 2291/2291 | loss  5.15 | mel  1.29 | dur  0.36 | pitch  0.38 | align  3.79 | took 0.42 s
[2024-04-02 11:24:30] epoch    4 | avg train loss  5.17 | avg train mel  1.50 | avg train dur  0.37 | avg train pitch  0.49 | avg train align  3.58 | took 909.19 s
[2024-04-02 11:24:48] epoch    4 |   avg val loss  5.39 |   avg val mel  1.92 |   avg val dur  0.34 |   avg val pitch  0.49 |   avg val align  3.41 | took 18.46 s

The setup is the same:

Rank 2: Initializing distributed training
Rank 2: Done initializing distributed training
Rank 1: Initializing distributed training
Rank 1: Done initializing distributed training
Rank 3: Initializing distributed training
Rank 3: Done initializing distributed training
[2024-03-28 10:59:27] PARAMETER | output:  models
[2024-03-28 10:59:27] PARAMETER | dataset_path:  Expreso_LJS_dataset
[2024-03-28 10:59:27] PARAMETER | log_file: 
[2024-03-28 10:59:27] PARAMETER | epochs:  100
[2024-03-28 10:59:27] PARAMETER | epochs_per_checkpoint:  10
[2024-03-28 10:59:27] PARAMETER | checkpoint_path: 
[2024-03-28 10:59:27] PARAMETER | resume:  False
[2024-03-28 10:59:27] PARAMETER | seed:  1234
[2024-03-28 10:59:27] PARAMETER | amp:  False
[2024-03-28 10:59:27] PARAMETER | cuda:  True
[2024-03-28 10:59:27] PARAMETER | cudnn_benchmark:  False
[2024-03-28 10:59:27] PARAMETER | ema_decay:  0
[2024-03-28 10:59:27] PARAMETER | grad_accumulation:  1
[2024-03-28 10:59:27] PARAMETER | optimizer:  lamb
[2024-03-28 10:59:27] PARAMETER | learning_rate:  0.1
[2024-03-28 10:59:27] PARAMETER | weight_decay:  1e-06
[2024-03-28 10:59:27] PARAMETER | grad_clip_thresh:  1000.0
[2024-03-28 10:59:27] PARAMETER | batch_size:  4
[2024-03-28 10:59:27] PARAMETER | warmup_steps:  1000
[2024-03-28 10:59:27] PARAMETER | dur_predictor_loss_scale:  0.1
[2024-03-28 10:59:27] PARAMETER | pitch_predictor_loss_scale:  0.1
[2024-03-28 10:59:27] PARAMETER | attn_loss_scale:  1.0
[2024-03-28 10:59:27] PARAMETER | kl_loss_weight:  1.0
[2024-03-28 10:59:27] PARAMETER | kl_loss_start_epoch:  0
[2024-03-28 10:59:27] PARAMETER | kl_loss_warmup_epochs:  100
[2024-03-28 10:59:27] PARAMETER | training_files:  ['Expreso_LJS_dataset/expresso_fs_style.meta.train.txt']
[2024-03-28 10:59:27] PARAMETER | validation_files:  ['Expreso_LJS_dataset/expresso_fs_style.meta.val.txt']
[2024-03-28 10:59:27] PARAMETER | pitch_mean_std_file:  Expreso_LJS_dataset/pitches_stats__expresso_fs_style.json
[2024-03-28 10:59:27] PARAMETER | input_type:  char
[2024-03-28 10:59:27] PARAMETER | symbol_set:  english_basic
[2024-03-28 10:59:27] PARAMETER | text_cleaners:  []
[2024-03-28 10:59:27] PARAMETER | speaker_ids:  Expreso_LJS_dataset/speaker_ids.txt
[2024-03-28 10:59:27] PARAMETER | lang_ids:  Expreso_LJS_dataset/style_ids.txt
[2024-03-28 10:59:27] PARAMETER | hifigan:  
[2024-03-28 10:59:27] PARAMETER | hifigan_config:  hifigan/config/config_v1.json
[2024-03-28 10:59:27] PARAMETER | sampling_rate:  22050
[2024-03-28 10:59:27] PARAMETER | hop_length:  256
[2024-03-28 10:59:27] PARAMETER | audio_interval:  5
[2024-03-28 10:59:27] PARAMETER | master_addr:  localhost
[2024-03-28 10:59:27] PARAMETER | master_port:  13370
[2024-03-28 10:59:27] PARAMETER | n_mel_channels:  80
[2024-03-28 10:59:27] PARAMETER | n_symbols:  148
[2024-03-28 10:59:27] PARAMETER | padding_idx:  0
[2024-03-28 10:59:27] PARAMETER | symbols_embedding_dim:  384
[2024-03-28 10:59:27] PARAMETER | use_sepconv:  False
[2024-03-28 10:59:27] PARAMETER | use_mas:  True
[2024-03-28 10:59:27] PARAMETER | tvcgmm_k:  0
[2024-03-28 10:59:27] PARAMETER | in_fft_n_layers:  6
[2024-03-28 10:59:27] PARAMETER | in_fft_n_heads:  1
[2024-03-28 10:59:27] PARAMETER | in_fft_d_head:  64
[2024-03-28 10:59:27] PARAMETER | in_fft_conv1d_kernel_size:  3
[2024-03-28 10:59:27] PARAMETER | in_fft_conv1d_filter_size:  1536
[2024-03-28 10:59:27] PARAMETER | in_fft_sepconv:  False
[2024-03-28 10:59:27] PARAMETER | in_fft_output_size:  384
[2024-03-28 10:59:27] PARAMETER | p_in_fft_dropout:  0.1
[2024-03-28 10:59:27] PARAMETER | p_in_fft_dropatt:  0.1
[2024-03-28 10:59:27] PARAMETER | p_in_fft_dropemb:  0.0
[2024-03-28 10:59:27] PARAMETER | out_fft_n_layers:  6
[2024-03-28 10:59:27] PARAMETER | out_fft_n_heads:  1
[2024-03-28 10:59:27] PARAMETER | out_fft_d_head:  64
[2024-03-28 10:59:27] PARAMETER | out_fft_conv1d_kernel_size:  3
[2024-03-28 10:59:27] PARAMETER | out_fft_conv1d_filter_size:  1536
[2024-03-28 10:59:27] PARAMETER | out_fft_sepconv:  False
[2024-03-28 10:59:27] PARAMETER | out_fft_output_size:  384
[2024-03-28 10:59:27] PARAMETER | p_out_fft_dropout:  0.1
[2024-03-28 10:59:27] PARAMETER | p_out_fft_dropatt:  0.1
[2024-03-28 10:59:27] PARAMETER | p_out_fft_dropemb:  0.0
[2024-03-28 10:59:27] PARAMETER | dur_predictor_kernel_size:  3
[2024-03-28 10:59:27] PARAMETER | dur_predictor_filter_size:  256
[2024-03-28 10:59:27] PARAMETER | dur_predictor_sepconv:  False
[2024-03-28 10:59:27] PARAMETER | p_dur_predictor_dropout:  0.1
[2024-03-28 10:59:27] PARAMETER | dur_predictor_n_layers:  2
[2024-03-28 10:59:27] PARAMETER | pitch_predictor_kernel_size:  3
[2024-03-28 10:59:27] PARAMETER | pitch_predictor_filter_size:  256
[2024-03-28 10:59:27] PARAMETER | pitch_predictor_sepconv:  False
[2024-03-28 10:59:27] PARAMETER | p_pitch_predictor_dropout:  0.1
[2024-03-28 10:59:27] PARAMETER | pitch_predictor_n_layers:  2
[2024-03-28 10:59:27] PARAMETER | pitch_embedding_kernel_size:  3
[2024-03-28 10:59:27] PARAMETER | pitch_embedding_sepconv:  False
[2024-03-28 10:59:27] PARAMETER | speaker_cond:  ['pre']
[2024-03-28 10:59:27] PARAMETER | speaker_emb_dim:  384
[2024-03-28 10:59:27] PARAMETER | speaker_emb_weight:  1.0
[2024-03-28 10:59:27] PARAMETER | lang_cond:  ['pre']
[2024-03-28 10:59:27] PARAMETER | lang_emb_dim:  384
[2024-03-28 10:59:27] PARAMETER | lang_emb_weight:  1.0
[2024-03-28 10:59:27] PARAMETER | num_gpus:  4
[2024-03-28 10:59:27] PARAMETER | distributed_run:  True
[2024-03-28 10:59:27] PARAMETER | local_rank:  0
Rank 0: Initializing distributed training
Rank 0: Done initializing distributed training
dan-wells commented 7 months ago

I also don't see why there should suddenly be a problem at this stage of training... It looks like the number of input symbols for some utterance is larger than the expected maximum used to set up the padded text input tensor.

See line 462, where max_input_len is set to match the longest text input sequence in the batch, then line 488, where dur_padded is defined with shape (batch, max_target_len, max_input_len). We fill this in per batch item in the following loop, and in your error the loop is trying to insert a tensor with input length 114 when max_input_len is only 113.
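
For reference, the logic there is roughly like this (a simplified, untested sketch with placeholder names, not the actual data_function.py code):

    import torch

    def pad_durations(texts, durations):
        # texts: list of encoded-symbol tensors, shape (input_len,)
        # durations: list of per-utterance tensors, shape (target_len, input_len)
        max_input_len = max(t.size(0) for t in texts)      # cf. line 462
        max_target_len = max(d.size(0) for d in durations)
        dur_padded = torch.zeros(len(durations), max_target_len, max_input_len)  # cf. line 488
        for i, dur in enumerate(durations):
            # Raises the RuntimeError above when dur.size(1) > max_input_len,
            # i.e. the duration matrix has more symbol slots than the encoded text
            dur_padded[i, :dur.size(0), :dur.size(1)] = dur                      # cf. line 493
        return dur_padded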

You could confirm the problem and get the utterance ID and some other information by adding something like this assertion immediately before line 493 (you may need to fiddle with it, I haven't tested):

    assert dur.size(1) <= max_input_len, f"{fnames[i]}, {dur.shape}, {input_lengths[i]}, {max_input_len}"

With that, I would go away and check what the original text is for that utterance and what the output of your text processor's encode_text() method is, just in case there's anything unexpected, but I really don't know how that output could differ from one epoch to the next. Unfortunately, that will take a little manual setup, I think, i.e. instantiating an equivalent TextProcessor object given your configuration; I don't think everything you would need for debugging is available in the collate function that's throwing your error.
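
Untested, but once you know the utterance ID, something along these lines could help (tp stands for a TextProcessor instance built to match your training config; I've left its construction out because it depends on your setup):

    # Hypothetical check: compare the raw text of the suspect utterance against
    # what encode_text() produces, paying attention to leading/trailing characters.
    text = "..."                   # paste the transcript for the failing utterance here
    ids = tp.encode_text(text)     # tp: your configured TextProcessor (construction omitted)
    print(len(text), len(ids), repr(text[-5:]))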

dan-wells commented 7 months ago

It also seems strange to me to run into a problem with input lengths at all, because I think with your batch_size=4 distributed across 4 GPUs, each GPU is actually running with a batch of 1: the batch_size option is the effective batch size you want to run, and the details of dividing it per GPU are handled in the background (just in case you were reducing this number to account for your multiple GPUs).
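
Just to illustrate what I mean (only the arithmetic, not the repo's actual code):

    batch_size, num_gpus = 4, 4              # your settings
    per_gpu_batch = batch_size // num_gpus   # = 1, each rank processes one utterance per step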

dan-ya commented 7 months ago

Thank you very much for your answers; I will try to find out what is going on there.

dan-ya commented 7 months ago

Thank you for your help! I found an error in the data; it was my fault. One file from the LJS dataset (LJ036-0032) has a space as the last character in the text, and I mistakenly stripped it together with the '\n' symbol. Training seems to be running fine now, so I am closing this issue.
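
For anyone else hitting this, a made-up illustration of the off-by-one (the real transcript differs):

    # When the transcript ends in a space, strip() removes one more character
    # than rstrip('\n') would, so the text becomes one symbol shorter than expected.
    line = "he had been standing there. \n"   # invented example ending in a space
    print(len(line.rstrip('\n')))             # 28: newline removed, trailing space kept
    print(len(line.strip()))                  # 27: trailing space removed as well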