espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

`CUDA error: device-side assert triggered` when finetuning multi-speaker VITS #4156

Closed godspirit00 closed 2 years ago

godspirit00 commented 2 years ago

Hi, I was trying to fine-tune the VCTK VITS model (kan-bayashi/vctk_tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave) with my own dataset, but I ran into this error:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [13,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [13,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [13,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [13,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [13,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(...repeats several times...)

/root/autodl-tmp/espnet/espnet2/layers/stft.py:166: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  olens = (ilens - self.n_fft) // self.hop_length + 1
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/autodl-tmp/espnet/espnet2/bin/gan_tts_train.py", line 22, in <module>
    main()
  File "/root/autodl-tmp/espnet/espnet2/bin/gan_tts_train.py", line 18, in main
    GANTTSTask.main(cmd=cmd)
  File "/root/autodl-tmp/espnet/espnet2/tasks/abs_task.py", line 1019, in main
    cls.main_worker(args)
  File "/root/autodl-tmp/espnet/espnet2/tasks/abs_task.py", line 1315, in main_worker
    cls.trainer.run(
  File "/root/autodl-tmp/espnet/espnet2/train/trainer.py", line 286, in run
    all_steps_are_invalid = cls.train_one_epoch(
  File "/root/autodl-tmp/espnet/espnet2/train/gan_trainer.py", line 160, in train_one_epoch
    retval = model(forward_generator=turn == "generator", **batch)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/espnet_model.py", line 162, in forward
    return self.tts(**batch)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/vits.py", line 315, in forward
    return self._forward_discrminator(
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/vits.py", line 478, in _forward_discrminator
    outs = self.generator(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/generator.py", line 338, in forward
    z, m_q, logs_q, y_mask = self.posterior_encoder(feats, feats_lengths, g=g)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/posterior_encoder.py", line 104, in forward
    make_non_pad_mask(x_lengths)
  File "/root/autodl-tmp/espnet/espnet/nets/pytorch_backend/nets_utils.py", line 269, in make_non_pad_mask
    return ~make_pad_mask(lengths, xs, length_dim)
  File "/root/autodl-tmp/espnet/espnet/nets/pytorch_backend/nets_utils.py", line 154, in make_pad_mask
    lengths = lengths.tolist()
RuntimeError: CUDA error: device-side assert triggered

I ran the training with

run.sh --stage 6 --stop-stage 6 --tts_task gan_tts --feats_extract linear_spectrogram --feats_normalize none --train_config ./conf/tuning/finetune_vits.yaml --use-sid true --train_args "--init_param ~/autodl-tmp/6996f03d3eb58d62833eeb65ba6feb5b/exp/tts_train_multi_spk_vits_raw_phn_tacotron_g2p_en_no_space/train.total_count.ave_10best.pth:tts:tts:tts.generator.global_emb.weight" --tag syb_multi-ft-from-vctk-vits

My config is

##########################################################
#                  TTS MODEL SETTING                     #
##########################################################
tts: vits
tts_conf:
    # generator related
    generator_type: vits_generator
    generator_params:
        hidden_channels: 192
        spks: 5
        global_channels: 256
        segment_size: 32
        text_encoder_attention_heads: 2
        text_encoder_ffn_expand: 4
        text_encoder_blocks: 6
        text_encoder_positionwise_layer_type: "conv1d"
        text_encoder_positionwise_conv_kernel_size: 3
        text_encoder_positional_encoding_layer_type: "rel_pos"
        text_encoder_self_attention_layer_type: "rel_selfattn"
        text_encoder_activation_type: "swish"
        text_encoder_normalize_before: true
        text_encoder_dropout_rate: 0.1
        text_encoder_positional_dropout_rate: 0.0
        text_encoder_attention_dropout_rate: 0.1
        use_macaron_style_in_text_encoder: true
        # NOTE(kan-bayashi): Conformer conv requires BatchNorm1d, which causes
        #   errors with multiple GPUs in pytorch 1.7.1. Therefore, we disable
        #   it by default. We need to consider an alternative normalization,
        #   or a different pytorch version may solve this issue.
        use_conformer_conv_in_text_encoder: false
        text_encoder_conformer_kernel_size: -1
        decoder_kernel_size: 7
        decoder_channels: 512
        decoder_upsample_scales: [8, 8, 2, 2]
        decoder_upsample_kernel_sizes: [16, 16, 4, 4]
        decoder_resblock_kernel_sizes: [3, 7, 11]
        decoder_resblock_dilations: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
        use_weight_norm_in_decoder: true
        posterior_encoder_kernel_size: 5
        posterior_encoder_layers: 16
        posterior_encoder_stacks: 1
        posterior_encoder_base_dilation: 1
        posterior_encoder_dropout_rate: 0.0
        use_weight_norm_in_posterior_encoder: true
        flow_flows: 4
        flow_kernel_size: 5
        flow_base_dilation: 1
        flow_layers: 4
        flow_dropout_rate: 0.0
        use_weight_norm_in_flow: true
        use_only_mean_in_flow: true
        stochastic_duration_predictor_kernel_size: 3
        stochastic_duration_predictor_dropout_rate: 0.5
        stochastic_duration_predictor_flows: 4
        stochastic_duration_predictor_dds_conv_layers: 3
    # discriminator related
    discriminator_type: hifigan_multi_scale_multi_period_discriminator
    discriminator_params:
        scales: 1
        scale_downsample_pooling: "AvgPool1d"
        scale_downsample_pooling_params:
            kernel_size: 4
            stride: 2
            padding: 2
        scale_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes: [15, 41, 5, 3]
            channels: 128
            max_downsample_channels: 1024
            max_groups: 16
            bias: True
            downsample_scales: [2, 2, 4, 4, 1]
            nonlinear_activation: "LeakyReLU"
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: True
            use_spectral_norm: False
        follow_official_norm: False
        periods: [2, 3, 5, 7, 11]
        period_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes: [5, 3]
            channels: 32
            downsample_scales: [3, 3, 3, 3, 1]
            max_downsample_channels: 1024
            bias: True
            nonlinear_activation: "LeakyReLU"
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: True
            use_spectral_norm: False
    # loss function related
    generator_adv_loss_params:
        average_by_discriminators: false # whether to average loss value by #discriminators
        loss_type: mse                   # loss type, "mse" or "hinge"
    discriminator_adv_loss_params:
        average_by_discriminators: false # whether to average loss value by #discriminators
        loss_type: mse                   # loss type, "mse" or "hinge"
    feat_match_loss_params:
        average_by_discriminators: false # whether to average loss value by #discriminators
        average_by_layers: false         # whether to average loss value by #layers of each discriminator
        include_final_outputs: true      # whether to include final outputs for loss calculation
    mel_loss_params:
        fs: 22050          # must be the same as the training data
        n_fft: 1024        # fft points
        hop_length: 256    # hop size
        win_length: null   # window length
        window: hann       # window type
        n_mels: 80         # number of Mel basis
        fmin: 0            # minimum frequency for Mel basis
        fmax: null         # maximum frequency for Mel basis
        log_base: null     # null represent natural log
    lambda_adv: 1.0        # loss scaling coefficient for adversarial loss
    lambda_mel: 45.0       # loss scaling coefficient for Mel loss
    lambda_feat_match: 2.0 # loss scaling coefficient for feat match loss
    lambda_dur: 1.0        # loss scaling coefficient for duration loss
    lambda_kl: 1.0         # loss scaling coefficient for KL divergence loss
    # others
    sampling_rate: 22050          # needed in the inference for saving wav
    cache_generator_outputs: true # whether to cache generator outputs in the training

##########################################################
#            OPTIMIZER & SCHEDULER SETTING               #
##########################################################
# optimizer setting for generator
optim: adamw
optim_conf:
    lr: 1.0e-4
    betas: [0.8, 0.99]
    eps: 1.0e-9
    weight_decay: 0.0
scheduler: exponentiallr
scheduler_conf:
    gamma: 0.999875
# optimizer setting for discriminator
optim2: adamw
optim2_conf:
    lr: 1.0e-4
    betas: [0.8, 0.99]
    eps: 1.0e-9
    weight_decay: 0.0
scheduler2: exponentiallr
scheduler2_conf:
    gamma: 0.999875
generator_first: false # whether to start updating generator first

##########################################################
#                OTHER TRAINING SETTING                  #
##########################################################
num_iters_per_epoch: 1000 # number of iterations per epoch
max_epoch: 100            # number of epochs
accum_grad: 1             # gradient accumulation
batch_bins: 5000000       # batch bins (feats_type=raw)
batch_type: numel         # how to make batch
grad_clip: -1             # gradient clipping norm
grad_noise: false         # whether to use gradient noise injection
sort_in_batch: descending # how to sort data in making batch
sort_batch: descending    # how to sort created batches
num_workers: 4            # number of workers of data loader
use_amp: false            # whether to use pytorch amp
log_interval: 50          # log interval in iterations
keep_nbest_models: 10     # number of models to keep
num_att_plot: 3           # number of attention figures to be saved in every check
seed: 777                 # random seed number
patience: null            # patience for early stopping
unused_parameters: true   # needed for multi gpu case
best_model_criterion:     # criterion to save the best models
-   - train
    - total_count
    - max
cudnn_deterministic: false # setting to false accelerates the training speed but makes it non-deterministic
                           # in the case of GAN-TTS training, we strongly recommend setting to false
cudnn_benchmark: false     # setting to true might accelerate the training speed but sometimes decreases it
                           # therefore, we set it to false as a default (recommend trying both cases)

What am I missing? Thank you!

kan-bayashi commented 2 years ago

Sorry, I have no clear answer for this error. Let me check a few points:

  • Does it work with the existing recipe?

godspirit00 commented 2 years ago
  • Does it work with the existing recipe?

Should I fine-tune the VCTK VITS model with the VCTK dataset, or is there an existing recipe for fine-tuning a multi-speaker VITS model with a multi-speaker dataset?

godspirit00 commented 2 years ago

(Sorry, I pressed the wrong button...) I have tried fine-tuning the LJ FastSpeech 2 model with the data of one of the speakers from my dataset, and it works. Maybe there's something wrong with my multi-speaker dataset? But how can I find out what's wrong? Thank you!

kan-bayashi commented 2 years ago

Should I fine-tune the VCTK VITS model with the VCTK dataset, or is there an existing recipe for fine-tuning a multi-speaker VITS model with a multi-speaker dataset?

I think this is not related to fine-tuning. The error will probably happen even without --init_param.

I have tried fine-tuning the LJ fastspeech 2 model with the data of one of the speakers from my dataset, and it works.

OK. Have you ever tried VITS?

Maybe there's something wrong with my multi speaker dataset? But how can I find out what's wrong?

Does your dataset include only mono audio? If it contains a mix of mono and stereo, that will be a problem.
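
For example, one quick way to verify this (a minimal sketch using Python's standard-library wave module, assuming PCM WAV files under a placeholder data directory):

import wave
from pathlib import Path

# "downloads/my_corpus" is a placeholder; point it at your dataset root.
for path in sorted(Path("downloads/my_corpus").rglob("*.wav")):
    with wave.open(str(path)) as w:
        # the recipe expects mono input, so flag anything with >1 channel
        if w.getnchannels() != 1:
            print(f"{path}: {w.getnchannels()} channels")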

godspirit00 commented 2 years ago

Does your dataset include only mono audio? If it contains a mix of mono and stereo, that will be a problem.

I checked all the data and they are all mono.

Have you ever tried VITS?

Not yet. I will try fine-tuning the LJ VITS model with one speaker's data as well, and see whether the problem is specific to VITS.

godspirit00 commented 2 years ago

@kan-bayashi I've started fine-tuning the LJ VITS model (kan-bayashi/ljspeech_tts_train_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave) with the data of one of the speakers in my dataset. It has run for several epochs now with no errors (except an OOM, which I fixed by lowering batch_bins).

So I guess the error I hit before comes from the multi-speaker part.

kan-bayashi commented 2 years ago

Thank you for your kind report. Then, let us check each point.

  1. Run w/o --init_param:
run.sh --stage 6 --stop-stage 6 --tts_task gan_tts --feats_extract linear_spectrogram --feats_normalize none --train_config ./conf/tuning/finetune_vits.yaml --use-sid true --tag debug_1
  2. Run w/o SID (you can run it with the multi-speaker data for debugging purposes):
run.sh --stage 6 --stop-stage 6 --tts_task gan_tts --feats_extract linear_spectrogram --feats_normalize none --train_config ./conf/tuning/<single_speaker_vits>.yaml --tag debug_2

kamo-naoyuki commented 2 years ago

Running on CPU might provide a more informative message (I believe this is a bug on the espnet side).
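
To illustrate (a minimal sketch, not from this thread): an out-of-range index raises a precise error on CPU, while on CUDA the kernel fails asynchronously and only a generic assert surfaces.

import torch

x = torch.randn(4, 8)
idx = torch.tensor([5])  # out of bounds for dim 0 (size 4)

x[idx]                # CPU: IndexError: index 5 is out of bounds for dimension 0 with size 4
x.cuda()[idx.cuda()]  # CUDA: generic "device-side assert triggered", typically
                      # reported later, at an unrelated line in the traceback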

godspirit00 commented 2 years ago
  1. Run w/o --init_param

This produces the error too.

2. Run w/o SID

This can start without error.

So I guess the problem comes from the multi-speaker part.

Running on CPU might provide a more informative message

I am running on a cloud GPU server, so I tried running in no-GPU mode, but it said RuntimeError: No CUDA GPUs are available.

kan-bayashi commented 2 years ago

--ngpu 0 does not work in the GPU env?

godspirit00 commented 2 years ago

--ngpu 0 does not work in the GPU env?

With --ngpu 0, the error message is:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/autodl-tmp/espnet/espnet2/bin/gan_tts_train.py", line 22, in <module>
    main()
  File "/root/autodl-tmp/espnet/espnet2/bin/gan_tts_train.py", line 18, in main
    GANTTSTask.main(cmd=cmd)
  File "/root/autodl-tmp/espnet/espnet2/tasks/abs_task.py", line 1019, in main
    cls.main_worker(args)
  File "/root/autodl-tmp/espnet/espnet2/tasks/abs_task.py", line 1315, in main_worker
    cls.trainer.run(
  File "/root/autodl-tmp/espnet/espnet2/train/trainer.py", line 286, in run
    all_steps_are_invalid = cls.train_one_epoch(
  File "/root/autodl-tmp/espnet/espnet2/train/gan_trainer.py", line 160, in train_one_epoch
    retval = model(forward_generator=turn == "generator", **batch)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/espnet_model.py", line 162, in forward
    return self.tts(**batch)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/vits.py", line 315, in forward
    return self._forward_discrminator(
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/vits.py", line 478, in _forward_discrminator
    outs = self.generator(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/espnet/espnet2/gan_tts/vits/generator.py", line 321, in forward
    g = self.global_emb(sids.view(-1)).unsqueeze(-1)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

kan-bayashi commented 2 years ago

This is the real error. You set spks: 5 in the yaml, but maybe the sid is not within 1-5. Please check dump/raw/org/tr_no_dev/spk2sid (the path may be different).
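
A quick sanity check (a sketch assuming the usual two-column <speaker> <sid> format of spk2sid):

# path from above; adjust to your dump directory
with open("dump/raw/org/tr_no_dev/spk2sid") as f:
    sids = [int(line.split()[-1]) for line in f if line.strip()]

print("max sid:", max(sids))
# The speaker embedding table is indexed from 0, so generator_params.spks
# must be at least max(sids) + 1.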

godspirit00 commented 2 years ago

That solves the problem! I set spks: 6 and the training starts now. Thank you so much!
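
For reference, this is consistent with the CPU traceback above: with five speakers and sids apparently running 1-5, the global speaker embedding needs six rows (indices 0-5). A minimal sketch:

import torch

global_emb = torch.nn.Embedding(6, 256)  # spks: 6, global_channels: 256
g = global_emb(torch.tensor([5])).unsqueeze(-1)  # sid 5 is now in range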