Error in training Capacitron

manmay-nakhashi commented 2 years ago

Describe the bug

Traceback (most recent call last):
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1282, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1114, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 998, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 954, in _model_train_step
    return model.train_step(input_args)
  File "/home/manmay/TTS/TTS/tts/models/tacotron2.py", line 352, in train_step
    outputs = self.forward(text_input, text_lengths, mel_input, mel_lengths, aux_input)
  File "/home/manmay/TTS/TTS/tts/models/tacotron2.py", line 216, in forward
    encoder_outputs, capacitron_vae_outputs = self.compute_capacitron_VAE_embedding(
  File "/home/manmay/TTS/TTS/tts/models/base_tacotron.py", line 254, in compute_capacitron_VAE_embedding
    (VAE_outputs, posterior_distribution, prior_distribution, capacitron_beta,) = self.capacitron_vae_layer(
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/manmay/TTS/TTS/tts/layers/tacotron/capacitron_layers.py", line 67, in forward
    self.approximate_posterior_distribution = MVN(mu, torch.diag_embed(sigma))
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 128)) of distribution MultivariateNormal(loc: torch.Size([128, 128]), covariance_matrix: torch.Size([128, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], grad_fn=)

To Reproduce

config.txt

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-SXM4-40GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "",
        "python": "3.7.12",
        "version": "#1 SMP Debian 4.19.249-2 (2022-06-30)"
    }
}

Additional context

No response

manmay-nakhashi commented 2 years ago

reference_mels:  tensor([[[-2.8908, -2.9343, -2.3924,  ..., -4.0000, -4.0000, -4.0000],
         [-1.7576, -2.1274, -1.4687,  ..., -4.0000, -4.0000, -4.0000],
         [-1.5798, -0.8811,  0.6645,  ..., -4.0000, -4.0000, -4.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-2.4235, -3.1822, -3.2268,  ..., -4.0000, -4.0000, -4.0000],
         [-2.8200, -3.1357, -3.4585,  ..., -4.0000, -4.0000, -4.0000],
         [-2.3820, -3.2600, -3.9348,  ..., -3.7277, -3.9225, -4.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-2.9306, -2.6850, -2.3694,  ..., -4.0000, -4.0000, -4.0000],
         [-1.4642, -1.8053, -1.2172,  ..., -4.0000, -4.0000, -4.0000],
         [-0.5219, -0.1292, -0.3755,  ..., -4.0000, -4.0000, -4.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        ...,

        [[-1.6988, -0.4720,  0.4284,  ..., -4.0000, -4.0000, -4.0000],
         [-1.5658, -0.4315,  1.0523,  ..., -4.0000, -4.0000, -4.0000],
         [-2.2187, -0.9966,  1.1072,  ..., -4.0000, -4.0000, -4.0000],
         ...,
         [-0.7557, -1.9654, -1.7806,  ..., -4.0000, -4.0000, -4.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-2.3373, -2.8387, -3.2517,  ..., -4.0000, -4.0000, -4.0000],
         [-2.4404, -2.4244, -3.2637,  ..., -4.0000, -4.0000, -4.0000],
         [-2.6685, -2.3561, -2.6827,  ..., -4.0000, -4.0000, -4.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-3.3388, -3.7791, -3.5639,  ..., -4.0000, -4.0000, -4.0000],
         [-3.5841, -4.0000, -3.9926,  ..., -4.0000, -4.0000, -4.0000],
         [-4.0000, -4.0000, -4.0000,  ..., -3.9709, -4.0000, -4.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]],
       device='cuda:0')
      mel_lengths:  tensor([1049, 1050, 1048, 1051, 1052, 1057, 1052, 1052, 1058, 1050, 1050, 1049,
              1052, 1056, 1053, 1049, 1046, 1048, 1049, 1051, 1051, 1056, 1055, 1050, 
              1047, 1046, 1056, 1046, 1056, 1053, 1050, 1056, 1047, 1045, 1049, 1046, 
              1055, 1055, 1049, 1056, 1050, 1045, 1056, 1052, 1049, 1047, 1049, 1047, 
              1048, 1048, 1056, 1048, 1050, 1045, 1055, 1054, 1047, 1054, 1052, 1053, 
              1057, 1044, 1056, 1052, 1053, 1049, 1057, 1049, 1045, 1052, 1056, 1050, 
              1047, 1048, 1056, 1052, 1045, 1051, 1048, 1047, 1054, 1049, 1050, 1050, 
              1052, 1046, 1057, 1053, 1057, 1055, 1053, 1051, 1052, 1053, 1056, 1049, 
              1057, 1046, 1050, 1049, 1051, 1056, 1050, 1052, 1049, 1050, 1052, 1047, 
              1054, 1051, 1046, 1053, 1057, 1049, 1046, 1055, 1058, 1047, 1056, 1057, 
              1046, 1051, 1056, 1053, 1045, 1056, 1048, 1048], device='cuda:0')
    enc_out1:  tensor([[nan, nan, nan,  ..., nan, nan, nan],
            [nan, nan, nan,  ..., nan, nan, nan],
            [nan, nan, nan,  ..., nan, nan, nan],
            ...,
            [nan, nan, nan,  ..., nan, nan, nan],
            [nan, nan, nan,  ..., nan, nan, nan],
            [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
           grad_fn=<SelectBackward0>)
   enc_out2:  tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<CatBackward0>)
   enc_out3:  tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',

manmay-nakhashi commented 2 years ago

It seems the ReferenceEncoder is producing NaN values in the Capacitron layers.

manmay-nakhashi commented 2 years ago

Update:

enc_out = self.encoder(reference_mels, mel_lengths)
enc_out = torch.nan_to_num(enc_out)

Adding nan_to_num resolves the issue for now; I'm still monitoring my training.

lexkoro commented 2 years ago

If everything is NaN, won't nan_to_num just replace everything with zeros? I'm not sure it will fix the training.
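To make the concern concrete, here is a minimal plain-Python sketch of what happens when NaNs are zeroed in an output that is entirely NaN. The helper `nan_to_zero` is hypothetical and only mimics the NaN-to-zero default of `torch.nan_to_num`; it does not handle infinities, and the nested list stands in for the `enc_out` tensor shown above.

```python
import math

def nan_to_zero(values):
    """Replace every NaN in a nested list of floats with 0.0 (hypothetical helper)."""
    return [[0.0 if math.isnan(v) else v for v in row] for row in values]

# Stand-in for an encoder output in which every entry is NaN.
enc_out = [[float("nan")] * 4 for _ in range(3)]
cleaned = nan_to_zero(enc_out)

# No NaNs remain, but every row is now the same all-zero vector,
# so the output no longer carries any information about the input.
all_zero = all(v == 0.0 for row in cleaned for v in row)
print(all_zero)
```

The error goes away because the values are finite again, but for the affected samples the encoder output is reduced to a constant, which is why the workaround may hide the problem rather than fix it.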

manmay-nakhashi commented 2 years ago

@lexkoro it's coming from some of the samples. How do we skip those?

lexkoro commented 2 years ago

Remove them from the dataset? ^^
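One way to find which samples to remove is to run each one through the encoder on its own and record those whose output is not finite. The sketch below is plain Python under stated assumptions: `find_bad_samples`, `is_finite`, `toy_encode`, and the sample ids are all hypothetical, and `toy_encode` merely stands in for a call such as `self.encoder(reference_mels, mel_lengths)`. With real tensors the finiteness check would more likely be something like `torch.isfinite(out).all()`.

```python
import math

def is_finite(values):
    """Return True if every float in a nested list is finite."""
    return all(math.isfinite(v) for row in values for v in row)

def find_bad_samples(samples, encode):
    """Return the ids of samples whose encoded output contains NaN or inf.

    `samples` maps a sample id to its features; `encode` is any callable
    that maps features to a nested list of floats. Both are hypothetical.
    """
    bad = []
    for sample_id, features in samples.items():
        if not is_finite(encode(features)):
            bad.append(sample_id)
    return bad

# Toy encoder that misbehaves on one input, standing in for the real model.
def toy_encode(features):
    if len(features) > 3:
        return [[float("nan")]]
    return [[sum(features)]]

samples = {"a.wav": [0.1, 0.2], "b.wav": [0.1, 0.2, 0.3, 0.4], "c.wav": [1.0]}
bad_ids = find_bad_samples(samples, toy_encode)

# Keep only the samples that encoded cleanly.
kept = {k: v for k, v in samples.items() if k not in bad_ids}
print(bad_ids)  # ['b.wav']
```

Checking samples one at a time is slow, but it pins the failure to specific files instead of a whole batch. If no individual sample triggers the NaNs, the cause is more likely in the model state than in the data.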

jreus commented 2 years ago

Hey @manmay-nakhashi ~ just curious, but have you been able to train Capacitron using the latest Coqui (v0.8.0)? And is there any reason you are using an older version of Python (3.7.12)?

manmay-nakhashi commented 2 years ago

No reason. I have been using 3.8 and 3.9 with Coqui, so far with no problems.

erogol commented 2 years ago

@WeberJulian can you take a look into that?

WeberJulian commented 2 years ago

Training Capacitron is hard since it's pretty unstable. Try using the latest recipe, since it improved stability (at least for alignments). You can find it in the latest TTS version.
