coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] ValueError Expected parameter loc with tacotron2_capacitron #2363

Closed dothehansa closed 1 year ago

dothehansa commented 1 year ago

Describe the bug

After training using the tacotron2_capacitron recipe I consistently run into this error:

ValueError: Expected parameter loc (Tensor of shape (64, 128)) of distribution MultivariateNormal(loc: torch.Size([64, 128]), covariance_matrix: torch.Size([64, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values: tensor([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)

To Reproduce

Running this command: CUDA_VISIBLE_DEVICES=0 python3 t2_capacitron.py. The same error occurs when resuming with --continue_path ~/Capacitron-Tacotron2-February-25-2023_09+15AM-0000000. I can restart the training and it will continue for a while before crashing again.

Expected behavior

The training run should continue without crashing.

Logs

see attached: [trainer_0_log.txt](https://github.com/coqui-ai/TTS/files/10831850/trainer_0_log.txt)

    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "~/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1141, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "~/.local/lib/python3.10/site-packages/trainer/trainer.py", line 1025, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "~/.local/lib/python3.10/site-packages/trainer/trainer.py", line 970, in _model_train_step
    return model.train_step(*input_args)
  File "~/TTS/TTS/TTS/tts/models/tacotron2.py", line 326, in train_step
    outputs = self.forward(text_input, text_lengths, mel_input, mel_lengths, aux_input)
  File "~/TTS/TTS/TTS/tts/models/tacotron2.py", line 191, in forward
    encoder_outputs, *capacitron_vae_outputs = self.compute_capacitron_VAE_embedding(
  File "~/TTS/TTS/TTS/tts/models/base_tacotron.py", line 260, in compute_capacitron_VAE_embedding
    ) = self.capacitron_vae_layer(
  File "~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "~/TTS/TTS/TTS/tts/layers/tacotron/capacitron_layers.py", line 67, in forward
    self.approximate_posterior_distribution = MVN(mu, torch.diag_embed(sigma))
  File "~/.local/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py", line 150, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "~/.local/lib/python3.10/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (64, 128)) of distribution MultivariateNormal(loc: torch.Size([64, 128]), covariance_matrix: torch.Size([64, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)
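
The traceback shows that `mu` is already all NaN by the time the distribution is built, so the values probably went bad at an earlier step. One way to narrow that down is to check the per-step loss values for non-finite entries before they propagate. The sketch below is a minimal, pure-Python illustration that assumes the losses are available as plain floats. The helper `non_finite_keys` and the loss names are hypothetical and not part of TTS or the trainer. For tensors, PyTorch's `torch.isfinite` offers the analogous check.

```python
import math


def non_finite_keys(loss_dict):
    """Return the names of entries whose value is NaN or infinite."""
    return [name for name, value in loss_dict.items() if not math.isfinite(value)]


# Hypothetical loss values for illustration only; the names are made up
# and are not taken from the library.
losses = {
    "decoder_loss": 0.42,
    "postnet_loss": float("nan"),
    "kl_term": float("inf"),
}

bad = non_finite_keys(losses)
if bad:
    # In a real run one might log the step number here and stop early.
    print("non-finite losses:", bad)
```

Logging the first step at which such a check fires can help tell a gradual divergence apart from a single bad batch.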

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 1070 Ti"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "TTS": "0.11.1",
        "numpy": "1.21.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023"
    }
}

Additional context

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig, CapacitronVAEConfig
from TTS.tts.configs.tacotron2_config import Tacotron2Config
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.tacotron2 import Tacotron2
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = '~/TTS/dev'
dataset_config = BaseDatasetConfig(formatter='ljspeech', meta_file_train="metadata.csv", path=output_path)

audio_config = BaseAudioConfig(
    sample_rate=22050,
    do_trim_silence=True,
    trim_db=60.0,
    signal_norm=False,
    mel_fmin=0.0,
    mel_fmax=11025,
    spec_gain=1.0,
    log_func="np.log",
    ref_level_db=20,
    preemphasis=0.0,
)

# Using the standard Capacitron config
capacitron_config = CapacitronVAEConfig(capacitron_VAE_loss_alpha=1.0, capacitron_capacity=50)

config = Tacotron2Config(
    run_name="Capacitron-Tacotron2",
    audio=audio_config,
    capacitron_vae=capacitron_config,
    use_capacitron_vae=True,
    batch_size=64,  # Tune this to your gpu
    max_audio_len=6 * 22050,  # Tune this to your gpu
    min_audio_len=0.5 * 22050,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    precompute_num_workers=4,
    run_eval=True,
    test_delay_epochs=100,
    ga_alpha=0.0,
    r=2,
    optimizer="CapacitronOptimizer",
    optimizer_params={"RAdam": {"betas": [0.9, 0.998], "weight_decay": 1e-6}, "SGD": {"lr": 1e-5, "momentum": 0.9}},
    attention_type="dynamic_convolution",
    grad_clip=0.0,  # Important! We overwrite the standard grad_clip with capacitron_grad_clip
    double_decoder_consistency=False,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phonemizer="espeak",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    stopnet_pos_weight=15,
    print_step=10,
    print_eval=True,
    mixed_precision=False,
    seq_len_norm=True,
    output_path=output_path,
    datasets=[dataset_config],
    lr=1e-3,
    lr_scheduler="StepwiseGradualLR",
    lr_scheduler_params={
        "gradual_learning_rates": [
            [0, 1e-3],
            [2e4, 5e-4],
            [4e5, 3e-4],
            [6e4, 1e-4],
            [8e4, 5e-5],
        ]
    },
    scheduler_after_epoch=False,  # scheduler doesn't work without this flag
    # Need to experiment with these below for capacitron
    loss_masking=False,
    decoder_loss_alpha=1.0,
    postnet_loss_alpha=1.0,
    postnet_diff_spec_alpha=0.0,
    decoder_diff_spec_alpha=0.0,
    decoder_ssim_alpha=0.0,
    postnet_ssim_alpha=0.0,
)

ap = AudioProcessor(**config.audio.to_dict())

tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Tacotron2(config, ap, tokenizer, speaker_manager=None)

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)

trainer.fit()
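
One detail in the posted config that may be worth a second look: the step thresholds in `gradual_learning_rates` read 0, 2e4, 4e5, 6e4, 8e4, which is not an ascending sequence. The sketch below is a small sanity check, assuming the thresholds are meant to increase. The helper `descending_pairs` is hypothetical, and the sketch does not model how `StepwiseGradualLR` actually looks up entries, so whether this affects training is an open question here.

```python
def descending_pairs(schedule):
    """Return (step, next_step) pairs where a threshold is not below its successor."""
    pairs = []
    for (step, _rate), (next_step, _next_rate) in zip(schedule, schedule[1:]):
        if step >= next_step:
            pairs.append((step, next_step))
    return pairs


# Values copied from the posted config.
gradual_learning_rates = [
    [0, 1e-3],
    [2e4, 5e-4],
    [4e5, 3e-4],
    [6e4, 1e-4],
    [8e4, 5e-5],
]

print(descending_pairs(gradual_learning_rates))  # [(400000.0, 60000.0)]
```

If the third threshold was intended to be 4e4, the list would be strictly increasing and the check would return an empty list.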


erogol commented 1 year ago

@WeberJulian can you check?

dothehansa commented 1 year ago

I've just tried running the same script in a Kaggle notebook as a check and hit the same error, so it could be independent of the environment?

WeberJulian commented 1 year ago

Yeah, that's a known issue: Capacitron is pretty unstable to train. I see you changed the original recipe, at least the max_audio_len. Try a larger batch size and different length boundaries; you can also experiment with the ga_loss and others.
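
To make the length boundaries concrete: the posted config gives `min_audio_len` and `max_audio_len` as sample counts at a 22050 Hz sample rate, i.e. 0.5 s and 6 s. The sketch below converts seconds to samples and shows which clips would fall inside those bounds. The helpers and clip durations are hypothetical, and treating both bounds as inclusive is an assumption, not a statement about how the TTS data loader filters.

```python
SAMPLE_RATE = 22050  # sample_rate from the posted audio config


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in seconds to a whole number of audio samples."""
    return int(round(seconds * sample_rate))


def within_bounds(num_samples, min_len, max_len):
    """Check whether a clip length lies inside the bounds (assumed inclusive)."""
    return min_len <= num_samples <= max_len


min_audio_len = seconds_to_samples(0.5)  # 11025 samples
max_audio_len = seconds_to_samples(6)    # 132300 samples

# Hypothetical clip durations in seconds, for illustration only.
clip_seconds = [0.3, 2.0, 5.9, 7.5]
kept = [
    s for s in clip_seconds
    if within_bounds(seconds_to_samples(s), min_audio_len, max_audio_len)
]
print(min_audio_len, max_audio_len, kept)  # 11025 132300 [2.0, 5.9]
```

Running the same kind of count over the real dataset shows how many clips a given pair of boundaries would keep, which is useful when trying different values.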

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.