Validation loss becomes NaN

azman-i commented 3 years ago

Hi, I want to train this Flowtron model on a Bangla dataset, but the validation loss becomes NaN. What could be a possible solution for this error?

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
1: 30.651899338 2: nan 3: nan 4: nan 5: nan 6: nan 7: nan 8: nan 9: nan 10: nan 11: nan 12: nan 13: nan 14: nan 15: nan 16: nan 17: nan 18: nan 19: nan 20: nan 21: nan 22: nan 23: nan 24: nan 25: nan 26: nan 27: nan 28: nan 29: nan 30: nan 31: nan 32: nan 33: nan 34: nan 35: nan 36: nan 37: nan 38: nan 39: nan 40: nan 41: nan 42: nan
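
The `exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)` fragment in the log above looks like the tail of PyTorch's deprecation warning for the old `addcmul_` argument order, raised from the optimizer's second-moment update, so it is probably a warning rather than the error itself. That update may still explain why the loss never recovers once it turns nan: a single non-finite gradient is folded into the running average and stays there. A minimal scalar sketch, assuming one bad gradient (the function name and gradient values are made up for illustration; this is not the optimizer's actual code):

```python
import math


def update_second_moment(exp_avg_sq, grad, beta2=0.999):
    """Scalar version of the update exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad^2."""
    return beta2 * exp_avg_sq + (1.0 - beta2) * grad * grad


# Hypothetical gradient sequence: one NaN gradient at the second step.
gradients = [0.5, float("nan"), 0.1, 0.2]

state = 0.0
history = []
for grad in gradients:
    state = update_second_moment(state, grad)
    history.append(state)

for step, value in enumerate(history, start=1):
    status = "finite" if math.isfinite(value) else "non-finite"
    print(f"{step}: {value} ({status})")
```

If this is what happened, continuing the run will not help; the usual recovery is to restart from a checkpoint saved before the first nan.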

koayst commented 2 years ago

Hi,

I am having the same issue. I am training from scratch.

I first started training on a GPU with 12 GB of memory. With the batch size in config.json set to 5, training stopped with a GPU out-of-memory error at around 99000 steps.
I then reduced the batch size to 2. Training ran for a few days, and now it is showing nan (as shown below).

170725: nan 170726: nan 170727: nan 170728: nan 170729: nan 170730: nan 170731: nan 170732: nan 170733: nan 170734: nan 170735: nan 170736: nan
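
To decide which checkpoint to resume from, it helps to find the first step at which the logged loss stopped being finite. A rough sketch, assuming the log consists of `step: loss` pairs as in the output above, and assuming checkpoints are written at multiples of `iters_per_checkpoint` (the helper names are hypothetical and not part of Flowtron):

```python
import math
import re

# Matches "<step>: <value>" pairs such as "1: 30.651899338" or "2: nan".
STEP_LOSS = re.compile(r"(\d+):\s*(\S+)")


def parse_losses(log_text):
    """Return a list of (step, loss) tuples found in the log text."""
    pairs = []
    for step, value in STEP_LOSS.findall(log_text):
        try:
            pairs.append((int(step), float(value)))
        except ValueError:
            # Skip tokens that are not numbers (for example, stray log text).
            continue
    return pairs


def first_non_finite(pairs):
    """Return the first step whose loss is NaN or infinite, or None."""
    for step, loss in pairs:
        if not math.isfinite(loss):
            return step
    return None


def last_checkpoint_before(step, iters_per_checkpoint):
    """Largest multiple of iters_per_checkpoint strictly below `step`."""
    return (step - 1) // iters_per_checkpoint * iters_per_checkpoint


log = "1: 30.651899338 2: nan 3: nan"
bad_step = first_non_finite(parse_losses(log))
print("first non-finite step:", bad_step)
```

The excerpt above starts at step 170725, so the true first nan step would have to be found by running this over the full log.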

I wonder whether the "nan" is a signal that I should stop training, as mentioned in README.md under the header "Training from Scratch": train using the attention prior and the alignment loss (CTC loss) until the attention looks good, then proceed to step 3 and resume training without the attention prior once the alignments have stabilised.

Appreciate any help.

config.json

{
    "train_config": {
        "output_directory": "outdir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-3,
        "weight_decay": 1e-6,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 3,
        "seed": 1234,
        "checkpoint_path": "",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["speaker", "encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.01,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 10000
    },
    "data_config": {
        "training_files": "filelists/ljs_audiopaths_text_sid_train_filelist.txt",
        "validation_files": "filelists/ljs_audiopaths_text_sid_val_filelist.txt",
        "text_cleaners": ["flowtron_cleaners"],
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 1,
        "n_speaker_dim": 128,
        "n_text": 185,
        "n_text_dim": 512,
        "n_flows": 2,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
    }
}

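Two settings in the config above are common suspects when a loss turns nan: `"fp16_run": true` (half precision has a much smaller numeric range than fp32) and a relatively high `learning_rate`. Nothing in this thread confirms either as the cause, so the following is only a sketch of how one might derive a short debugging run that disables fp16 and lowers the learning rate. `make_debug_config` and the `_debug` suffix are hypothetical; the keys are the ones from the posted config.

```python
import copy


def make_debug_config(config, lr_divisor=10.0):
    """Return a copy of the config with fp16 off and a smaller learning rate."""
    debug = copy.deepcopy(config)  # leave the original config untouched
    train = debug["train_config"]
    train["fp16_run"] = False  # assumption: rule out half-precision overflow
    train["learning_rate"] = train["learning_rate"] / lr_divisor
    train["output_directory"] = train["output_directory"] + "_debug"
    return debug


# Only the keys relevant to the sketch, with the values posted above.
example = {
    "train_config": {
        "output_directory": "outdir",
        "learning_rate": 1e-3,
        "fp16_run": True,
    }
}

debug_config = make_debug_config(example)
print(debug_config["train_config"])
```

If the nan disappears in the debugging run, the two changes can be reverted one at a time to see which one mattered.
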
olahsymbo100 commented 2 years ago

I think it has to do with the GPU configuration. I was about to post the exact same issue but decided to give it one last try: I deleted the virtual environment and all CUDA/GPU configuration on my PC, then started all over.

Basically, I used these steps and it suddenly started working:

Uninstall any NVIDIA drivers that were installed from .run files or bundled with the CUDA Toolkit

Add PPA graphics-drivers:

sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update

Install NVIDIA driver from PPA:

sudo apt install nvidia-driver-470  # or nvidia-driver-495

Install CUDA:
```
sudo apt install nvidia-cuda-toolkit
```

Install Torch, Torchvision, and Torchaudio:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
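
After a reinstall like this, it may be worth confirming that the installed PyTorch build actually sees the GPU before starting a long training run. A small check, assuming PyTorch was installed with the pip command above (`describe_torch_cuda` is a hypothetical helper name; it returns early if `torch` cannot be imported):

```python
import importlib


def describe_torch_cuda():
    """Collect basic facts about the installed PyTorch build and its GPU access."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


report = describe_torch_cuda()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, training would fall back to the CPU or fail, which points to a driver or toolkit mismatch rather than a problem with the model.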

My GPU is NVIDIA GeForce GTX 1660 Ti

NVIDIA / flowtron

Validation loss becomes NaN #135