I am having the same issue while training from scratch.
I first trained on a GPU with 12 GB of memory. With the batch size in config.json set to 5, training stopped with an out-of-memory error on the GPU at around 99,000 steps.
I then reduced the batch size to 2. Training ran for a few days, but the loss is now nan, as shown below:
170725: nan
170726: nan
170727: nan
Appreciate any help.
{
    "train_config": {
        "output_directory": "outdir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-3,
        "weight_decay": 1e-6,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 3,
        "seed": 1234,
        "checkpoint_path": "",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["speaker", "encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.01,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 10000
    },
    "data_config": {
        "training_files": "filelists/ljs_audiopaths_text_sid_train_filelist.txt",
        "validation_files": "filelists/ljs_audiopaths_text_sid_val_filelist.txt",
        "text_cleaners": ["flowtron_cleaners"],
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 1,
        "n_speaker_dim": 128,
        "n_text": 185,
        "n_text_dim": 512,
        "n_flows": 2,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
    }
}
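For anyone hitting the same thing, it may help to find the first step at which the loss turned nan and work out which checkpoint precedes it. The following is only a rough sketch: it assumes the training log uses the `step: loss` format shown above and that checkpoints are written at multiples of `iters_per_checkpoint` (1000 in this config). The helper names `first_nan_step` and `last_checkpoint_before` are made up for illustration and are not part of Flowtron.

```python
import re

# Matches entries such as "170725: nan" or "1: 30.651899338".
LOG_ENTRY = re.compile(
    r"(\d+):\s*(nan|[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", re.IGNORECASE
)


def first_nan_step(log_text):
    """Return the first logged step whose loss is nan, or None if there is none."""
    for match in LOG_ENTRY.finditer(log_text):
        step, value = match.groups()
        if value.lower() == "nan":
            return int(step)
    return None


def last_checkpoint_before(step, iters_per_checkpoint):
    """Return the largest multiple of iters_per_checkpoint strictly below step."""
    return ((step - 1) // iters_per_checkpoint) * iters_per_checkpoint


if __name__ == "__main__":
    excerpt = "170725: nan 170726: nan 170727: nan"  # log excerpt from this post
    step = first_nan_step(excerpt)
    print("first nan step in excerpt:", step)
    print("checkpoint iteration to try:", last_checkpoint_before(step, 1000))
```

The excerpt only covers the last few steps, so on a full log the first nan step may be earlier than the one found here.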
I think it has to do with the GPU configuration. I was about to post the exact same issue but decided to give it one last try: I deleted the virtual environment and all CUDA/GPU configuration on my PC, then started over.
Basically, I followed these steps and it suddenly started working:
1. Uninstall the NVIDIA drivers installed from .run files or bundled with the CUDA Toolkit.
2. Add the graphics-drivers PPA:
    sudo add-apt-repository ppa:graphics-drivers/ppa --yes
    sudo apt update
3. Install the NVIDIA driver from the PPA:
    sudo apt install nvidia-driver-470  # or nvidia-driver-495
4. Install CUDA:
    sudo apt install nvidia-cuda-toolkit
5. Install Torch, Torchvision, and Torchaudio:
    pip3 install torch torchvision torchaudio --extra-index-url
My GPU is an NVIDIA GeForce GTX 1660 Ti.
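After reinstalling the driver, CUDA, and PyTorch as described above, a quick sanity check along these lines can confirm that PyTorch actually sees the GPU before a long training run is started. This is a minimal sketch, not part of Flowtron; `describe_gpu_setup` is a hypothetical helper name, and it only uses standard PyTorch calls (`torch.cuda.is_available`, `torch.version.cuda`, `torch.cuda.get_device_name`).

```python
def describe_gpu_setup():
    """Collect basic facts about the PyTorch/CUDA installation into a dict."""
    info = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_setup().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back False, or `cuda_build` is None, the installed PyTorch build is probably not matched to the driver and toolkit, which would be worth fixing before looking at the model itself.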
Hi, I want to run this Flowtron model on a Bangla dataset, but the validation loss becomes NaN. What could be a possible solution for this error?
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
1: 30.651899338
2: nan
3: nan
4: nan
5: nan
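Since the loss here goes to nan as early as step 2, one way to narrow it down might be to stop at the first non-finite loss, so the batch that triggered it can be inspected instead of training on with nan values. The sketch below is a generic guard, not something taken from Flowtron's training script; `check_loss` is a hypothetical helper that would be called with the scalar loss (for example `loss.item()`) at each step.

```python
import math


def check_loss(step, loss_value):
    """Return loss_value unchanged, or raise if it is nan or infinite."""
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value} at step {step}"
        )
    return loss_value


if __name__ == "__main__":
    # Values taken from the log above: step 1 is finite, step 2 is nan.
    check_loss(1, 30.651899338)
    try:
        check_loss(2, float("nan"))
    except FloatingPointError as err:
        print("stopped:", err)
```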