Open azman-i opened 3 years ago
Hi,
I am having the same issue. I am training from scratch.
I first trained on a GPU with 12 GB of memory. With the batch size in config.json set to 5, training stopped with an out-of-GPU-memory error at around 99,000 steps.
Then I switched the batch size to 2. It trained for a few days, and now it only prints nan (as shown below).
170725: nan
170726: nan
170727: nan
170728: nan
170729: nan
170730: nan
170731: nan
170732: nan
170733: nan
170734: nan
170735: nan
170736: nan
Appreciate any help.
config.json:
{
"train_config": {
"output_directory": "outdir",
"epochs": 10000000,
"optim_algo": "RAdam",
"learning_rate": 1e-3,
"weight_decay": 1e-6,
"grad_clip_val": 1,
"sigma": 1.0,
"iters_per_checkpoint": 1000,
"batch_size": 3,
"seed": 1234,
"checkpoint_path": "",
"ignore_layers": [],
"finetune_layers": [],
"include_layers": ["speaker", "encoder", "embedding"],
"warmstart_checkpoint_path": "",
"with_tensorboard": true,
"fp16_run": true,
"gate_loss": true,
"use_ctc_loss": true,
"ctc_loss_weight": 0.01,
"blank_logprob": -8,
"ctc_loss_start_iter": 10000
},
"data_config": {
"training_files": "filelists/ljs_audiopaths_text_sid_train_filelist.txt",
"validation_files": "filelists/ljs_audiopaths_text_sid_val_filelist.txt",
"text_cleaners": ["flowtron_cleaners"],
"p_arpabet": 0.5,
"cmudict_path": "data/cmudict_dictionary",
"sampling_rate": 22050,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"mel_fmin": 0.0,
"mel_fmax": 8000.0,
"max_wav_value": 32768.0,
"use_attn_prior": true,
"attn_prior_threshold": 0.0,
"prior_cache_path": "/attention_prior_cache",
"betab_scaling_factor": 1.0,
"keep_ambiguous": false
},
"dist_config": {
"dist_backend": "nccl",
"dist_url": "tcp://localhost:54321"
},
"model_config": {
"n_speakers": 1,
"n_speaker_dim": 128,
"n_text": 185,
"n_text_dim": 512,
"n_flows": 2,
"n_mel_channels": 80,
"n_attn_channels": 640,
"n_hidden": 1024,
"n_lstm_layers": 2,
"mel_encoder_n_hidden": 512,
"n_components": 0,
"mean_scale": 0.0,
"fixed_gaussian": true,
"dummy_speaker_embedding": false,
"use_gate_layer": true,
"use_cumm_attention": false
}
}
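When a run like the one configured above ends in nan, it can help to list the settings that are worth double-checking first. The following is only a rough sketch: the function name `flag_settings` and the thresholds in it are assumptions made for illustration, not something taken from the Flowtron code or its authors. It reads the `train_config` keys shown above and returns plain-text notes.

```python
import json


def flag_settings(config):
    """Return notes about train_config values that may be worth revisiting.

    The checks are illustrative heuristics (assumptions), not official advice.
    """
    train = config.get("train_config", {})
    notes = []
    # Half precision has a narrow numeric range, so overflow can produce nan.
    if train.get("fp16_run"):
        notes.append("fp16_run is true: try a run with fp16_run set to false")
    # The 1e-3 threshold is an arbitrary cut-off chosen for this sketch.
    lr = train.get("learning_rate")
    if lr is not None and lr >= 1e-3:
        notes.append(f"learning_rate is {lr}: consider trying a smaller value")
    return notes


# A shortened stand-in for the config.json posted above.
sample = json.loads(
    '{"train_config": {"fp16_run": true, "learning_rate": 1e-3, "batch_size": 3}}'
)
for note in flag_settings(sample):
    print(note)
```

To check a real file, load it with `json.load(open("config.json"))` and pass the result to `flag_settings`.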
I think it has to do with the GPU configuration. I was about to post the exact same issue but decided to give it one last try: I deleted the virtual environment and all CUDA/GPU configuration on my PC, then started over.
Basically, I used these steps and it suddenly started working:
Uninstall the NVIDIA drivers installed from .run files or the driver bundled with the CUDA Toolkit.
Add the graphics-drivers PPA:
sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update
Install NVIDIA driver from PPA:
sudo apt install nvidia-driver-470  # or nvidia-driver-495
Install CUDA:
sudo apt install nvidia-cuda-toolkit
Install Torch, Torchvision, and Torchaudio:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
My GPU is NVIDIA GeForce GTX 1660 Ti
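After reinstalling the driver, CUDA, and PyTorch as in the steps above, a quick check from Python can confirm that PyTorch actually sees the GPU. This is a minimal sketch, and the helper name `cuda_report` is made up for this example. It only uses `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.get_device_name()`, and it falls back gracefully if PyTorch is not installed.

```python
def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA installation."""
    report = {
        "torch_installed": False,
        "cuda_build": None,       # CUDA version PyTorch was built against
        "cuda_available": False,  # whether PyTorch can reach a GPU right now
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so there is nothing more to inspect.
        return report
    report["torch_installed"] = True
    report["cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        report["cuda_available"] = True
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


print(cuda_report())
```

If `cuda_available` is false even though the driver is installed, the PyTorch wheel and the driver may not match, which is the kind of mismatch the reinstall above is meant to clear.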
Hi, I want to run this Flowtron model on a Bangla dataset, but the validation loss becomes NaN. What could be a possible solution for this error?
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
1: 30.651899338
2: nan
3: nan
4: nan
5: nan
6: nan
7: nan
8: nan
9: nan
10: nan
11: nan
12: nan
13: nan
14: nan
15: nan
16: nan
17: nan
18: nan
19: nan
20: nan
21: nan
22: nan
23: nan
24: nan
25: nan
26: nan
27: nan
28: nan
29: nan
30: nan
31: nan
32: nan
33: nan
34: nan
35: nan
36: nan
37: nan
38: nan
39: nan
40: nan
41: nan
42: nan
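For logs in the `step: value` form shown above, it can be useful to find the exact step at which the value first stops being finite, since a jump to nan on step 2 may point to a different cause than one after 170,000 steps. Below is a small sketch that uses only the standard library. The function name `first_nonfinite_step` is hypothetical, and the parser assumes every entry looks like `<integer>: <value>`.

```python
import math
import re

# One log entry: an integer step, a colon, then the logged value.
ENTRY = re.compile(r"(\d+):\s*(\S+)")


def first_nonfinite_step(log_text):
    """Return the first step whose value is nan or inf, or None if all are finite."""
    for step, value in ENTRY.findall(log_text):
        try:
            loss = float(value)  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # skip anything that is not a number
        if not math.isfinite(loss):
            return int(step)
    return None


sample_log = "1: 30.651899338 2: nan 3: nan"
print(first_nonfinite_step(sample_log))
```

The same function works on a whole log file read into a string, whether the entries are on one line or one per line.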