NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[WaveGlow/Pytorch] Cannot start training from last checkpoint #809

Open IvanRubanov opened 3 years ago

IvanRubanov commented 3 years ago

Related to WaveGlow/Pytorch

Trying to continue training from a checkpoint crashes the Python script. I start training with the following command:

! bash scripts/train_waveglow.sh

I get the following output:

DLL 2021-01-12 21:51:52.587361 - PARAMETER output : /home/user/checkpoints/ 
DLL 2021-01-12 21:51:52.587441 - PARAMETER dataset_path : /home/user/datasets/finnish-single-speaker-speech-dataset/data 
DLL 2021-01-12 21:51:52.587476 - PARAMETER model_name : WaveGlow 
DLL 2021-01-12 21:51:52.587517 - PARAMETER log_file : /home/user/logs/nvlog.json 
DLL 2021-01-12 21:51:52.587567 - PARAMETER anneal_steps : None 
DLL 2021-01-12 21:51:52.587619 - PARAMETER anneal_factor : 0.1 
DLL 2021-01-12 21:51:52.587669 - PARAMETER config_file : None 
DLL 2021-01-12 21:51:52.587722 - PARAMETER epochs : 1501 
DLL 2021-01-12 21:51:52.587786 - PARAMETER epochs_per_checkpoint : 1 
DLL 2021-01-12 21:51:52.587838 - PARAMETER checkpoint_path :  
DLL 2021-01-12 21:51:52.587888 - PARAMETER resume_from_last : True 
DLL 2021-01-12 21:51:52.587940 - PARAMETER dynamic_loss_scaling : True 
DLL 2021-01-12 21:51:52.587990 - PARAMETER amp : False 
DLL 2021-01-12 21:51:52.588041 - PARAMETER cudnn_enabled : True 
DLL 2021-01-12 21:51:52.588091 - PARAMETER cudnn_benchmark : True 
DLL 2021-01-12 21:51:52.588143 - PARAMETER disable_uniform_initialize_bn_weight : False 
DLL 2021-01-12 21:51:52.588199 - PARAMETER use_saved_learning_rate : False 
DLL 2021-01-12 21:51:52.588250 - PARAMETER learning_rate : 0.0001 
DLL 2021-01-12 21:51:52.588300 - PARAMETER weight_decay : 0.0 
DLL 2021-01-12 21:51:52.588351 - PARAMETER grad_clip_thresh : 3.4028234663852886e+38 
DLL 2021-01-12 21:51:52.588403 - PARAMETER batch_size : 4 
DLL 2021-01-12 21:51:52.588455 - PARAMETER grad_clip : 5.0 
DLL 2021-01-12 21:51:52.588506 - PARAMETER load_mel_from_disk : False 
DLL 2021-01-12 21:51:52.588557 - PARAMETER training_files : /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/train.csv 
DLL 2021-01-12 21:51:52.588607 - PARAMETER validation_files : /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/validate.csv 
DLL 2021-01-12 21:51:52.588657 - PARAMETER text_cleaners : ['basic_cleaners'] 
DLL 2021-01-12 21:51:52.588726 - PARAMETER max_wav_value : 32768.0 
DLL 2021-01-12 21:51:52.588791 - PARAMETER sampling_rate : 22050 
DLL 2021-01-12 21:51:52.588842 - PARAMETER filter_length : 1024 
DLL 2021-01-12 21:51:52.588891 - PARAMETER hop_length : 256 
DLL 2021-01-12 21:51:52.588941 - PARAMETER win_length : 1024 
DLL 2021-01-12 21:51:52.588991 - PARAMETER mel_fmin : 0.0 
DLL 2021-01-12 21:51:52.589043 - PARAMETER mel_fmax : 8000.0 
DLL 2021-01-12 21:51:52.589097 - PARAMETER rank : 0 
DLL 2021-01-12 21:51:52.589147 - PARAMETER world_size : 1 
DLL 2021-01-12 21:51:52.589197 - PARAMETER dist_url : tcp://localhost:23456 
DLL 2021-01-12 21:51:52.589247 - PARAMETER group_name : group_name 
DLL 2021-01-12 21:51:52.589298 - PARAMETER dist_backend : nccl 
DLL 2021-01-12 21:51:52.589348 - PARAMETER bench_class :  
DLL 2021-01-12 21:51:52.589397 - PARAMETER model_name : Tacotron2_PyT 
Loading checkpoint from symlink /home/user/checkpoints/checkpoint_WaveGlow_last.pt
scripts/train_waveglow.sh: line 2:   759 Killed                  python train.py -m WaveGlow -o /home/user/checkpoints/ -lr 1e-4 --epochs 1501 -bs 4 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 3.4028234663852886e+38 --cudnn-enabled --cudnn-benchmark --log-file /home/user/logs/nvlog.json --epochs-per-checkpoint 1 --dataset-path /home/user/datasets/finnish-single-speaker-speech-dataset/data --training-files /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/train.csv --validation-files /home/user/datasets/finnish-single-speaker-speech-dataset/filelists/validate.csv --resume-from-last --amp

Environment

I added a print statement to train.py. The crash occurs on the following line:

optimizer.load_state_dict(checkpoint['optimizer'])

Unfortunately, there is no stack trace or any other crash log. Could anyone suggest how I can debug the issue? What could be the reason?

ghost commented 3 years ago

Hi @IvanRubanov, without a stack trace it won't be easy, but let's try. Just to be sure, is /home/user/checkpoints/checkpoint_WaveGlow_last.pt a symlink or an actual file? Is it a checkpoint you trained yourself, or was it downloaded?
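
If it helps, here is a minimal sketch for checking this from Python using only the standard library. The helper name describe_checkpoint is hypothetical and is not part of the repository; it reports whether the path is a symlink, where it resolves to, and how large the target file is.

```python
import os


def describe_checkpoint(path):
    """Report whether `path` is a symlink, where it resolves, and the target size.

    Hypothetical helper for illustration; not part of DeepLearningExamples.
    """
    resolved = os.path.realpath(path)   # follows every symlink in the chain
    exists = os.path.exists(resolved)   # False if the link is dangling
    return {
        "is_symlink": os.path.islink(path),
        "resolved_path": resolved,
        "target_exists": exists,
        "size_bytes": os.path.getsize(resolved) if exists else None,
    }


if __name__ == "__main__":
    # Path taken from the log output in this issue.
    info = describe_checkpoint("/home/user/checkpoints/checkpoint_WaveGlow_last.pt")
    for key, value in info.items():
        print(f"{key}: {value}")
```

A dangling link (target_exists is False) or an unexpectedly small or large size_bytes would each point in a different direction.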

IvanRubanov commented 3 years ago

Hello, thanks for the reply. The file is a symlink. It is my own attempt to train WaveGlow. What should I do to provide a stack trace?
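
A note on the stack trace, offered as an assumption rather than a diagnosis: the "Killed" line in the output above is what the shell prints when a process receives SIGKILL. Python cannot intercept that signal, so no traceback is written. On Linux this is often the kernel's out-of-memory killer, though nothing in this thread confirms it. The sketch below shows two standard-library aids that could be added near the top of train.py: faulthandler, which dumps a traceback on fatal signals such as SIGSEGV (but not SIGKILL), and a peak-memory log line to print around the failing call. The function names are hypothetical.

```python
import faulthandler
import resource
import sys


def enable_crash_dumps(file=sys.stderr):
    """Dump Python tracebacks on fatal signals (SIGSEGV, SIGABRT, SIGBUS, ...).

    This cannot help with SIGKILL, which no process can catch.
    """
    faulthandler.enable(file=file)


def peak_rss_mib():
    """Peak resident memory of this process in MiB.

    Assumes Linux, where ru_maxrss is reported in kilobytes
    (macOS reports bytes instead).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_memory(label):
    """Print the peak memory so far; flush so the line survives a hard kill."""
    print(f"[mem] {label}: peak RSS {peak_rss_mib():.1f} MiB",
          file=sys.stderr, flush=True)


# Hypothetical placement inside train.py:
#   enable_crash_dumps()
#   log_memory("before optimizer.load_state_dict")
#   optimizer.load_state_dict(checkpoint['optimizer'])
#   log_memory("after optimizer.load_state_dict")
```

If the last line printed before the process dies shows memory close to the machine's RAM, the kernel log (for example via dmesg) is the place to look for confirmation.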