deterministic-algorithms-lab / Cross-Lingual-Voice-Cloning

Tacotron 2 - PyTorch implementation with faster-than-realtime inference, modified to enable cross-lingual voice cloning.
BSD 3-Clause "New" or "Revised" License

Train failed because the loss is nan #3

Open JoeyHeisenberg opened 4 years ago

JoeyHeisenberg commented 4 years ago

I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss became NaN.

hparams:

```python
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True,  # set model's padded outputs to padded values
```

`nvidia-smi` output when it failed: [screenshot]

```
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
[the same UserWarning is printed by each of the 3 GPU processes]
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it
```
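In this log the gradient norm explodes (~2.4e12 at iteration 0) several iterations before the loss itself turns NaN. A common guard for this failure pattern, shown here only as a minimal sketch (it is not necessarily what this repo's train.py does; the loop variables are generic Tacotron 2-style names), is to skip the optimizer step whenever the total gradient norm is non-finite:

```python
import math

import torch

# Minimal sketch of a guarded training step. Assumption: generic
# model/optimizer/criterion/x/y variables as in a standard Tacotron 2 loop.
def guarded_step(model, optimizer, criterion, x, y, grad_clip_thresh=1.0):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping; if it is
    # inf/nan, applying the update would poison the weights, so skip it.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    if math.isfinite(float(grad_norm)):
        optimizer.step()
    else:
        print("skipping step: non-finite grad norm", grad_norm)
    return loss.item(), grad_norm
```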

```
Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
Train loss 984 nan Grad Norm nan 1.98s/it
Train loss 985 nan Grad Norm nan 2.05s/it
Train loss 986 nan Grad Norm nan 1.95s/it
Train loss 987 nan Grad Norm nan 1.87s/it
Train loss 988 nan Grad Norm nan 1.91s/it
Train loss 989 nan Grad Norm nan 1.77s/it
Train loss 990 nan Grad Norm nan 2.19s/it
Train loss 991 nan Grad Norm nan 1.92s/it
Train loss 992 nan Grad Norm nan 1.88s/it
Train loss 993 nan Grad Norm nan 2.35s/it
Train loss 994 nan Grad Norm nan 1.84s/it
Train loss 995 nan Grad Norm nan 2.05s/it
Train loss 996 nan Grad Norm nan 1.93s/it
Train loss 997 nan Grad Norm nan 2.42s/it
Train loss 998 nan Grad Norm nan 2.27s/it
Train loss 999 nan Grad Norm nan 2.28s/it
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 250, in train
    hparams.distributed_run, rank)
  File "train.py", line 147, in validate
    logger.log_validation(val_loss, model, y, y_pred, iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
    self.add_histogram(tag, value.data.cpu().numpy(), iteration)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
```
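Note that the crash itself is secondary: `make_histogram` raises "The histogram is empty" because every value passed to `add_histogram` is NaN by iteration 1000. A defensive variant of the logging call in logger.py (a sketch only; `safe_histogram` is an assumed helper, not the repo's code) would surface the real NaN problem instead of a TensorBoard exception:

```python
import numpy as np

# Sketch of a defensive histogram logger. Assumption: `self` is the
# SummaryWriter subclass used by logger.py's log_validation.
def safe_histogram(self, tag, value, iteration):
    data = value.data.cpu().numpy()
    finite = data[np.isfinite(data)]  # drop nan/inf before TensorBoard sees them
    if finite.size > 0:
        self.add_histogram(tag, finite, iteration)
    else:
        print("skipping histogram '%s': no finite values at iteration %d" % (tag, iteration))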

Jeevesh8 commented 4 years ago

Are you using the latest version of master? This problem occurred in earlier versions. @JoeyHeisenberg

JoeyHeisenberg commented 4 years ago

Sorry for the late response. I cloned it just last week; here is the result of `git log`: [screenshot] @Jeevesh8

Jeevesh8 commented 4 years ago

Would it be possible for you to add the following code just above the last line of loss_function.py

```python
print("Mel Loss:- ", mel_loss)
print("gate_loss :- ", gate_loss)
print("speaker_loss :- ", speaker_loss)
print("kl loss:- ", kl_loss)
print("Total Loss:- ", (mel_loss + gate_loss) + 0.02*speaker_loss + kl_loss)
```

and see which loss becomes NaN first? Also, you can try reducing the learning rate, and reducing hparams.mcn to 1 or 2. It would be helpful if you could check whether the same thing happens on a single GPU too. Please attach your entire hparams.py file as well, if possible. @JoeyHeisenberg

Jeevesh8 commented 4 years ago

I made a little correction. You can try with that. @JoeyHeisenberg

JoeyHeisenberg commented 4 years ago

@Jeevesh8 I added the print statements and ran it; here are the results: [screenshot]

and I pulled the latest code, but I got this error:

```
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 216, in train
    y_pred = model(x)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/model.py", line 562, in forward
    encoder_outputs, mels, memory_lengths=text_lengths, speaker=speaker, lang=lang)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/model.py", line 431, in forward
    residual_encoding = self.residual_encoder(decoder_inputs)
  File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 112, in forward
    self.calc_q_tilde(z_l)
  File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 98, in calc_q_tilde
    ans = p_zl_givn_yl*self.y_l.probs
RuntimeError: expected device cuda:0 but got device cpu
```

here is the hparams.py:

```python
import tensorflow as tf
from text import symbols


def create_hparams(hparams_string=None, verbose=False):
    """Create model hyperparameters. Parse nondefault from given string."""

    hparams = tf.contrib.training.HParams(
        ################################
        # Experiment Parameters        #
        ################################
        epochs=500,
        iters_per_checkpoint=1000,
        seed=1234,
        dynamic_loss_scaling=True,
        fp16_run=False,
        distributed_run=True,
        dist_backend="nccl",
        dist_url="tcp://localhost:54321",
        cudnn_enabled=True,
        cudnn_benchmark=False,
        ignore_layers=['embedding.weight'],

        ################################
        # Data Parameters              #
        ################################
        load_mel_from_disk=False,
        training_files='./filelists/train.txt',
        validation_files='./filelists/valid.txt',
        text_cleaners=['basic_cleaners'],

        ################################
        # Audio Parameters             #
        ################################
        max_wav_value=32768.0,
        sampling_rate=16000,
        filter_length=1280,
        hop_length=320,
        win_length=1280,
        n_mel_channels=80,
        mel_fmin=80.0,
        mel_fmax=7600.0,

        ################################
        # Model Parameters             #
        ################################
        n_symbols=len(symbols),
        symbols_embedding_dim=512,

        # Encoder parameters
        encoder_kernel_size=5,
        encoder_n_convolutions=3,
        encoder_embedding_dim=512,

        # Decoder parameters
        n_frames_per_step=1,  # currently only 1 is supported
        decoder_rnn_dim=1024,
        prenet_dim=256,
        max_decoder_steps=1000,
        gate_threshold=0.5,
        p_attention_dropout=0.1,
        p_decoder_dropout=0.1,

        # Attention parameters
        attention_rnn_dim=1024,
        attention_dim=128,

        # Location Layer parameters
        attention_location_n_filters=32,
        attention_location_kernel_size=31,

        # Mel-post processing network parameters
        postnet_embedding_dim=512,
        postnet_kernel_size=5,
        postnet_n_convolutions=5,

        ################################
        # Optimization Hyperparameters #
        ################################
        use_saved_learning_rate=False,
        learning_rate=1e-3,
        weight_decay=1e-6,
        grad_clip_thresh=1.0,
        batch_size=24,
        mask_padding=True,  # set model's padded outputs to padded values

        ###############################
        # Speaker and Lang Embeddings #
        ###############################
        speaker_embedding_dim=64,
        lang_embedding_dim=3,
        n_langs=2,
        n_speakers=7,

        ###############################
        ## Speaker Classifier Params ##
        ###############################
        hidden_sc_dim=256,

        ##############################
        ## Residual Encoder Params  ##
        ##############################
        residual_encoding_dim=32,  # 16 for q(z_l|X) and 16 for q(z_o|X)
        dim_yo=7,                  # (== n_speakers) dim(y_o)
        dim_yl=10,                 # K
        mcn=8                      # n for Monte Carlo sampling of q(z_l|X) and q(z_o|X)
    )

    if hparams_string:
        tf.logging.info('Parsing command line hparams: %s', hparams_string)
        hparams.parse(hparams_string)

    if verbose:
        tf.logging.info('Final parsed hparams: %s', hparams.values())

    return hparams
```

Jeevesh8 commented 4 years ago

That error has been removed in the latest code now, @JoeyHeisenberg. You can try again.

JoeyHeisenberg commented 4 years ago

I reduced the learning rate from 1e-3 to 1e-4 and reduced the batch size to 16; so far so good. @Jeevesh8 [screenshot]
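For what it's worth, further reductions can be automated rather than hand-edited. A sketch (not part of the repo's train.py) using PyTorch's plateau scheduler:

```python
import torch

# Halve the learning rate whenever validation loss stops improving for 5
# consecutive validations. Assumption: `model` is the Tacotron2 instance
# from train.py; optimizer settings mirror hparams.py.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

# after each validation pass:
# scheduler.step(val_loss)
```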

Jeevesh8 commented 4 years ago

Great, @JoeyHeisenberg! In which languages are you training, if I may ask? Would you mind sharing the learned weights? I currently don't have access to much compute, so I can't train my own.

JoeyHeisenberg commented 4 years ago

Thanks for helping. The model is still training. It is being trained on a Chinese dataset and an English dataset. Sorry that I can't share the learned weights: I use my company's private dataset, and they don't allow us to share it. Maybe you can try open-source data like https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar (CSMSC, Chinese TTS data recorded by a single speaker) and some speech recognition datasets.

JoeyHeisenberg commented 4 years ago

[screenshot]

Jeevesh8 commented 4 years ago

Okay, thanks :) But please let me know after training whether the results turned out well, because so far I have only run on a dummy dataset on Colab. 👍 Also, would it be possible for you to share some compute resources? I have my own dataset of some Indic languages, but the estimated training time on Colab's GPU is around one month. Actually, we are a start-up, so if you or your company want to collaborate, we can. @JoeyHeisenberg

JoeyHeisenberg commented 4 years ago

The loss seems very large, and the alignment is bad: [screenshots]

I tried to generate some wavs but failed: [screenshot]

Jeevesh8 commented 4 years ago

@JoeyHeisenberg you can use `clvc-infer-gh.ipynb` to produce wav files.

JoeyHeisenberg commented 4 years ago

It didn't synthesize reasonable wavs (with my dataset), and I'm stuck with another project. I'll send the results if I make any progress.

Jeevesh8 commented 4 years ago

Could you at least hear some words, @JoeyHeisenberg? I could write code to check whether any probability distribution has collapsed to its mode, etc., which you could use; or you can check yourself. Let me know.
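One concrete version of that check, as a sketch: for a Gaussian posterior q(z|X) from the residual encoder, per-dimension KL to the N(0, I) prior near zero means that dimension has collapsed to the prior; for a Categorical such as y_l, near-zero entropy of `.probs` means it has collapsed to its mode. The variable names below (`mu`, `logvar`) are hypothetical and need adapting to however residual_encoder.py exposes them:

```python
import torch

def kl_per_dim(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ) per latent dimension, averaged over the batch."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return kl.mean(dim=0)

# Dimensions with KL ~ 0 carry no information about X (collapsed to the prior):
# active_dims = (kl_per_dim(mu, logvar) > 0.01).sum()

# For a Categorical like y_l, low entropy indicates collapse to its mode:
# torch.distributions.Categorical(probs=y_l_probs).entropy()
```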

JoeyHeisenberg commented 4 years ago

I loaded the checkpoint-97000 model, and it didn't synthesize the wavs. Here are the figures of mel_outputs, mel_outputs_postnet, and alignments:

[screenshot]

Here is the validation loss; it is stuck at around 2.0+: [screenshot]

akashicMarga commented 4 years ago

@JoeyHeisenberg how did you set up for multi-GPU on a single system?

Jeevesh8 commented 4 years ago

@JoeyHeisenberg You can try training further with a reduced learning rate.

Jeevesh8 commented 4 years ago

@singhaki you can do it like this.

akashicMarga commented 4 years ago

@Jeevesh8 I tried it, but it gets stuck for hours after "Done initializing distributed".
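A hang right after initialization usually means the NCCL rendezvous or collective communication is stuck rather than the model. A minimal sanity check, as a sketch (assumptions: single node, one process per GPU, and the same backend/init_method as hparams.py):

```python
import os

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")  # make NCCL print its setup/failure details

def sanity_check(rank: int, world_size: int) -> None:
    # Mirrors hparams.py: dist_backend="nccl", dist_url="tcp://localhost:54321".
    dist.init_process_group("nccl", init_method="tcp://localhost:54321",
                            world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device="cuda:%d" % rank)
    dist.all_reduce(t)  # if this hangs, the problem is NCCL, not the model
    print("rank %d: all_reduce ok, value=%f" % (rank, t.item()))
```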

Jeevesh8 commented 4 years ago

@JoeyHeisenberg Can you show your mel target and mel predicted in the TensorBoard images tab?

JoeyHeisenberg commented 4 years ago

@singhaki I set it up on 3 GPUs:

```
CUDA_VISIBLE_DEVICES=0,1,2 nohup python -u multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=3 --hparams=distributed_run=True > log_tacotron2_v2.file 2>&1 &
```

@Jeevesh8 Here are the mel target and mel predicted: [screenshot]

Jeevesh8 commented 4 years ago

@JoeyHeisenberg The mel-specs seem close.

1.) Can you upload your log directory to Google Drive and share it? Or tell me whether the attention maps (in the images tab only) are improving during the epochs when the loss is constant? (I read an issue on nvidia/tacotron2 saying that although the loss becomes constant after some epochs, the alignments keep improving; I will link that issue here when I find it.)

2.) Also, can you tell whether these mel-spec alignments correspond to inference in the cross-lingual case or the same-language case?

Jeevesh8 commented 4 years ago

@JoeyHeisenberg Also, please let me know what happens when you train further with a lower learning rate.

Jeevesh8 commented 4 years ago

@JoeyHeisenberg please make sure all 3 points here are true.

JoeyHeisenberg commented 4 years ago

I checked the audio; the data scale is int16, as follows: [screenshot]

And about the silent parts at the beginning and end: I actually set "start0" and "end0" tokens for them, so that should be fine.
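A quick way to confirm every clip really matches the hparams (sampling_rate=16000 and int16 PCM, which is what makes max_wav_value=32768.0 correct), as a sketch with a placeholder filename:

```python
from scipy.io import wavfile

# Check one clip against hparams: 16 kHz, int16 PCM.
# The filename is a placeholder; loop over the filelist in practice.
sr, data = wavfile.read("000718.wav")
assert sr == 16000, "unexpected sampling rate: %d" % sr
assert data.dtype.name == "int16", "unexpected dtype: %s" % data.dtype
print(sr, data.dtype, data.min(), data.max())
```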

Now I'm facing two problems. The first is that on TensorBoard the predicted mel-spec looks the same as the target mel-spec, but the alignment doesn't look right (same-language case); I have already trained 96000 steps with batch size 24, so maybe I should train further with a lower lr as you mentioned. The second is that I cannot synthesize normal audio with my WaveGAN vocoder.

Here are two samples from train.txt (all phonemes are in symbols.py, and I didn't change any other code):

Chinese:

```
000718.wav|start0 ou2 er2 m ai3 i4 x ie1 z iy1 l iao4 h uo4 i4 sh uang1 ua4 z iy5 sp1 d eng3 c uen2 g ou4 l e5 sp1 h uei4 q v4 m ai3 i2 t ao4 x in1 i1 sh ang5 end0|0|0
```

English:

```
004291.wav|start0 T EY1 K IH0 NG AH1 P S M OW1 K IH0 NG R AE1 NG K S HH AY1 IH0 N AE1 K SH AH0 N Z P IY1 P AH0 L W IH1 SH DH EY1 K UH1 D S T AH1 B AW1 T end0|6|1
```
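Since a text-processing mistake is the suspicion, a small validator can confirm that every line of train.txt has the 4-field layout above and that every phoneme token exists in symbols.py. A sketch (assumption: symbols.py stores whole phoneme tokens, as the comment above suggests):

```python
from text import symbols  # the same symbol inventory the model embeds

known = set(symbols)

# Expected layout per line (from the samples above):
# audio_path|phoneme string|speaker_id|lang_id
with open("./filelists/train.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != 4:
            print("line %d: expected 4 fields, got %d" % (lineno, len(fields)))
            continue
        _, text, speaker_id, lang_id = fields
        unknown = [tok for tok in text.split() if tok not in known]
        if unknown:
            print("line %d: tokens missing from symbols.py: %s" % (lineno, unknown))
```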

Maybe I made some mistakes in text processing; I will check the code to fix these problems, but recently I have had to work on another project first. I will let you know if I make progress. Thank you very much for your help. @Jeevesh8

c9412600 commented 3 years ago

@JoeyHeisenberg Hello! I want to ask whether the multilingual model you trained turned out to be effective.