Open JoeyHeisenberg opened 4 years ago
Are you using the latest version of master? This problem occurred in earlier versions. @JoeyHeisenberg
Sorry for the late response. I cloned the repository just last week; here is the output of "git log". @Jeevesh8
Would it be possible for you to add the following code just above the last line of loss_function.py
print("Mel Loss:- ", mel_loss)
print("gate_loss :- ", gate_loss)
print("speaker_loss :- ", speaker_loss)
print("kl loss:- ", kl_loss)
print("Total Loss:- ", (mel_loss + gate_loss) + 0.02*speaker_loss +kl_loss)
and see which loss becomes NaN first?
Also, you can try reducing the learning rate and reducing hparams.mcn to 1 or 2.
It would be helpful if you could check whether the same thing happens on a single GPU too. Please attach your entire hparams.py file as well, if possible.
@JoeyHeisenberg
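A minimal sketch of the per-term check suggested above, so the first bad term is reported automatically instead of being read off the printed values. The helper names `first_non_finite` and `total_loss` are hypothetical and not part of the repository; the sketch assumes each loss has already been converted to a plain float (for a PyTorch scalar tensor, `.item()` does that). The weighting is copied from the print statement above.

```python
import math


def first_non_finite(losses):
    """Return the label of the first loss term that is NaN or infinite, else None.

    `losses` maps a label to a plain float (hypothetical helper, not repo code).
    """
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None


def total_loss(mel_loss, gate_loss, speaker_loss, kl_loss):
    """Combine the terms with the same weighting as the print statement above."""
    return (mel_loss + gate_loss) + 0.02 * speaker_loss + kl_loss


# Example with made-up values: one term has already become NaN.
terms = {
    "mel_loss": 0.8,
    "gate_loss": 0.1,
    "speaker_loss": float("nan"),
    "kl_loss": 0.3,
}
bad_term = first_non_finite(terms)
if bad_term is not None:
    print("First non-finite term:", bad_term)
```

Calling this once per iteration, before the terms are summed, should show which term goes bad first; the total alone cannot tell you, because a NaN in any term makes the sum NaN.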
I made a little correction. You can try with that. @JoeyHeisenberg
@Jeevesh8 I added the "print" code and ran it; here are the results.
I also pulled the latest code with git pull, but I got this error:
Traceback (most recent call last):
File "train.py", line 292, in
File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/residual_encoder.py", line 98, in calc_q_tilde
ans = p_zl_givn_yl*self.y_l.probs
RuntimeError: expected device cuda:0 but got device cpu
Here is the hparams.py:
`import tensorflow as tf
from text import symbols
def create_hparams(hparams_string=None, verbose=False):
"""Create model hyperparameters. Parse nondefault from given string."""
hparams = tf.contrib.training.HParams(
################################
# Experiment Parameters #
################################
epochs=500,
iters_per_checkpoint=1000,
seed=1234,
dynamic_loss_scaling=True,
fp16_run=False,
distributed_run=True,
dist_backend="nccl",
dist_url="tcp://localhost:54321",
cudnn_enabled=True,
cudnn_benchmark=False,
ignore_layers=['embedding.weight'],
################################
# Data Parameters #
################################
load_mel_from_disk=False,
training_files='./filelists/train.txt',
validation_files='./filelists/valid.txt',
text_cleaners=['basic_cleaners'],
################################
# Audio Parameters #
################################
max_wav_value=32768.0,
sampling_rate=16000,
filter_length=1280,
hop_length=320,
win_length=1280,
n_mel_channels=80,
mel_fmin=80.0,
mel_fmax=7600.0,
################################
# Model Parameters #
################################
n_symbols=len(symbols),
symbols_embedding_dim=512,
# Encoder parameters
encoder_kernel_size=5,
encoder_n_convolutions=3,
encoder_embedding_dim=512,
# Decoder parameters
n_frames_per_step=1, # currently only 1 is supported
decoder_rnn_dim=1024,
prenet_dim=256,
max_decoder_steps=1000,
gate_threshold=0.5,
p_attention_dropout=0.1,
p_decoder_dropout=0.1,
# Attention parameters
attention_rnn_dim=1024,
attention_dim=128,
# Location Layer parameters
attention_location_n_filters=32,
attention_location_kernel_size=31,
# Mel-post processing network parameters
postnet_embedding_dim=512,
postnet_kernel_size=5,
postnet_n_convolutions=5,
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True, # set model's padded outputs to padded values
###############################
# Speaker and Lang Embeddings #
###############################
speaker_embedding_dim=64,
lang_embedding_dim=3,
n_langs=2,
n_speakers=7,
###############################
## Speaker Classifier Params ##
###############################
hidden_sc_dim=256,
##############################
## Residual Encoder Params ##
##############################
residual_encoding_dim=32, # 16 for q(z_l|X) and 16 for q(z_o|X)
dim_yo=7, #(==n_speakers) dim(y_{o})
dim_yl=10, #K
mcn=8 # n for monte carlo sampling of q(z_l|X)and q(z_o|X)
)
if hparams_string:
tf.logging.info('Parsing command line hparams: %s', hparams_string)
hparams.parse(hparams_string)
if verbose:
tf.logging.info('Final parsed hparams: %s', hparams.values())
return hparams
`
That error in the latest code has now been fixed, @JoeyHeisenberg. You can try again.
I reduced the learning rate from 1e-3 to 1e-4 and the batch size to 16; so far so good. @Jeevesh8
Great, @JoeyHeisenberg! Which languages are you training on, if I may ask? Would you mind sharing the learned weights? I currently don't have access to much compute, so I can't train my own.
Thanks for helping. The model is still training. It is being trained on a Chinese dataset and an English dataset. Sorry, I can't share the learned weights because I use my company's private dataset, and they don't allow us to share it. Maybe you can try open-source data such as https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar (CSMSC, Chinese TTS data recorded by a single speaker) and some speech recognition datasets.
Okay. Thanks :) But please let me know after training whether the results turned out to be good, because so far I have only run on a dummy dataset on Colab. 👍 Also, would it be possible for you to share some compute resources? I have my own dataset of some Indic languages, but the estimated training time on Colab's GPU is around 1 month. We are a start-up, so if you or your company want to collaborate, we can. @JoeyHeisenberg
The loss seems very large, and the alignment is bad.
I tried to generate some wavs but failed.
@JoeyHeisenberg you can use clvc-infer-gh.ipynb to produce wav files
It didn't synthesize reasonable wavs (with my dataset), and I am tied up with another project. I'll send the results if I make any progress.
Could you at least hear some words @JoeyHeisenberg ? I would write code to check if any probability distribution has collapsed to its mode etc. You can use that. Or you can check yourself. Let me know.
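One possible shape for such a collapse check, as a rough sketch only. It assumes the distribution is available as a flat list of probabilities (for example, a categorical's `probs` tensor converted with `.tolist()`); the function names and the 0.99 threshold are hypothetical choices, not code from the repository.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution given as floats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def looks_collapsed(probs, max_prob_threshold=0.99):
    """Heuristic: True when nearly all probability mass sits on one category.

    The threshold is an arbitrary illustrative value, not a tuned constant.
    """
    return max(probs) >= max_prob_threshold


# A uniform distribution over 10 categories (dim_yl=10 in the hparams above)
# has the maximum possible entropy, log(10).
uniform = [0.1] * 10
# A nearly one-hot distribution has entropy close to zero.
peaked = [0.995] + [0.005 / 9] * 9

for name, dist in (("uniform", uniform), ("peaked", peaked)):
    print(name, "entropy:", entropy(dist), "collapsed:", looks_collapsed(dist))
```

Logging the entropy over training is usually more informative than a single yes/no flag, since a steady slide toward zero shows a collapse in progress.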
I loaded the checkpoint-97000 model, and it didn't synthesize the wavs. Here is the figure of mel_outputs, mel_outputs_postnet, and alignments.
Here is the validation loss; it is stuck at around 2.0+.
@JoeyHeisenberg how did you set up for multi-GPU on a single system?
@JoeyHeisenberg You can try training further with reduced learning rate.
@Jeevesh8 I tried it, but it gets stuck for hours after "Done initializing distributed".
@JoeyHeisenberg Can you show your mel target and mel predicted in tensorboard images tab?
@singhaki I set it up on 3 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2 nohup python -u multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=3 --hparams=distributed_run=True > log_tacotron2_v2.file 2>&1 &
@Jeevesh8 Here is the mel target and mel predicted
@JoeyHeisenberg The mel-specs seem close.
1.) Can you upload your log directory to Google Drive and share it? Or tell me whether the attention maps (in the Images tab) are still improving during the epochs when the loss is constant? (I read an issue on nvidia/tacotron2 where they said that although the loss becomes constant after some epochs, the alignments keep improving. I will link that issue here when I find it.)
2.) Also, can you tell me whether these mel-spec alignments correspond to inference in the cross-lingual case or the same-language case?
@JoeyHeisenberg Also, please let me know what happens when you train further with a lower learning rate.
I checked the audio, and the data scale is int16, as follows:
As for the silent parts at the beginning and end, I set "start0" and "end0" for them, so that should be fine.
Now I'm facing two problems. The first is that on TensorBoard the predicted mel-spec looks the same as the target mel-spec, but the alignment doesn't look right (same-language case). I have already trained for 96000 steps with batch size 24; maybe I should train further with a lower learning rate, as you mentioned. The second is that I cannot synthesize normal audio with my waveGan.
Here are two samples from train.txt. All phonemes are listed in symbols.py, and I didn't change any other code.
Chinese:
000718.wav|start0 ou2 er2 m ai3 i4 x ie1 z iy1 l iao4 h uo4 i4 sh uang1 ua4 z iy5 sp1 d eng3 c uen2 g ou4 l e5 sp1 h uei4 q v4 m ai3 i2 t ao4 x in1 i1 sh ang5 end0|0|0
English:
004291.wav|start0 T EY1 K IH0 NG AH1 P S M OW1 K IH0 NG R AE1 NG K S HH AY1 IH0 N AE1 K SH AH0 N Z P IY1 P AH0 L W IH1 SH DH EY1 K UH1 D S T AH1 B AW1 T end0|6|1
Maybe I made some mistakes in the text processing. I will check the code to fix these problems, but I have to work on another project first. I will let you know if I make progress. Thank you very much for your help. @Jeevesh8
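One way to rule out the text-processing side is to validate every filelist line before training. The sketch below assumes the four `|`-separated fields are wav path, space-separated phoneme tokens, speaker id, and language id; that reading is inferred from the two samples above together with n_speakers=7 and n_langs=2, and is not confirmed by the repository. The function names are hypothetical, and `symbols` stands in for whatever list symbols.py exports.

```python
def parse_filelist_line(line):
    """Split 'wav_path|tokens|speaker_id|lang_id' into typed fields.

    The field meaning is an assumption inferred from the samples in this thread.
    """
    parts = line.strip().split("|")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}"
        )
    wav_path, text, speaker_id, lang_id = parts
    return wav_path, text.split(), int(speaker_id), int(lang_id)


def check_line(line, symbols, n_speakers, n_langs):
    """Return a list of human-readable problems found in one filelist line."""
    wav_path, tokens, speaker_id, lang_id = parse_filelist_line(line)
    problems = []
    unknown = sorted(set(tokens) - set(symbols))
    if unknown:
        problems.append(f"{wav_path}: tokens missing from symbols: {unknown}")
    if not 0 <= speaker_id < n_speakers:
        problems.append(f"{wav_path}: speaker id {speaker_id} out of range")
    if not 0 <= lang_id < n_langs:
        problems.append(f"{wav_path}: language id {lang_id} out of range")
    return problems


# Shortened, made-up example line in the same layout as the samples above.
example = "004291.wav|start0 T EY1 K end0|6|1"
example_symbols = ["start0", "end0", "T", "EY1", "K"]
for problem in check_line(example, example_symbols, n_speakers=7, n_langs=2):
    print(problem)
```

Running this over train.txt and valid.txt would catch phonemes that never made it into symbols.py, as well as ids that fall outside the embedding tables.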
@JoeyHeisenberg Hello! I want to ask whether the multilingual model you trained is effective.
I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss became NaN.
hparams:
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True, # set model's padded outputs to padded values
nvidia-smi when failed
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it

Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
Train loss 984 nan Grad Norm nan 1.98s/it
Train loss 985 nan Grad Norm nan 2.05s/it
Train loss 986 nan Grad Norm nan 1.95s/it
Train loss 987 nan Grad Norm nan 1.87s/it
Train loss 988 nan Grad Norm nan 1.91s/it
Train loss 989 nan Grad Norm nan 1.77s/it
Train loss 990 nan Grad Norm nan 2.19s/it
Train loss 991 nan Grad Norm nan 1.92s/it
Train loss 992 nan Grad Norm nan 1.88s/it
Train loss 993 nan Grad Norm nan 2.35s/it
Train loss 994 nan Grad Norm nan 1.84s/it
Train loss 995 nan Grad Norm nan 2.05s/it
Train loss 996 nan Grad Norm nan 1.93s/it
Train loss 997 nan Grad Norm nan 2.42s/it
Train loss 998 nan Grad Norm nan 2.27s/it
Train loss 999 nan Grad Norm nan 2.28s/it
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
File "train.py", line 292, in <module>
args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 250, in train
hparams.distributed_run, rank)
File "train.py", line 147, in validate
logger.log_validation(val_loss, model, y, y_pred, iteration)
File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
self.add_histogram(tag, value.data.cpu().numpy(), iteration)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
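For long logs like the one above, a small script can find the first iteration where the loss or the gradient norm stops being finite. This is a sketch that assumes the "Train loss <iteration> <loss> Grad Norm <norm>" layout seen in this log; the function name and the regular expression are hypothetical and not part of the repository.

```python
import math
import re

# Matches entries such as: Train loss 8 27.028851 Grad Norm nan 1.94s/it
ENTRY_RE = re.compile(r"Train loss (\d+) (\S+) Grad Norm (\S+)")


def first_non_finite_step(log_text):
    """Return (iteration, field) for the first NaN/inf value, or None if all are finite."""
    for match in ENTRY_RE.finditer(log_text):
        iteration = int(match.group(1))
        loss = float(match.group(2))       # float("nan") parses fine
        grad_norm = float(match.group(3))
        if not math.isfinite(loss):
            return iteration, "loss"
        if not math.isfinite(grad_norm):
            return iteration, "grad_norm"
    return None


# Three entries copied from the log above.
sample = (
    "Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it\n"
    "Train loss 8 27.028851 Grad Norm nan 1.94s/it\n"
    "Train loss 9 nan Grad Norm nan 1.90s/it\n"
)
print(first_non_finite_step(sample))
```

On these entries the gradient norm turns NaN at iteration 8, one step before the loss itself does. Because the pattern is searched across the whole text, the script also works when the log entries are not one per line.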