Closed: yliess86 closed this 6 years ago
Thank you for sharing this. Could you share your hparams file and training curves? Did you train this with eight GTX 980Ms or just a single GTX 980M? Did you make any modifications to the model, hparams, or data?
The only modification I made to the hparams is the batch size, which is now 24. And yes, it is with only a single GPU.
I don't know if you can access my TensorBoard events.out files by clicking on the Logs URL: (Logs) If you prefer, I can take some screenshots tomorrow.
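If the link does not open, the scalars can also be read straight from the events.out files; a minimal sketch, assuming the tensorboard Python package is installed (the tag name below is a placeholder for whatever the logger actually wrote):

```python
# Minimal sketch: reading logged scalars out of TensorBoard events.out
# files. 'training.loss' is a placeholder tag; list the real tags
# first with ea.Tags().
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("logdir")  # directory containing the events.out.* files
ea.Reload()                      # parse the event files

print(ea.Tags()["scalars"])      # tags that were actually logged
for event in ea.Scalars("training.loss"):
    print(event.step, event.value)
```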
Here are the params:

```python
################################
# Experiment Parameters        #
################################
epochs=500,
iters_per_checkpoint=500,
seed=1234,
dynamic_loss_scaling=True,
fp16_run=False,
distributed_run=False,
dist_backend="nccl",
dist_url="file://distributed.dpt",
cudnn_enabled=True,
cudnn_benchmark=False,

################################
# Data Parameters              #
################################
load_mel_from_disk=False,
training_files='filelists/ljs_audio_text_train_filelist.txt',
validation_files='filelists/ljs_audio_text_val_filelist.txt',
text_cleaners=['english_cleaners'],
sort_by_length=False,

################################
# Audio Parameters             #
################################
max_wav_value=32768.0,
sampling_rate=22050,
filter_length=1024,
hop_length=256,
win_length=1024,
n_mel_channels=80,
mel_fmin=0.0,
mel_fmax=None,  # if None, half the sampling rate

################################
# Model Parameters             #
################################
n_symbols=len(symbols),
symbols_embedding_dim=512,

# Encoder parameters
encoder_kernel_size=5,
encoder_n_convolutions=3,
encoder_embedding_dim=512,

# Decoder parameters
n_frames_per_step=1,
decoder_rnn_dim=1024,
prenet_dim=256,
max_decoder_steps=1000,
gate_threshold=0.6,

# Attention parameters
attention_rnn_dim=1024,
attention_dim=128,

# Location Layer parameters
attention_location_n_filters=32,
attention_location_kernel_size=31,

# Mel-post processing network parameters
postnet_embedding_dim=512,
postnet_kernel_size=5,
postnet_n_convolutions=5,

################################
# Optimization Hyperparameters #
################################
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1,
batch_size=24,
mask_padding=False  # set model's padded outputs to padded values
```
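Since batch_size is the only value I changed, the override fits in a single name=value pair; a minimal sketch, assuming the repo's create_hparams() helper in hparams.py, which parses a comma-separated "name=value" string:

```python
# Minimal sketch: overriding only batch_size and keeping every other
# default, assuming the repo's create_hparams() helper.
from hparams import create_hparams

hparams = create_hparams("batch_size=24")
print(hparams.batch_size)  # 24; all other values keep their defaults
```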
Please do share an image of the training/validation/etc curves.
Can you please pull from master and train again? On a single GPU with batch size 48 the model learns ok attention in 8.5k iterations.
I am finally starting to see improvement in the alignment. I think it is just taking so long because of the batch size (24) and because of my GPU. I will let you know when the model generates decent-sounding audio, and I will share all my curves.
How many iterations did it take before you saw improvement in the alignment? And is there anything you think helps with learning alignment? I have trained for 29,000 iterations with batch size 64 on a single GPU. I decayed the learning rate slowly from 0.001 to 0.0002 (roughly as in the sketch below); the other hparams are the same as in hparams.py. Here is my alignment plot at 29,200 iterations: it seems worse than yours at 31k iterations.
PS: I trained on the LJ Speech dataset too, and the loss is stuck between 0.6 and 0.5.
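Roughly, the decay looks like this (a minimal PyTorch sketch; only the 1e-3 and 2e-4 endpoints come from what I described above, the warmup and decay_span values are placeholders since I only decayed "slowly" rather than on a fixed schedule):

```python
# Minimal sketch: slowly annealing the learning rate from 1e-3 down to
# a 2e-4 floor. warmup and decay_span are placeholder values.
import torch

model = torch.nn.Linear(80, 80)  # stand-in for the Tacotron 2 model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)

def anneal_learning_rate(iteration, warmup=10000, decay_span=20000):
    progress = min(max(iteration - warmup, 0) / decay_span, 1.0)
    lr = 1e-3 + progress * (2e-4 - 1e-3)  # linear decay from 1e-3 to 2e-4
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr
    return lr
```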
I did not have any alignment at 29k iterations either; it only started to align after 31k and showed 'good' alignment after 38k. As you can see, it was not good for me at 29k.
@yliess86 is this using the changes made to master yesterday? @ZuoChenFttS Note that we made changes to master yesterday to speed up convergence. If you haven't pulled them, please do so and train again. On an NVIDIA Volta GPU with 16 GB and batch size 48, the current code on master learns OK attention at 8.5k iterations.
@rafaelvalle thanks for your help. I trained from master over the last weekend. After 9,600 iterations, I get the alignment plot here. It seems to be starting to learn "good" alignment. I'll continue training it. When the model generates good audio, I will share an example and the curves.
@rafaelvalle No, this was not using the latest changes, since I started the training 9 days ago. It is at about the same state as @ZuoChenFttS's now. Considering I need to make a prototype for my project, I will continue training my model with this version (I don't want to stop now, after 9 days), but I will definitely retrain with the new changes afterwards. Thank you.
Sounds good! We modified the embedding layer to be initialized with Xavier uniform, and this change makes the model learn attention much faster, at around 8k iterations. Your attention looks good! Just keep training, and anneal the learning rate once the slope of the training loss becomes flat.
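In code, the change amounts to something like this (a minimal sketch of the idea rather than the exact commit; n_symbols below is a placeholder for len(symbols)):

```python
# Minimal sketch: Xavier (Glorot) uniform initialization of the symbol
# embedding, the change described above. Dimensions follow the hparams
# posted earlier in this thread.
import torch.nn as nn

n_symbols = 148               # placeholder for len(symbols)
symbols_embedding_dim = 512

embedding = nn.Embedding(n_symbols, symbols_embedding_dim)
nn.init.xavier_uniform_(embedding.weight)
```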
@yliess86 I saw you asking about loss in other issues. My validation loss at 19,600 iterations is 0.486099, after about 100 epochs with batch size 64. What is your loss now? My model can now synthesize audio in which I can recognize the words, but the sound has a lot of noise. By the way, my synthesis code is like inference.ipynb, so I am not using the WaveNet vocoder.
@ZuoChenFttS Hi, I started training mine before the changes in master (without Xavier initialization), so my validation loss is 0.46 at 100,000 iterations with batch size 24. The audio is quite audible and we can easily understand what it says, but there is still noise, and when I use WaveNet it sounds like it has the flu. I don't know if we can achieve the same loss with this repo, since Rayhane-Mamah's implementation is a bit different, but I've read that he achieved about 0.18 loss on the LJ Speech dataset. So maybe we just have to train even more.
Here is an audio example if you want to listen: example It was generated from the checkpoint at iteration 70000.
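I generated it roughly the way inference.ipynb does; a minimal sketch (the checkpoint path and text are placeholders, and the exact return signature may differ between commits):

```python
# Minimal sketch of mel generation from a checkpoint, along the lines
# of this repo's inference.ipynb. Paths and text are placeholders.
import numpy as np
import torch
from hparams import create_hparams
from model import Tacotron2
from text import text_to_sequence

hparams = create_hparams()
model = Tacotron2(hparams).cuda().eval()
model.load_state_dict(torch.load("checkpoint_70000")["state_dict"])

text = "Waveforms can then be recovered with Griffin-Lim or WaveNet."
sequence = np.array(text_to_sequence(text, ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
```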
@ZuoChenFttS Can you share your audio sample, an image of the mel spectrogram, and the mel spectrogram file? Using the WaveNet decoder is very important for audio quality! @yliess86 the loss is not directly comparable, because Rayhane-Mamah uses a mel representation different from this repo's. I don't remember exactly, but I think the demo sample on this repo was generated with a model at ~0.4 loss.
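For context on why the numbers differ: this repo log-compresses its mels (a sketch of the compression in audio_processing.py), while Rayhane-Mamah's repo normalizes its mels to a different range, so an MSE loss computed on the two representations is not on the same scale:

```python
# Sketch: this repo's dynamic range compression of the mel spectrogram
# (natural log of the clamped linear-amplitude mel). A loss computed on
# these values is not comparable to one on differently normalized mels.
import torch

def dynamic_range_compression(x, clip_val=1e-5):
    return torch.log(torch.clamp(x, min=clip_val))
```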
@rafaelvalle Thank you, I wasn't sure about this.
@rafaelvalle Here is an audio example and the mel_outputs file. This is not using the WaveNet decoder; I will try it later. text = 'The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.' english_23200iter.zip
Plot:
@ZuoChenFttS Your audio has "noise" because it was synthesized using Griffin-Lim. You'll need WaveNet to reach state-of-the-art quality.
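For anyone following along, Griffin-Lim estimates the phase iteratively from the magnitude spectrogram alone, which is where the artifacts come from. A rough librosa sketch of such an inversion (not this repo's synthesis code; it assumes a linear-amplitude mel and uses the audio parameters from the hparams above, with a placeholder file path):

```python
# Rough sketch: Griffin-Lim inversion of a mel spectrogram with
# librosa. Assumes "mel_outputs.npy" holds a linear-amplitude
# (n_mel_channels=80, frames) array; the path is a placeholder.
import librosa
import numpy as np
import soundfile as sf

mel = np.load("mel_outputs.npy")

audio = librosa.feature.inverse.mel_to_audio(
    mel,
    sr=22050,
    n_fft=1024,
    hop_length=256,
    win_length=1024,
    power=1.0,   # assumes the mel holds amplitudes, not power
    fmin=0.0,
    n_iter=60,   # more Griffin-Lim iterations -> fewer artifacts
)
sf.write("griffin_lim.wav", audio, 22050)
```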
Closing due to inactivity. Please re-open if needed.
I wrote this issue in response to this one.
The issue is that I am using an 8 GB GPU (NVIDIA GeForce GTX 980M), so I've brought the batch size down to 24 with the LJ Speech dataset. And after 4 days (I am now at 31k iterations), I do not see any improvement in the attention alignment. The loss has been stuck between 0.7 and 0.5 almost since the start of training.
At this point, do I have to wait longer, or do you consider it a failure?
Here are my curves and alignment plot at this point: Logs
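If the small batch turns out to be the bottleneck, one possible workaround (not something this repo implements) would be gradient accumulation, which emulates a larger effective batch on an 8 GB card; a minimal sketch with stand-in modules so it runs:

```python
# Minimal sketch (not from this repo): gradient accumulation, e.g.
# batch_size=24 with accumulation_steps=2 behaves roughly like an
# effective batch of 48. Stand-in model/data keep the sketch runnable.
import torch

model = torch.nn.Linear(80, 80)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = [(torch.randn(24, 80), torch.randn(24, 80)) for _ in range(4)]

accumulation_steps = 2
optimizer.zero_grad()
for i, (x, y) in enumerate(train_loader):
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```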