NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Hard to train on one small GPU #30

Closed yliess86 closed 6 years ago

yliess86 commented 6 years ago

I wrote this issue in response to this one.

The issue is that I am using an 8 GB GPU (NVIDIA GeForce GTX 980M), so I've brought the batch size down to 24 with the LJ Speech dataset. After 4 days (I am now at 31k iterations), I do not see any improvement in the attention alignment. The loss has been stuck between 0.7 and 0.5 almost since the start of training.

At this point, do I have to wait longer, or do you consider it a failure?

Here are my curves and alignment plot at this point: Logs alignment

rafaelvalle commented 6 years ago

Thank you for sharing this. Could you please share your hparams file and training curves? Did you train this with eight GTX 980Ms or just a single GTX 980M? Did you make any modifications to the model, hparams, or data?

yliess86 commented 6 years ago

The only modification I made to the hparams is the batch size, which is now 24. And yes, it is with only a single GPU.

I don't know if you can access my TensorBoard events.out file by clicking on the Logs URL: (Logs) If you prefer, I can take some screenshots tomorrow.

Here are the hparams:

        ################################
        # Experiment Parameters        #
        ################################
        epochs=500,
        iters_per_checkpoint=500,
        seed=1234,
        dynamic_loss_scaling=True,
        fp16_run=False,
        distributed_run=False,
        dist_backend="nccl",
        dist_url="file://distributed.dpt",
        cudnn_enabled=True,
        cudnn_benchmark=False,

        ################################
        # Data Parameters             #
        ################################
        load_mel_from_disk=False,
        training_files='filelists/ljs_audio_text_train_filelist.txt',
        validation_files='filelists/ljs_audio_text_val_filelist.txt',
        text_cleaners=['english_cleaners'],
        sort_by_length=False,

        ################################
        # Audio Parameters             #
        ################################
        max_wav_value=32768.0,
        sampling_rate=22050,
        filter_length=1024,
        hop_length=256,
        win_length=1024,
        n_mel_channels=80,
        mel_fmin=0.0,
        mel_fmax=None,  # if None, half the sampling rate

        ################################
        # Model Parameters             #
        ################################
        n_symbols=len(symbols),
        symbols_embedding_dim=512,

        # Encoder parameters
        encoder_kernel_size=5,
        encoder_n_convolutions=3,
        encoder_embedding_dim=512,

        # Decoder parameters
        n_frames_per_step=1,
        decoder_rnn_dim=1024,
        prenet_dim=256,
        max_decoder_steps=1000,
        gate_threshold=0.6,

        # Attention parameters
        attention_rnn_dim=1024,
        attention_dim=128,

        # Location Layer parameters
        attention_location_n_filters=32,
        attention_location_kernel_size=31,

        # Mel-post processing network parameters
        postnet_embedding_dim=512,
        postnet_kernel_size=5,
        postnet_n_convolutions=5,

        ################################
        # Optimization Hyperparameters #
        ################################
        learning_rate=1e-3,
        weight_decay=1e-6,
        grad_clip_thresh=1,
        batch_size=24,
        mask_padding=False  # set model's padded outputs to padded values
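
As an aside, this repo's train.py exposes a --hparams flag that takes comma-separated name=value pairs, so overrides like the batch size above can be passed on the command line instead of editing hparams.py. A minimal sketch of that style of parsing (parse_hparam_overrides is a hypothetical helper, not the repo's actual parser, and it does not handle values that themselves contain commas):

```python
import ast

def parse_hparam_overrides(override_str):
    """Parse a comma-separated "name=value" string into a dict of typed
    values, e.g. "batch_size=24,learning_rate=1e-3".  Hypothetical helper
    mimicking the style of train.py's --hparams flag."""
    overrides = {}
    if not override_str:
        return overrides
    for pair in override_str.split(","):
        name, _, raw = pair.partition("=")
        try:
            value = ast.literal_eval(raw)  # "24" -> int, "1e-3" -> float
        except (ValueError, SyntaxError):
            value = raw                    # fall back to a plain string
        overrides[name.strip()] = value
    return overrides

# e.g. shrinking the batch to fit an 8 GB GPU
print(parse_hparam_overrides("batch_size=24,learning_rate=1e-3"))
```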

rafaelvalle commented 6 years ago

Please do share an image of the training/validation/etc. curves.

rafaelvalle commented 6 years ago

Can you please pull from master and train again? On a single GPU with batch size 48, the model learns OK attention in 8.5k iterations. (screenshot: 2018-06-07)

yliess86 commented 6 years ago

I am finally starting to see improvement in the alignment (plot attached). I think it is just taking so much time because of the batch size (24) and because of my GPU. I will let you know when the model generates decent sounds, and I will share all my curves.

ZuoChenFttS commented 6 years ago

How many iterations did it take before you saw improvement in the alignment? And is there anything you think helps with learning alignment? I have trained for 29,000 iterations with batch size 64 on a single GPU. I decay the learning rate slowly from 0.001 to 0.0002; the other hparams are the same as in hparams.py. Here is my alignment plot at 29,200 iterations; it seems worse than yours at 31k iterations.

PS: I trained on the LJ Speech dataset too, and the loss is stuck between 0.6 and 0.5.
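
The commenter doesn't specify the exact schedule behind "decay slowly from 0.001 to 0.0002", but one common way to do it is to interpolate the learning rate exponentially over a fixed number of iterations. A small sketch of that idea (annealed_lr is a hypothetical helper, not code from this repo):

```python
def annealed_lr(step, total_steps, lr_start=1e-3, lr_end=2e-4):
    """Exponentially interpolate the learning rate from lr_start down to
    lr_end over total_steps iterations, then hold it at lr_end."""
    frac = min(step / total_steps, 1.0)
    return lr_start * (lr_end / lr_start) ** frac

print(annealed_lr(0, 29000))      # 0.001 at the start
print(annealed_lr(29000, 29000))  # ~0.0002 at the end
```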

yliess86 commented 6 years ago

I did not have any alignment either at 29k iterations; it only started to align after 31k, and showed 'good' alignment after 38k. As you can see, it was not good for me at 29k either. (29k plot attached)

rafaelvalle commented 6 years ago

@yliess86 is this using the changes made to master yesterday? @ZuoChenFttS Note that we made changes to master yesterday to speed up convergence. If you haven't pulled them, please do so and train again. On an NVIDIA Volta GPU with 16 GB and batch size 48, the current code on master learns OK attention at 8.5k iterations.

ZuoChenFttS commented 6 years ago

@rafaelvalle Thanks for your help. I trained from master last weekend. After 9,600 iterations, I got the alignment plot here; it seems to be starting to learn "good" alignment. I'll continue to train it. When the model generates good sounds, I will share an example audio clip and the curves.

yliess86 commented 6 years ago

@rafaelvalle No, this was not using the latest changes, since I started training 9 days ago. It is at about the same state as @ZuoChenFttS's now. Considering I need to make a prototype for my project, I will continue training my model with this version (I don't want to stop now after 9 days), but I will definitely retrain it with the new changes afterwards. Thank you.

rafaelvalle commented 6 years ago

Sounds good! We modified the embedding layer to be initialized with Xavier uniform, and this change makes the model learn attention much faster, at around 8k iterations. Your attention looks good! Just keep training, and anneal the learning rate once the slope of the training loss becomes flat.
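
For reference, Xavier (Glorot) uniform initialization draws weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). A numpy sketch of the math behind initializing an embedding table this way (the repo itself presumably does this on the nn.Embedding weight via PyTorch's init utilities; this is just an illustration):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform init: sample from U(-a, a) with
    a = sqrt(6 / (fan_in + fan_out)).  For an embedding table,
    fan_in is the vocabulary size and fan_out the embedding dim."""
    rng = rng or np.random.default_rng(0)
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

# e.g. a symbol embedding shaped like the hparams above:
# n_symbols x symbols_embedding_dim (100 is a placeholder vocab size)
emb = xavier_uniform(100, 512)
```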

ZuoChenFttS commented 6 years ago

@yliess86 I see you asked about loss in other issues. My validation loss at 19,600 iterations is 0.486099, about 100 epochs with batch size 64. What is your loss now? My model can now synthesize audio in which I can recognize the words, but the sound has a lot of noise. By the way, my synthesis code is like inference.ipynb, so I do not use the WaveNet vocoder.

yliess86 commented 6 years ago

@ZuoChenFttS Hi, I started training mine before the changes in master (without Xavier initialization), so my validation loss is 0.46 at 100,000 iterations with batch size 24. The audio is quite audible and you can easily understand what it says, but there is still noise, and when I use WaveNet it sounds like it has the flu. I don't know if we can achieve the same loss with this repo; Rayhane-mamah's implementation is a bit different, but I've read he achieved about 0.18 loss on the LJ Speech dataset. So maybe we have to train even more.

Here is an audio example if you want to listen: example It was generated with the checkpoint from iteration 70,000.

rafaelvalle commented 6 years ago

@ZuoChenFttS Can you share your audio sample, an image of the mel spectrogram, and the mel spectrogram file? Using the WaveNet decoder is very important for audio quality! @yliess86 the losses are not directly comparable, because Rayhane-Mamah uses a different mel representation from this repo. I don't remember exactly, but I think the demo sample on this repo was generated with a model at ~0.4 loss.

yliess86 commented 6 years ago

@rafaelvalle Thank you, I wasn't sure about this.

ZuoChenFttS commented 6 years ago

@rafaelvalle Here is an audio example and the mel_outputs file. This is not using the WaveNet decoder; I will try it later. text = 'The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.' english_23200iter.zip

plot pic: english_23200

rafaelvalle commented 6 years ago

@ZuoChenFttS Your audio has "noise" because it was synthesized with Griffin-Lim. You'll need WaveNet to reach state-of-the-art quality.
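
For context on that last point: Griffin-Lim recovers a waveform from a magnitude-only spectrogram by repeatedly resynthesizing audio and re-estimating phase, and the imperfect phase is what produces the characteristic noisy/metallic artifacts. A rough numpy/scipy sketch of the idea (window sizes chosen to match the hparams above; this is not the repo's actual synthesis code):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=1024, noverlap=768):
    """Iterate between the time and frequency domains, always reimposing
    the target magnitude `mag` (shape: freq_bins x frames) and refining
    only the phase (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        spec = spec[:, :mag.shape[1]]                    # guard off-by-one frames
        if spec.shape[1] < mag.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        phase = np.exp(1j * np.angle(spec))              # keep phase, drop magnitude
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x

# e.g. take the magnitude of a 440 Hz tone and reconstruct it without its phase
t = np.arange(22050) / 22050.0
_, _, spec = stft(np.sin(2 * np.pi * 440 * t), nperseg=1024, noverlap=768)
wav = griffin_lim(np.abs(spec))
```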

rafaelvalle commented 6 years ago

Closing due to inactivity. Please re-open if needed.