Closed begeekmyfriend closed 6 years ago
We've been able to train with datasets of ~10h. With smaller datasets, data augmentation, e.g. using different segments of the same file, and possibly regularization can help if the model is not generalizing...
Well, actually I am comparing Keith Ito's and Rayhane Mamah's projects to find out the key modification that determines how much data is needed. I will make a series of modifications, step by step, to bring Ito's training to convergence and Mamah's to non-convergence on the same 10h dataset. Just as Mamah said, it must be one of life's mysteries.
Thank you for doing this and please share your conclusions with the TTS community on these repos. In addition to model architecture, parameters and weight initialization, make sure you take a look at differences in data pre-processing, e.g. reducing amplitude variability in the dataset by scaling each wav file by its max value, etc.
I am glad to say I have transplanted some code from Rayhane Mamah's version into Keith Ito's one, and it now converges on a small dataset (10h) within 8K steps. I will release my fork in a few days, once I have found the minimal set of modifications, and I will notify you as well as Keith Ito at that time. But I can already share some clues about the modification: the architecture differs a little from Keith Ito's (https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-375676828). The `AttentionWrapper` class in TensorFlow is inappropriate for the Tacotron model, since the attention query should be the hidden states of the 2-layer decoder LSTM, while in `AttentionWrapper` the default query is the hidden state of an extra wrapped attention RNN. That attention RNN is redundant, because it does the same thing the decoder LSTM does. Moreover, the architecture described in the Tacotron paper shows that the query should be the hidden states of the decoder LSTM. That is why Rayhane Mamah substitutes `TacotronDecoderCell` for `AttentionWrapper`, which helps training converge.
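To make the query distinction concrete, here is a minimal NumPy sketch of one decoder step with additive (Bahdanau-style) attention, where the query is the decoder LSTM's own hidden state rather than the output of a separate wrapped attention RNN. This is an illustration, not code from either repo; all dimensions, weight names, and the random stand-in for the LSTM state are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory, W_q, W_m, v):
    """Additive attention: score each encoder timestep against the
    query, softmax into alignments, then form the context vector."""
    # memory: (T_enc, enc_dim), query: (dec_dim,)
    scores = np.tanh(memory @ W_m + query @ W_q) @ v   # (T_enc,)
    alignments = softmax(scores)                       # (T_enc,) sums to 1
    context = alignments @ memory                      # (enc_dim,)
    return context, alignments

rng = np.random.default_rng(0)
T_enc, enc_dim, dec_dim, attn_dim = 5, 8, 16, 4
memory = rng.normal(size=(T_enc, enc_dim))            # encoder outputs
W_m = rng.normal(size=(enc_dim, attn_dim))
W_q = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=(attn_dim,))

# One decoder step: the query is the hidden state of the decoder
# LSTM itself -- no extra attention RNN in front of it.
decoder_lstm_hidden = rng.normal(size=(dec_dim,))     # stand-in for the 2-layer LSTM output
context, alignments = attend(decoder_lstm_hidden, memory, W_q, W_m, v)
```

With `AttentionWrapper`, the `query` argument would instead be the state of the wrapped attention RNN; the change above is essentially what moving to a custom decoder cell buys you.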
Thanks for sharing these! We'll do some comparisons and update this repo accordingly.
Two more points. First, expand the range of the mel-spectrogram distribution. As we can see in https://github.com/Rayhane-mamah/Tacotron-2/issues/18#issuecomment-382637788, Rayhane expanded the range of the mel values and made the distribution symmetric. That is important for convergence, though he did not say so. Second, use cumulative attention states (a.k.a. cumulative alignments) as the location features instead of only the previous alignment. That is also indispensable for convergence on a small dataset. There may be points I have missed; I am still watching the results until the modifications are released.
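Both points can be sketched in a few lines of NumPy. The normalization below follows the symmetric-range scheme discussed in the linked issue (dB-scale mels mapped into [-max_abs, max_abs]); the constants and function name here are assumptions chosen for illustration, and the cumulative-alignment part just shows what "sum of all previous alignments" means:

```python
import numpy as np

# Symmetric normalization: map dB mel values from [MIN_LEVEL_DB, 0]
# into [-MAX_ABS_VALUE, MAX_ABS_VALUE], centered on zero.
# Constant values are illustrative, not taken from either repo.
MIN_LEVEL_DB = -100.0
MAX_ABS_VALUE = 4.0

def normalize_mel(S):
    scaled = (2.0 * MAX_ABS_VALUE) * ((S - MIN_LEVEL_DB) / -MIN_LEVEL_DB) - MAX_ABS_VALUE
    return np.clip(scaled, -MAX_ABS_VALUE, MAX_ABS_VALUE)

S = np.array([-100.0, -50.0, 0.0])   # dB-scale mel values
print(normalize_mel(S))              # symmetric about zero: [-4.  0.  4.]

# Cumulative attention: the location feature fed to the next decoder
# step is the running *sum* of all past alignments, not just the last.
alignments_per_step = [np.array([0.7, 0.2, 0.1]),
                       np.array([0.2, 0.6, 0.2])]
cumulative = np.sum(alignments_per_step, axis=0)
```

The cumulative sum tells the location-sensitive attention which encoder positions have already been attended to, which discourages it from jumping backwards or skipping ahead.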
Thanks for sharing these, Leo. While you do the experiments, it would be great if you could also annotate how important each of these things is.
I have conducted a series of ablation studies to find the reasons, and here are the modifications based on Keith Ito's version: https://github.com/keithito/tacotron/issues/170. It confirms to me that the decoder architecture and the normalization of the mel-spectrograms are the key components that determine whether we can reach convergence with less data.
Thanks a lot for doing this work, @begeekmyfriend. I think that the information loss from the dropout in the prenet layer can be the main reason for slow convergence of attention.
FYI: in a personal communication, the Tacotron 2 authors have told me that "The attention in Taco2 is also stateful. It uses the first decoder lstm output as the query vector.". In the paper they say that "We found that the pre-net acting as an information bottleneck was essential for learning attention."
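The pre-net "information bottleneck" amounts to dense+ReLU layers whose dropout stays active even at inference time. The following NumPy sketch is an illustration only (layer sizes, drop rate, and initialization are assumptions, not values from any of the repos discussed here):

```python
import numpy as np

def prenet(x, layers, rng, drop_rate=0.5):
    """Pre-net sketch: dense + ReLU layers with dropout that remains
    active at inference time too -- the information bottleneck the
    Tacotron 2 authors describe as essential for learning attention."""
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)           # dense + ReLU
        mask = rng.random(x.shape) >= drop_rate  # dropout is always on
        x = x * mask / (1.0 - drop_rate)         # inverted-dropout scaling
    return x

rng = np.random.default_rng(0)
n_mels, hidden = 80, 256                          # illustrative sizes
layers = [(rng.normal(scale=0.1, size=(n_mels, hidden)), np.zeros(hidden)),
          (rng.normal(scale=0.1, size=(hidden, hidden)), np.zeros(hidden))]
frame = rng.normal(size=(n_mels,))                # one previous mel frame
out = prenet(frame, layers, rng)
```

Because the dropout noise persists at synthesis time, the decoder cannot simply copy the previous frame through the pre-net and is pushed to rely on the attention context instead.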
Are you interested in putting up a PR with these modifications to this repo?
@begeekmyfriend please pull from the branch and let us know what your convergence time is. It attends to the full mel instead of the dropped-out pre-net mel: this should speed up convergence considerably. https://github.com/NVIDIA/tacotron2/tree/attention_full_mel
I am not familiar with PyTorch yet. My code is based on TensorFlow for now...
Closing this issue, given that in our implementation attending to the decoder's output instead of the dropped-out mels produces the same effect. Please re-open it if there are things to be discussed.
I find that different implementations need different amounts of data to reach convergence: https://github.com/Rayhane-mamah/Tacotron-2/issues/35. I wonder what the least amount is for your implementation?