Closed begeekmyfriend closed 6 years ago
We've been able to train with datasets of ~10h. With smaller datasets, data augmentation, e.g. using different segments of the same file, and possibly regularization can help if the model is not generalizing...
Well, actually I am comparing Keith Ito's and Rayhane Mamah's projects to find out the key modification that determines how much data is needed. I will make a series of modifications, step by step, to bring Ito's training to convergence and Mamah's to non-convergence on the same 10h dataset. Just as Mamah said, it must be one of life's mysteries.
Thank you for doing this and please share your conclusions with the TTS community on these repos. In addition to model architecture, parameters and weight initialization, make sure you take a look at differences in data pre-processing, e.g. reducing amplitude variability in the dataset by scaling each wav file by its max value, etc.
I am glad to say I have transplanted some code from Rayhane Mamah's version into Keith Ito's one, and it now converges on a small dataset (10h) within 8K steps. I will release my fork in a few days, once I have found the minimal set of modifications, and I will notify you as well as Keith Ito at that time. But I can already share some clues about the modification: the architecture differs a little from Keith Ito's (https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-375676828). The `AttentionWrapper` class in TensorFlow is inappropriate for the Tacotron model, since the attention query should be the hidden states of the 2-layer decoder LSTM, while in `AttentionWrapper` the default query is the hidden state of an extra wrapped attention RNN. That attention RNN is redundant, because it does the same thing the decoder LSTM does. Moreover, the architecture described in the Tacotron paper shows that the query should be the hidden states of the decoder LSTM. That is why Rayhane Mamah substitutes `TacotronDecoderCell` for `AttentionWrapper`, which helps training converge.
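To make the query distinction concrete, here is a minimal NumPy sketch of one decoder step with additive (Bahdanau-style) attention, where the query is the decoder LSTM's own hidden state rather than the output of a separate wrapped attention RNN. This is an illustration, not code from either repo; all dimensions, weight names, and the random stand-in for the LSTM state are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory, W_q, W_m, v):
    """Additive attention: score each encoder timestep against the
    query, softmax into alignments, then form the context vector."""
    # memory: (T_enc, enc_dim), query: (dec_dim,)
    scores = np.tanh(memory @ W_m + query @ W_q) @ v   # (T_enc,)
    alignments = softmax(scores)                       # (T_enc,) sums to 1
    context = alignments @ memory                      # (enc_dim,)
    return context, alignments

rng = np.random.default_rng(0)
T_enc, enc_dim, dec_dim, attn_dim = 5, 8, 16, 4
memory = rng.normal(size=(T_enc, enc_dim))            # encoder outputs
W_m = rng.normal(size=(enc_dim, attn_dim))
W_q = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=(attn_dim,))

# One decoder step: the query is the hidden state of the decoder
# LSTM itself -- no extra attention RNN in front of it.
decoder_lstm_hidden = rng.normal(size=(dec_dim,))     # stand-in for the 2-layer LSTM output
context, alignments = attend(decoder_lstm_hidden, memory, W_q, W_m, v)
```

With `AttentionWrapper`, the `query` argument would instead be the state of the wrapped attention RNN; the change above is essentially what moving to a custom decoder cell buys you.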
Thanks for sharing these! We'll do some comparisons and update this repo accordingly.
Two more points. First, expand the range of the mel-spectrogram distribution. As we can see in https://github.com/Rayhane-mamah/Tacotron-2/issues/18#issuecomment-382637788, Rayhane expanded the range of the mel values and made the distribution symmetric. That is important for convergence, though he did not say so. Second, use cumulative attention states (a.k.a. cumulative alignments) as the location features instead of only the previous alignment. That is also indispensable for convergence on a small dataset. There may be points I have missed; I am still watching the results until the modifications are released.
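Both points can be sketched in a few lines of NumPy. The normalization below follows the symmetric-range scheme discussed in the linked issue (dB-scale mels mapped into [-max_abs, max_abs]); the constants and function name here are assumptions chosen for illustration, and the cumulative-alignment part just shows what "sum of all previous alignments" means:

```python
import numpy as np

# Symmetric normalization: map dB mel values from [MIN_LEVEL_DB, 0]
# into [-MAX_ABS_VALUE, MAX_ABS_VALUE], centered on zero.
# Constant values are illustrative, not taken from either repo.
MIN_LEVEL_DB = -100.0
MAX_ABS_VALUE = 4.0

def normalize_mel(S):
    scaled = (2.0 * MAX_ABS_VALUE) * ((S - MIN_LEVEL_DB) / -MIN_LEVEL_DB) - MAX_ABS_VALUE
    return np.clip(scaled, -MAX_ABS_VALUE, MAX_ABS_VALUE)

S = np.array([-100.0, -50.0, 0.0])   # dB-scale mel values
print(normalize_mel(S))              # symmetric about zero: [-4.  0.  4.]

# Cumulative attention: the location feature fed to the next decoder
# step is the running *sum* of all past alignments, not just the last.
alignments_per_step = [np.array([0.7, 0.2, 0.1]),
                       np.array([0.2, 0.6, 0.2])]
cumulative = np.sum(alignments_per_step, axis=0)
```

The cumulative sum tells the location-sensitive attention which encoder positions have already been attended to, which discourages it from jumping backwards or skipping ahead.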
Thanks for sharing these, Leo. While you do the experiments, it would be great if you could also annotate how important each of these things is.
I have conducted a series of ablation studies to find the reasons, and here are the modifications based on Keith Ito's version: https://github.com/keithito/tacotron/issues/170. It confirms to me that the decoder architecture and the normalization of the mel-spectrograms are the key components that determine whether we can reach convergence with less data.
Thanks a lot for doing this work, @begeekmyfriend. I think that the information loss from the dropout in the prenet layer can be the main reason for slow convergence of attention.
FYI: in a personal communication, the Tacotron 2 authors have told me that "The attention in Taco2 is also stateful. It uses the first decoder lstm output as the query vector.". In the paper they say that "We found that the pre-net acting as an information bottleneck was essential for learning attention."
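The pre-net "information bottleneck" amounts to dense+ReLU layers whose dropout stays active even at inference time. The following NumPy sketch is an illustration only (layer sizes, drop rate, and initialization are assumptions, not values from any of the repos discussed here):

```python
import numpy as np

def prenet(x, layers, rng, drop_rate=0.5):
    """Pre-net sketch: dense + ReLU layers with dropout that remains
    active at inference time too -- the information bottleneck the
    Tacotron 2 authors describe as essential for learning attention."""
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)           # dense + ReLU
        mask = rng.random(x.shape) >= drop_rate  # dropout is always on
        x = x * mask / (1.0 - drop_rate)         # inverted-dropout scaling
    return x

rng = np.random.default_rng(0)
n_mels, hidden = 80, 256                          # illustrative sizes
layers = [(rng.normal(scale=0.1, size=(n_mels, hidden)), np.zeros(hidden)),
          (rng.normal(scale=0.1, size=(hidden, hidden)), np.zeros(hidden))]
frame = rng.normal(size=(n_mels,))                # one previous mel frame
out = prenet(frame, layers, rng)
```

Because the dropout noise persists at synthesis time, the decoder cannot simply copy the previous frame through the pre-net and is pushed to rely on the attention context instead.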
Are you interested in putting up a PR with these modifications to this repo?
@begeekmyfriend please pull from the branch and let us know what your convergence time is. It attends to the full mel instead of the dropped-out pre-net mel: this should speed up convergence considerably. https://github.com/NVIDIA/tacotron2/tree/attention_full_mel
I am not familiar with PyTorch yet. My code is based on TensorFlow for now...
Closing this issue, given that in our implementation attending to the decoder's output instead of the dropped-out mels produces the same effect. Please re-open it if there are things to be discussed.
I find that different implementations need different amounts of data to reach convergence: https://github.com/Rayhane-mamah/Tacotron-2/issues/35. I wonder what the least amount is for your implementation?