Closed: jefflai108 closed this issue 2 years ago.

Hi,
What's the best way to train a TTS model from scratch with my own phone sequence (not with the provided G2P)?
Thanks in advance!
I assume that you already know how to prepare a recipe for ESPnet2.
If you can prepare phonemized text, it is easy: just prepare data/hogehoge/text with the phonemized sentences split by spaces (see the following example):
https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#supported-text-frontend
Then you can use token_type=phn, g2p=none, and cleaner=none.
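For concreteness, here is a minimal sketch of that setup (the utterance IDs and phone strings below are placeholders; check the template's run.sh/tts.sh for the exact option names):

```sh
# data/hogehoge/text: one utterance per line, "<utt_id> <phones separated by spaces>", e.g.
#   utt0001 HH AH L OW W ER L D
#   utt0002 G UH D M AO R N IH NG

# Train on the pre-phonemized text with the built-in frontend disabled:
./run.sh --token_type phn --g2p none --cleaner none
```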
Great, thanks for the prompt response @kan-bayashi! This is also what I had in mind!
Hi @kan-bayashi, a follow-up: I trained a Transformer-TTS model on my own (English) phone set. However, the synthesized speech sounds terrible. I did not change anything in the training/decoding/model configurations. In your experience, do you think this is due to the new phone set itself, or should I try tuning the training/decoding configurations for this new phone set?
Thank you!
Let me confirm some points: Did you check the attention plot, and is it diagonal? Did you try training with the standard G2P? If you can share an example of data/<your_train_set>/text, I may give you some insight.

Thanks for the prompt response! Let me check the attention plot and try training with the standard G2P. More specifically, I am training on a reduced English phone set without punctuation, while the default is the standard phone set with punctuation. Below are examples of my phone transcript (the comparison run is sketched after the examples). One speculation I have is that, given the reduced annotation, the model needs to be trained longer.
```
LJ050-0271 DH AH D IH M AE N D Z AA N DH AH P R EH Z AH D EH N T IH N DH AH EH K S AH K Y UW SH AH N AH V HH IH Z R IY S P AA N S AH B IH L AH T IY Z IH N T AH D EY Z W ER L D AA R S OW V EH R IY D AH N D K AA M P L EH K S
LJ050-0272 AH N D DH AH T R AH D IH SH AH N Z AH V DH AH AO F AH S IH N AH D IH M AA K R AH S IY S AH CH AE Z AW ER Z AA R S OW D IY P S IY T AH D AE Z T UW P R IH K L UW D AE B S AH L UW T S IH K Y UH R AH T IY
LJ050-0273 DH AH K AH M IH SH AH N HH AE Z HH AW EH V ER F R AH M IH T S IH G Z AE M AH N EY SH AH N AH V DH AH F AE K T S AH V P R EH Z AH D EH N T K EH N AH D IY Z AH S AE S AH N EY SH AH N
LJ050-0274 M EY D S ER T AH N R EH K AH M AH N D EY SH AH N Z W IH CH IH T B IH L IY V Z W UH D IH F AH D AA P T AH D
LJ050-0275 M AH T IH R IY AH L IY IH M P R UW V AH P AA N DH AH P R AH S IY JH ER Z IH N IH F EH K T AE T DH AH T AY M AH V P R EH Z AH D EH N T K EH N AH D IY Z AH S AE S AH N EY SH AH N AH N D R IH Z AH L T IH N AH S AH B S T AE N CH AH L L EH S AH N IH NG AH V DH AH D EY N JH ER
LJ050-0277 W IH DH DH AH AE K T IH V K OW AA P ER EY SH AH N AH V DH AH R IY S P AA N S AH B AH L EY JH AH N S IY Z AH N D W IH DH DH AH AH N D ER S T AE N D IH NG AH V DH AH P IY P AH L AH V DH AH Y UW N AY T AH D S T EY T S IH N DH EH R D IH M AE N D Z AH P AA N DH EH R P R EH Z AH D EH N T
```
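For the comparison run with the standard G2P, something like the following should work (a sketch: g2p_en and the tacotron cleaner are the usual English frontend choices in the ESPnet2 TTS template, but the recipe's actual defaults should be checked):

```sh
# Baseline run using ESPnet's built-in English G2P instead of pre-phonemized text:
./run.sh --token_type phn --g2p g2p_en --cleaner tacotron
```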
Thank you for sharing. The format itself seems fine, but it sounds like a somewhat difficult setting. If the model cannot get diagonal attentions, maybe you can tune the parameters related to the guided attention loss (especially lambda?): https://github.com/espnet/espnet/blob/1b248b0d74fb0e8f22d3894292d0f4838dd5a626/egs2/ljspeech/tts1/conf/tuning/train_transformer.yaml#L49-L54
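The linked lines cover the guided-attention options; roughly, the block looks like this (an illustrative sketch: the key names are the ones used in this thread, the lambda default of 10 is mentioned later in the thread, and the remaining values are assumptions to verify against the linked config):

```yaml
tts_conf:
    use_guided_attn_loss: true          # penalize non-diagonal encoder-decoder attention
    guided_attn_loss_sigma: 0.4         # width of the allowed diagonal band (assumed value)
    guided_attn_loss_lambda: 10.0       # loss weight (default per this thread)
    num_heads_applied_guided_attn: 2    # heads per layer to constrain (assumed value)
    num_layers_applied_guided_attn: 2   # layers to constrain (assumed value)
```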
Hi @kan-bayashi, I assume I should increase the weighting for guided_attn_loss_lambda if the model does not pick up diagonal attention?
Right.
Hi @kan-bayashi, another follow-up regarding the guided attention loss: should I expect all the attention heads to have diagonal attention (i.e., can I set num_heads_applied_guided_attn=8 and num_layers_applied_guided_attn=6)?
An update on this issue: I was able to successfully train a Tacotron 2 model on my English phone set. The decoded outputs show clear diagonal attention as well. Therefore, the Transformer-TTS training is indeed the issue.
> Should I expect all the attention heads to have diagonal attention (i.e. can I set num_heads_applied_guided_attn=8 and num_layers_applied_guided_attn=6)?
I don't think all heads need to be diagonal. This is an example of the attention weights in decoding.
Hi @kan-bayashi, got it. In that case, let me try simply increasing the weighting for the guided attention loss.
Hi @kan-bayashi, just a follow-up on this thread: I found that forcing all Transformer attention heads to be diagonal does make the synthesized waveforms sound more intelligible. I also tried increasing the guided attention loss weight over [15, 20, 25] (the default is 10) as you suggested, but those runs do not sound as good as simply forcing all heads to be diagonal with the default weight. Below are the attention weights in decoding:

By the way, I also agree with you that different attention heads should display different patterns (as I have observed in ASR too). Therefore, I am surprised that this works. Any possible explanation?
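For reference, the all-heads-diagonal setting described above corresponds to an override like the following (a sketch assuming the 8-head, 6-layer architecture discussed in this thread; everything else stays at its default):

```yaml
tts_conf:
    guided_attn_loss_lambda: 10.0       # keep the default weight
    num_heads_applied_guided_attn: 8    # constrain all attention heads
    num_layers_applied_guided_attn: 6   # constrain all layers
```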
Thank you for sharing these interesting observations. Actually, I have never tried the case where all heads are forced to be diagonal. Maybe because, in the Transformer, self-attention and source-target attention blocks are stacked repeatedly, the self-attention part can still capture broader local context. In other words, at least one diagonal head may be needed to generate reasonable speech, but forcing more heads to be diagonal does not have much of a negative effect? I am not sure whether there is any study on this.