Closed: jefflai108 closed this issue 2 years ago.

Hi,
What's the best way to train a TTS model from scratch with my own phone sequence (not with the provided G2P)?
Thanks in advance!
I assume that you already know how to prepare a recipe for ESPnet2.
If you can prepare phonemized text, it is easy: just prepare data/hogehoge/text with the phonemized sentences split by spaces (see the following example):
https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#supported-text-frontend
Then you can use token_type=phn, g2p=none, and cleaner=none.
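For concreteness, here is a minimal sketch of that setup (the utterance IDs and phone strings below are placeholders; check the template's run.sh/tts.sh for the exact option names):

```sh
# data/hogehoge/text: one utterance per line, "<utt_id> <phones separated by spaces>", e.g.
#   utt0001 HH AH L OW W ER L D
#   utt0002 G UH D M AO R N IH NG

# Train on the pre-phonemized text with the built-in frontend disabled:
./run.sh --token_type phn --g2p none --cleaner none
```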
Great, thanks for the prompt response @kan-bayashi! This is also what I had in mind!
Hi @kan-bayashi, a follow-up: I trained a Transformer-TTS model on my own (English) phone set. However, the synthesized speech sounds terrible. I did not change anything in the training/decoding/model configurations. In your experience, do you think this is due to the new phone set itself, or should I try tuning the training/decoding configurations for this new phone set?
Thank you!
Let me confirm some points: Did you check the attention plot, and is it diagonal? Did you try training with the standard G2P? If you can share an example of data/<your_train_set>/text, I may give you some insight.

Thanks for the prompt response! Let me check the attention plot and try training with the standard G2P. More specifically, I am training on a reduced English phone set without punctuation, while the default is the standard phone set with punctuation. Below are examples of my phone transcript (the comparison run is sketched after the examples). One speculation I have is that, given the reduced annotation, the model needs to be trained longer.
```
LJ050-0271 DH AH D IH M AE N D Z AA N DH AH P R EH Z AH D EH N T IH N DH AH EH K S AH K Y UW SH AH N AH V HH IH Z R IY S P AA N S AH B IH L AH T IY Z IH N T AH D EY Z W ER L D AA R S OW V EH R IY D AH N D K AA M P L EH K S
LJ050-0272 AH N D DH AH T R AH D IH SH AH N Z AH V DH AH AO F AH S IH N AH D IH M AA K R AH S IY S AH CH AE Z AW ER Z AA R S OW D IY P S IY T AH D AE Z T UW P R IH K L UW D AE B S AH L UW T S IH K Y UH R AH T IY
LJ050-0273 DH AH K AH M IH SH AH N HH AE Z HH AW EH V ER F R AH M IH T S IH G Z AE M AH N EY SH AH N AH V DH AH F AE K T S AH V P R EH Z AH D EH N T K EH N AH D IY Z AH S AE S AH N EY SH AH N
LJ050-0274 M EY D S ER T AH N R EH K AH M AH N D EY SH AH N Z W IH CH IH T B IH L IY V Z W UH D IH F AH D AA P T AH D
LJ050-0275 M AH T IH R IY AH L IY IH M P R UW V AH P AA N DH AH P R AH S IY JH ER Z IH N IH F EH K T AE T DH AH T AY M AH V P R EH Z AH D EH N T K EH N AH D IY Z AH S AE S AH N EY SH AH N AH N D R IH Z AH L T IH N AH S AH B S T AE N CH AH L L EH S AH N IH NG AH V DH AH D EY N JH ER
LJ050-0277 W IH DH DH AH AE K T IH V K OW AA P ER EY SH AH N AH V DH AH R IY S P AA N S AH B AH L EY JH AH N S IY Z AH N D W IH DH DH AH AH N D ER S T AE N D IH NG AH V DH AH P IY P AH L AH V DH AH Y UW N AY T AH D S T EY T S IH N DH EH R D IH M AE N D Z AH P AA N DH EH R P R EH Z AH D EH N T
```
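For the comparison run with the standard G2P, something like the following should work (a sketch: g2p_en and the tacotron cleaner are the usual English frontend choices in the ESPnet2 TTS template, but the recipe's actual defaults should be checked):

```sh
# Baseline run using ESPnet's built-in English G2P instead of pre-phonemized text:
./run.sh --token_type phn --g2p g2p_en --cleaner tacotron
```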
Thank you for sharing. The format itself seems fine, but it sounds like a somewhat difficult setting. If the model cannot get diagonal attentions, maybe you can tune the parameters related to the guided attention loss (especially lambda?): https://github.com/espnet/espnet/blob/1b248b0d74fb0e8f22d3894292d0f4838dd5a626/egs2/ljspeech/tts1/conf/tuning/train_transformer.yaml#L49-L54
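The linked lines cover the guided-attention options; roughly, the block looks like this (an illustrative sketch: the key names are the ones used in this thread, the lambda default of 10 is mentioned later in the thread, and the remaining values are assumptions to verify against the linked config):

```yaml
tts_conf:
    use_guided_attn_loss: true          # penalize non-diagonal encoder-decoder attention
    guided_attn_loss_sigma: 0.4         # width of the allowed diagonal band (assumed value)
    guided_attn_loss_lambda: 10.0       # loss weight (default per this thread)
    num_heads_applied_guided_attn: 2    # heads per layer to constrain (assumed value)
    num_layers_applied_guided_attn: 2   # layers to constrain (assumed value)
```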
Hi @kan-bayashi, I assume I should increase the weighting for guided_attn_loss_lambda if the model does not pick up diagonal attention?
Right.
Hi @kan-bayashi, another follow-up regarding the guided attention loss: should I expect all the attention heads to have diagonal attention (i.e., can I set num_heads_applied_guided_attn=8 and num_layers_applied_guided_attn=6)?
An update on this issue: I was able to successfully train a Tacotron 2 model on my English phone set. The decoded outputs show clear diagonal attention as well. Therefore, the Transformer-TTS training is indeed the issue.
> Should I expect all the attention heads to have diagonal attention (i.e. can I set num_heads_applied_guided_attn=8 and num_layers_applied_guided_attn=6)?
I don't think all heads need to be diagonal. This is an example of the attention weights in decoding.
Hi @kan-bayashi, got it. In that case, let me try simply increasing the weighting for the guided attention loss.
Hi @kan-bayashi, just a follow-up on this thread: I found that forcing all Transformer attention heads to be diagonal does make the synthesized waveforms sound more intelligible. I also tried increasing the guided attention loss weight over [15, 20, 25] (the default is 10) as you suggested, but those runs do not sound as good as simply forcing all heads to be diagonal with the default weight. Below are the attention weights in decoding:

By the way, I also agree with you that different attention heads should display different patterns (as I have observed in ASR too). Therefore, I am surprised that this works. Any possible explanation?
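For reference, the all-heads-diagonal setting described above corresponds to an override like the following (a sketch assuming the 8-head, 6-layer architecture discussed in this thread; everything else stays at its default):

```yaml
tts_conf:
    guided_attn_loss_lambda: 10.0       # keep the default weight
    num_heads_applied_guided_attn: 8    # constrain all attention heads
    num_layers_applied_guided_attn: 6   # constrain all layers
```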
Thank you for sharing these interesting observations. Actually, I have never tried the case where all heads are forced to be diagonal. Maybe because, in the Transformer, self-attention and source-target attention blocks are stacked repeatedly, the self-attention part can still capture broader local context. In other words, at least one diagonal head may be needed to generate reasonable speech, but forcing more heads to be diagonal does not have much of a negative effect? I am not sure whether there is any study on this.